

THEORETICAL AND COMPUTATIONAL STUDIES

OF BAYESIAN LINEAR MODELS

A Dissertation Submitted to the Faculty of

The Graduate School

Baylor College of Medicine

In Partial Fulfillment of the

Requirements for the Degree

of

Doctor of Philosophy

by

QUAN ZHOU

Houston, Texas

March 16, 2017


APPROVED BY THE DISSERTATION COMMITTEE

Signed

Yongtao Guan, Ph.D., Chairman

Rui Chen, Ph.D.

Dennis Cox, Ph.D.

Chris Man, Ph.D.

Michael Schweinberger, Ph.D.

APPROVED BY THE STRUCTURAL AND COMPUTATIONAL

BIOLOGY & MOLECULAR BIOPHYSICS GRADUATE PROGRAM

Signed

Aleksandar Milosavljevic, Ph.D., Director of SCBMB

APPROVED BY THE INTERIM DEAN OF

GRADUATE BIOMEDICAL SCIENCES

Signed

Adam Kuspa, Ph.D.

Date


Acknowledgements

First of all, I want to express my deepest gratitude to my advisor Dr. Yongtao

Guan and all the other members of our group for their tremendous help in every

aspect of my life throughout the past 5 years. The completion of my degree

projects should be attributed to Dr. Guan’s amazing ideas, selfless support and

unceasing hard work. From him I learned programming and statistical skills,

enthusiasm and devotion to science and how to conduct academic research. My

sincere thanks also go to other group members, Hang Dai, Zhihua Qi, Hanli Xu

and Liang Zhao, for their kindness, friendship and encouragement.

Second, I want to thank my thesis committee, Dr. Rui Chen, Dr. Dennis

Cox, Dr. Chris Man and Dr. Michael Schweinberger, for their amiable, generous

and continuous academic help and guidance. Special thanks to Dr. Cox for his

excellent courses (Mathematical Statistics I and II, Stochastic Process, Multivari-

ate Analysis, Functional Data Analysis) which opened a new world for me. In

addition, I want to thank Dr. Philip Ernst, who has collaborated with me on a

probability paper and offered me priceless opportunities like giving lectures for

the course Mathematical Probability.

Third, I am truly grateful to my program SCBMB and the Graduate School of

BCM, which have always supported me in choosing my research and learning the skills

that I like. Without the learning environment they have created, I could have

achieved almost nothing. I also want to thank Rice University for the courses I took

over four years and its hospitality to a visiting student like me.

Last, but by no means least, I want to thank my parents, my friends and

teachers in and outside of Houston, as they have filled my PhD journey with hope and

happiness.

Dedication

This thesis is dedicated to my family, especially my grandfather

and grandmother who have passed away during my PhD study. To their love I

shall be immensely and forever indebted.


Abstract

Statistical methods have been extensively applied to genome-wide association

studies to demystify the genetic architecture of many common complex diseases in

the past decade. Bayesian methods, though not as popular as traditional methods,

have been used for various purposes, like association testing, causal SNP identi-

fication, heritability estimation and genotype imputation. This work focuses on

the Bayesian methods based on linear regression.

Bayesian hypothesis testing reports a (null-based) Bayes factor instead of a

p-value. For linear regression, it is shown in Chap. 2.1 that under the null model

of no effect, 2 log(Bayes factor) is asymptotically distributed as a weighted sum of

independent χ²₁ random variables. The weights are all between 0 and 1. Similarly,

under the alternative model with some necessary conditions on the effect size,

2 log(Bayes factor) is asymptotically distributed as a weighted sum of indepen-

dent noncentral chi-squared random variables. An immediate benefit is that the

p-values associated with the Bayes factors can be analytically computed rather

than by permutation, which is of vital importance in genome-wide association

studies. Due to multiple testing, in whole-genome studies the significance thresh-

old is extremely small and thus permutation is in fact impractical. Furthermore,

the asymptotic results help explain the behaviour of the Bayes factor and the

origin of some well-known paradoxes, like Bartlett’s paradox (Chap. 2.2). Lastly,

in light of this null distribution, a new statistic named the scaled Bayes factor is

proposed. It is defined via a rescaling of the Bayes factor so that the expectation

of log(scaled Bayes factor) is fixed at zero (or some other constant). In Chap. 5.1

its practical and theoretical benefits are discussed. Chap. 5.2 describes an appli-

cation of the scaled Bayes factor to the analysis of a real whole-genome dataset

for intraocular pressure.
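Under the asymptotic null distribution above, one simple reading of the rescaling is to subtract the null expectation of log(Bayes factor), which a quick simulation can illustrate (the weights here are made up, and this sketch does not reproduce the calibrated scaling developed in Chap. 5.3):

```python
import numpy as np

# Illustration only: draw 2*log(BF) from its asymptotic null distribution
# (a weighted sum of chi2_1 variables with hypothetical weights), then
# subtract the null mean of log(BF) so that E[log(sBF)] = 0.
rng = np.random.default_rng(4)
weights = np.array([0.9, 0.5, 0.2])      # hypothetical weights in (0, 1)
two_log_bf = rng.chisquare(df=1, size=(100_000, weights.size)) @ weights
log_bf = two_log_bf / 2.0
# E[chi2_1] = 1, so E[2 log BF] = sum of the weights under the null
log_sbf = log_bf - weights.sum() / 2.0
```

The sample mean of `log_sbf` is then close to zero, matching the defining property of the scaled Bayes factor.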

For multiple linear regression, the computation of the p-value associated with the

Bayes factor requires the evaluation of the distribution function of a weighted sum


of independent χ²₁ random variables. We implemented in C++ a recent polynomial

method of Bausch [2013], which appears to be the most efficient solution so far

(Chap. 2.3.2 and 2.3.3). Simulation studies (Chap. 2.3.4) show that the p-values

computed according to the asymptotic null distribution have very good calibration,

even for very large Bayes factors, validating the use of this method in genome-wide

association studies.
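As an illustration of what such a computation involves, the tail probability of a weighted sum of independent χ²₁ variables can be estimated by brute-force Monte Carlo (a sketch with made-up weights and threshold; the thesis itself uses Bausch's exact polynomial method, which this does not reproduce):

```python
import numpy as np

# Monte Carlo estimate of P(sum_i w_i * chi2_1 > t) for hypothetical
# weights in (0, 1] and an arbitrary threshold t.
rng = np.random.default_rng(1)
weights = np.array([1.0, 0.7, 0.3])
t = 5.0

# each row holds one independent draw of the three chi2_1 variables
draws = rng.chisquare(df=1, size=(200_000, weights.size))
p_mc = float(np.mean(draws @ weights > t))
```

Monte Carlo is fine for moderate tail probabilities, but it cannot reach the extremely small significance thresholds of a genome-wide scan, which is exactly why exact or near-exact methods are needed.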

The expression of the Bayes factor for linear regression contains the posterior

mean estimator for the regression coefficient, which is also called the ridge esti-

mator by non-Bayesians. When XᵗX is available (X denotes the design matrix),

ridge estimators are usually computed via the Cholesky decomposition of the ma-

trix XᵗX + cI, which is efficient but still has cubic complexity in the number of

regressors. A new iterative method, called ICF (iterative solutions using complex

factorization), is proposed in Chap. 3. It assumes that the Cholesky decomposi-

tion of XᵗX is already obtained. Simulation (Chap. 3.5) shows that, when ICF is

applicable, it is much better than the Cholesky decomposition and other iterative

methods like the Gauss-Seidel algorithm.
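For reference, the standard Cholesky route for the ridge estimator described above can be sketched as follows (synthetic data; this is the baseline that ICF is compared against, not the ICF algorithm itself):

```python
import numpy as np

# Ridge estimator beta = (X^t X + c I)^{-1} X^t y via Cholesky:
# factor A = L L^t once, then do two triangular solves.
rng = np.random.default_rng(0)
n, p, c = 50, 5, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

A = X.T @ X + c * np.eye(p)
L = np.linalg.cholesky(A)             # A = L L^t, O(p^3) work
z = np.linalg.solve(L, X.T @ y)       # forward solve: L z = X^t y
beta_ridge = np.linalg.solve(L.T, z)  # back solve: L^t beta = z
```

The factorization costs O(p³) and, as noted above, the factor of XᵗX + cI cannot be reused when c changes, which is the overhead ICF is designed to avoid.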

The ICF algorithm fits perfectly with the Bayesian variable selection regres-

sion proposed by Guan and Stephens [2011] since in MCMC, the Cholesky de-

composition of XᵗX can be obtained by efficient updating algorithms (but not

for XᵗX + cI if c is changing). A reimplementation of their method using ICF

substantially improves the efficiency of posterior inferences (Chap. 4.3). Sim-

ulation studies (Chap. 4.4) show that the new method can efficiently estimate

the heritability of a quantitative trait and report well-calibrated posterior inclu-

sion probabilities. Furthermore, compared with another popular software package

GCTA (Chap. 7.5), the new method has much better performance in prediction

(Chap. 4.4.3).
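The append-a-column case of such a Cholesky update can be sketched as follows (a minimal illustration under the assumption that the enlarged XᵗX stays positive definite; the actual fastBVSR implementation details may differ):

```python
import numpy as np

# Grow the Cholesky factor of X^t X when a column x_new enters the model,
# avoiding an O(p^3) refactorization. If X^t X = L L^t, then appending
# x_new gives the bordered factor
#   [L   0]        with  L w = X^t x_new  and  d = sqrt(x.x - w.w).
#   [w^t d]
rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.standard_normal((n, p))
x_new = rng.standard_normal(n)

L = np.linalg.cholesky(X.T @ X)   # current factor
v = X.T @ x_new
w = np.linalg.solve(L, v)         # triangular solve, O(p^2)
d = np.sqrt(x_new @ x_new - w @ w)

L_new = np.zeros((p + 1, p + 1))
L_new[:p, :p] = L
L_new[p, :p] = w
L_new[p, p] = d
```

Each MCMC proposal that adds a regressor then costs only a triangular solve rather than a full factorization.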

The use of the scaled Bayes factor for variable selection is discussed in Chap. 5.3.

To achieve consistency, the scaling factor is calibrated using the data (Chap. 5.3.1).


Simulation studies demonstrate that, after the calibration, the scaled Bayes fac-

tor performs at least as well as the unscaled Bayes factor in both heritability

estimation and prediction (Chap. 5.4).


Contents

Approvals 2

Acknowledgements 3

Abstract 4

Symbols and Notations 13

Abbreviations 14

1 Introduction 15

1.1 Genome-Wide Association Studies . . . . . . . . . . . . . . . . . . . 15

1.1.1 Some Genetic Concepts . . . . . . . . . . . . . . . . . . . . 16

1.1.2 Some Statistical Concepts . . . . . . . . . . . . . . . . . . . 20

1.2 Bayesian Linear Regression . . . . . . . . . . . . . . . . . . . . . . . 25

1.3 Applications of Bayesian Linear Regression to GWAS . . . . . . . . 28

1.3.1 Association Testing . . . . . . . . . . . . . . . . . . . . . . . 28

1.3.2 Variable Selection, Heritability Estimation and Prediction . 30

2 Distribution and P-value of the Bayes Factor 34

2.1 Distribution of Bayes Factors in Linear Regression . . . . . . . . . . 34

2.1.1 Distributions of Quadratic Forms . . . . . . . . . . . . . . . 35

2.1.2 Asymptotic Distributions of log BFnull . . . . . . . . . . . . . 39

2.1.3 Asymptotic Results in Presence of Confounding Covariates . 43

2.2 Properties of the Bayes Factor and Its P-value . . . . . . . . . . . . 45

2.2.1 Comparison with the P-values of the Frequentists’ Tests . . 45


2.2.2 Independent Normal Prior and Zellner’s g-prior . . . . . . . 47

2.2.3 Behaviour of the Bayes Factor and Three Paradoxes . . . . . 49

2.2.4 Behaviour of the P-value Associated with the Bayes Factor . 54

2.2.5 More about Simple Linear Regression . . . . . . . . . . . . . 56

2.3 Computation of the P-values Associated with Bayes Factors . . . . 59

2.3.1 Bartlett-type Correction . . . . . . . . . . . . . . . . . . . . 60

2.3.2 Bausch’s Method . . . . . . . . . . . . . . . . . . . . . . . . 62

2.3.3 Implementation of Bausch’s Method . . . . . . . . . . . . . 66

2.3.4 Calibration of the P-values . . . . . . . . . . . . . . . . . . . 69

3 A Novel Algorithm for Computing Ridge Estimators 72

3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.2 Direct Methods for Computing Ridge Estimators . . . . . . . . . . 74

3.2.1 Spectral Decomposition of XᵗX . . . . . . . . . . . . . . . . 75

3.2.2 Cholesky Decomposition of XᵗX + Σ . . . . . . . . . . . . 76

3.2.3 QR Decomposition of the Block Matrix [Xᵗ Σ^(1/2)]ᵗ . . . . . 77

3.2.4 Bidiagonalization Methods . . . . . . . . . . . . . . . . . . . 78

3.3 Iterative Methods for Computing Ridge Estimators . . . . . . . . . 79

3.3.1 Jacobi, Gauss-Seidel and Successive Over-Relaxation . . . . 79

3.3.2 Steepest Descent and Conjugate Gradient . . . . . . . . . . 82

3.4 A Novel Iterative Method Using Complex Factorization . . . . . . . 84

3.4.1 ICF and Its Convergence Properties . . . . . . . . . . . . . . 84

3.4.2 Tuning the Relaxation Parameter for ICF . . . . . . . . . . 88

3.5 Performance Comparison by Simulation . . . . . . . . . . . . . . . . 92

3.5.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

3.5.2 Wall-time Usage, Convergence Rate and Accuracy . . . . . . 93

4 Bayesian Variable Selection Regression 98

4.1 Background and Literature Review . . . . . . . . . . . . . . . . . . 98

4.1.1 Models for Bayesian Variable Selection . . . . . . . . . . . . 100

4.1.2 Methods for the Model Fitting . . . . . . . . . . . . . . . . . 103


4.2 The BVSR Model of Guan and Stephens . . . . . . . . . . . . . . . 105

4.2.1 Model and Prior . . . . . . . . . . . . . . . . . . . . . . . . 106

4.2.2 MCMC Implementation . . . . . . . . . . . . . . . . . . . . 109

4.3 A Fast Novel MCMC Algorithm for BVSR using ICF . . . . . . . . 114

4.3.1 The Exchange Algorithm . . . . . . . . . . . . . . . . . . . . 115

4.3.2 Updating of the Cholesky Decomposition . . . . . . . . . . . 116

4.3.3 Summary of fastBVSR Algorithm . . . . . . . . . . . . . . . 118

4.4 GWAS Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

4.4.1 Posterior Inference for the Heritability . . . . . . . . . . . . 120

4.4.2 Calibration of Posterior Inclusion Probabilities . . . . . . . . 123

4.4.3 Prediction Performance . . . . . . . . . . . . . . . . . . . . . 125

4.4.4 Wall-time Usage . . . . . . . . . . . . . . . . . . . . . . . . 129

5 Scaled Bayes Factors 132

5.1 Motivations for Scaled Bayes Factors . . . . . . . . . . . . . . . . . 132

5.2 An Application to Intraocular Pressure GWAS Datasets . . . . . . 135

5.3 Scaled Bayes Factors in Variable Selection . . . . . . . . . . . . . . 139

5.3.1 Calibrating the Scaling Factors . . . . . . . . . . . . . . . . 139

5.3.2 Prediction Properties . . . . . . . . . . . . . . . . . . . . . . 144

5.4 Simulation Studies for Variable Selection . . . . . . . . . . . . . . . 147

6 Summary and Future Directions 152

6.1 Summary of This Work . . . . . . . . . . . . . . . . . . . . . . . . . 152

6.2 Specific Aims for Future Studies . . . . . . . . . . . . . . . . . . . . 154

6.2.1 Bayesian Association Tests Based on Haplotype or Local Ancestry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6.2.2 Application of ICF to Variational Methods for Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

6.2.3 Extension of This Work to Categorical Phenotypes . . . . . 163

7 Appendices 167

7.1 Linear Algebra Results . . . . . . . . . . . . . . . . . . . . . . . . . 167

7.1.1 Some Matrix Identities . . . . . . . . . . . . . . . . . . . . . 167


7.1.2 Singular Value Decomposition and Pseudoinverse . . . . . . 169

7.1.3 Eigenvalues, Eigenvectors and Eigendecomposition . . . . . . 171

7.1.4 Orthogonal Projection Matrices . . . . . . . . . . . . . . . . 175

7.2 Bayesian Linear Regression . . . . . . . . . . . . . . . . . . . . . . . 179

7.2.1 Posterior Distributions for the Conjugate Priors . . . . . . . 180

7.2.2 Bayes Factors for Bayesian Linear Regression . . . . . . . . 182

7.2.3 Controlling for Confounding Covariates . . . . . . . . . . . . 185

7.3 Big-O and Little-O Notations . . . . . . . . . . . . . . . . . . . . . 190

7.4 Distribution of a Weighted Sum of χ²₁ Random Variables . . . . . . 192

7.4.1 Davies’ Method for Computing the Distribution Function . . 192

7.4.2 Methods for Computing the Bounds for the P-values . . . . 195

7.5 GCTA and Linear Mixed Model . . . . . . . . . . . . . . . . . . . . 197

7.5.1 Restricted Maximum Likelihood Estimation . . . . . . . . . 198

7.5.2 Newton-Raphson’s Method for Computing REML Estimates 200

7.5.3 Details of GCTA’s Implementation of REML Estimations . . 202

7.6 Metropolis-Hastings Algorithm . . . . . . . . . . . . . . . . . . . . 205

7.7 Used Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

7.7.1 Merged Intraocular Pressure Dataset . . . . . . . . . . . . . 208

7.7.2 Height Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 209

Bibliography 211


List of Figures

2.1 Comparison between P_BF, P_F, P_LR for simple linear regression. . . . 48

2.2 Power comparison between P_BF and P_LR for p = 2 . . . . . . . . . . 56

2.3 How BFnull changes with σ . . . . . . . . . . . . . . . . . . . . . . . 57

2.4 Calibration of P_BF and P_LR for p = 10 . . . . . . . . . . . . . . . . 70

2.5 Calibration of P_BF and P_LR for p = 20 . . . . . . . . . . . . . . . . 71

3.1 The relationship between ρ(Ψ) and n . . . . . . . . . . . . . . . . . 89

3.2 Distribution of optimal ρ(Ψ) in presence of multicollinearity . . . . 90

3.3 Wall time usage of ICF, Chol, GS, SOR and CG . . . . . . . . . . . 94

3.4 Iterations used by ICF, SOR and CG . . . . . . . . . . . . . . . . . 96

3.5 Accuracy of ICF, SOR and CG . . . . . . . . . . . . . . . . . . . . 97

4.1 Heritability estimation with 200 causal SNPs . . . . . . . . . . . . . 122

4.2 Heritability estimation with 1000 causal SNPs . . . . . . . . . . . . 123

4.3 Posterior estimation for the model size . . . . . . . . . . . . . . . . 124

4.4 Calibration of the posterior inclusion probabilities . . . . . . . . . . 126

4.5 Calibration of the PIPs with insufficient MCMC iterations . . . . . 127

4.6 Relative prediction gain of fastBVSR and GCTA . . . . . . . . . . 130

4.7 Wall time used by fastBVSR for 10K MCMC iterations. . . . . . . 131

5.1 How BFnull and sBF change with σ . . . . . . . . . . . . . . . . . . 134

5.2 Distributions of BFnull and sBF in the IOP dataset . . . . . . . . . 136

5.3 Heritability estimation using the scaled Bayes factor . . . . . . . . . 149

5.4 Calibration of the Rao-Blackwellized posterior inclusion probabilities for sBF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

5.5 Relative prediction gain of the scaled Bayes factor . . . . . . . . . . 151


List of Tables

3.1 Wall time usage of ICF, Chol, GS, SOR and CG under the null model 95

4.1 Heritability estimation with 200 causal SNPs . . . . . . . . . . . . . 121

5.1 Top 20 single SNP associations by BFnull (σ = 0.2) in the IOP dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

5.2 Top 20 single SNP associations by BFnull (σ = 0.5) in the IOP dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

5.3 Taylor series approximations for the ideal scaling factors . . . . . . 143

5.4 Heritability estimation using the scaled Bayes factor . . . . . . . . . 148


Symbols and Notations

R real numbers

a scalar a

a vector a (lowercase bold)

||a||p ℓp-norm of a

A matrix A (uppercase bold)

Aᵗ transpose of A

A∗ conjugate transpose of A

|A| determinant of A

tr(A) trace of A

I identity matrix

In identity matrix of size n × n

P(A) probability of event A

E[X] expected value of X

def.= definition

IA(x) indicator function

l.h.s. left-hand side

r.h.s. right-hand side

i.i.d. independent and identically distributed

ind. independent


Abbreviations

BVSR Bayesian variable selection regression

BFnull null-based Bayes factor

FDR false discovery rate

GWAS genome-wide association study

ICF iterative solutions using complex factorization

LASSO least absolute shrinkage and selection operator

LD linkage disequilibrium

LRT likelihood ratio test

MAF minor allele frequency

MAP maximum a posteriori

MSE mean squared error

MSPE mean squared prediction error

MCMC Markov chain Monte Carlo

MVN multivariate normal distribution

PIP posterior inclusion probability

REML restricted/residual maximum likelihood

RPG relative prediction gain

sBF scaled Bayes factor

SNP single nucleotide polymorphism

SVD singular value decomposition


Chapter 1

Introduction

1.1 Genome-Wide Association Studies

Genome-wide association studies (GWASs) refer to analyses that use a dense set

of SNPs across the whole genome with the ultimate aim of predicting disease risks

and identifying the genetic foundations for complex diseases [Bush and Moore,

2012]. Unlike candidate-gene association analysis [Hirschhorn and Daly, 2005],

GWAS scans the whole genome and a typical study may involve about one million

SNPs. Since the discovery of complement factor H as a susceptibility gene for

age-related macular degeneration (AMD) [Edwards et al., 2005, Haines et al.,

2005, Klein et al., 2005], GWASs have achieved great success in the demystification

of the genetic architecture of some common complex diseases [McCarthy et al.,

2008], such as breast cancer [Easton et al., 2007], prostate cancer [Thomas et al.,

2008], type I and type II diabetes [Todd et al., 2007, Zeggini et al., 2008] and

inflammatory bowel disease [Duerr et al., 2006]. A complete list of GWAS findings

is available at the NHGRI-EBI (National Human Genome Research Institute and

European Bioinformatics Institute) GWAS Catalog [Welter et al., 2014]. Besides,


GWASs have also shed light on individual drug metabolism and given birth to

the idea of personalized medicine (see Motsinger-Reif et al. [2013] for a review).

In this section I will explain several important genetic or statistical concepts that

are indispensable to the understanding of GWAS. Some notions like heritability

and the details of the statistical methods will be explicated in later sections.

1.1.1 Some Genetic Concepts

Single Nucleotide Polymorphism Commonly abbreviated as SNP, single nu-

cleotide polymorphism refers to a single base-pair change in the DNA sequence

that has a high prevalence, say greater than 1%, in some population. Most SNPs

are biallelic and the frequency of the less common allele is denoted by MAF (minor

allele frequency). For example, suppose we know some SNP with major allele A

and minor allele T has an MAF equal to 0.2 in some population (human DNA only

contains four types of nucleotides: A, T, C, G). Then at that SNP locus, under

certain conditions, we may estimate about 64% of that population have genotype

AA, about 4% have genotype TT and the remaining 32% have genotype AT (the human

genome is a diploid set). Here implicitly the Hardy-Weinberg equilibrium is as-

sumed, which refers to a set of conditions that allows one to assume the number

of copies of the minor allele follows a binomial distribution. SNPs with more than

two alleles are very rare and they are often excluded from the statistical analysis.

Note that in meta-analysis, when we merge datasets from different studies, great

care should be taken when dealing with AT and CG SNPs since a flipping of the

reference allele can easily cause a false positive in the association testing. Most

SNPs do not pose any obvious harm to human health, since they are either lo-

cated in the non-coding region, which occupies about 98% of the whole genome,

or the allele change is synonymous, which means the amino acid coded remains


the same. Some SNPs may cause amino acid changes and then directly alter the

protein functions; however, the overall effect on the human body is usually im-

perceptible since otherwise they would probably have been eliminated by natural

selection. SNPs with MAF lower than 5% (or sometimes 1%) are often called rare

variants [Committee, 2009, Lee et al., 2014]. Extremely rare variants, which can

be detrimental to protein functions, are often called mutations [Bush and Moore,

2012].
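The Hardy-Weinberg calculation in the example above (a hypothetical SNP with major allele A and MAF 0.2, giving roughly 64%, 32% and 4% genotype frequencies) amounts to:

```python
# Genotype frequencies under Hardy-Weinberg equilibrium for a biallelic
# SNP with minor allele frequency (MAF) 0.2: the minor-allele count is
# Binomial(2, maf).
maf = 0.2
p_tt = maf ** 2             # homozygous minor (TT)
p_at = 2 * maf * (1 - maf)  # heterozygous (AT)
p_aa = (1 - maf) ** 2       # homozygous major (AA)
```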

Quantitative and Categorical Traits In GWAS, the response variable y in a

regression model is often referred to as the trait or the phenotype. A trait can be

categorical, in fact often binary, for example the case/control status of a disease

like cancer. In such situations, an association testing between the trait and the

SNP directly estimates how likely the SNP may be disease-predisposing and its

odds ratio. For some other diseases, there exists a quantitative trait that is known

as a risk factor, for example the lipid level [Kathiresan et al., 2008]. Then the

association testing with that quantitative trait can also help us find the genetic

variants that could be used for predicting disease risks. Other typical continuous

traits used in GWASs include height [Weedon et al., 2008] and fat mass [Scuteri

et al., 2007]. Height appears to be the best candidate trait for studying the

heritability (see Chap. 1.3.2).

Common Disease-Common Variant Hypothesis Underlying most of the

GWAS methodologies targeting common complex diseases is the common disease-

common variant (CD-CV) hypothesis [Lander, 1996, Reich and Lander, 2001],

which predicts that the genetic risk factors for common diseases like diabetes and

cancer should mostly be alleles with relatively high frequencies (> 1%). Hence

the traditional family-based genetic studies may fail due to the very small effects


of the disease-predisposing variants (since the total number of the causal vari-

ants is large), but a whole-genome scan and joint analysis of the genetic variants

for hundreds or thousands of case and control subjects are believed to lead to

some findings. Early evidence supporting this hypothesis includes the high-MAF

variants in the APOE gene that are risk factors for Alzheimer’s disease [Corder

et al., 1993]. For some diseases, guided by this principle, exciting discoveries have

been made as mentioned at the beginning of this section while for some other

diseases like asthma and coronary heart diseases, GWASs have been much less

fruitful [McCarthy et al., 2008].

There are doubts about the CD-CV hypothesis [Pritchard and Cox, 2002].

As a complement, genome-wide rare-variant association studies have attracted

increasing attention in the past decade and made impressive breakthroughs [Li

and Leal, 2008]. But unlike common variant analysis, one major difficulty of the

rare variant analysis is the low statistical power (to be explained shortly) due to

the low frequencies of the variants (the effective sample size is small). See Asimit

and Zeggini [2010] and Lee et al. [2014] for reviews on the methods for rare variant

studies. In this thesis, rare variants are not given any special treatment.

Linkage Disequilibrium Linkage disequilibrium (LD) may be defined as the

degree to which an allele of one SNP is correlated with an allele of another

SNP [Bush and Moore, 2012]. In meiosis, recombination can break a chromo-

some into several segments and thus the genome of a child is a mosaic of his/her

parental genomes. However, nearby SNPs on a paternal or a maternal chromo-

some are usually inherited together since the recombination rate is low. Even

after many generations, fixed combination patterns of nearby alleles are still very

common across the whole genome and they are often called haplotypes. When two

SNPs have statistical correlation equal to 1, we say they are in perfect LD. When


the correlation is zero, we say they are in linkage equilibrium. Note that for two

biallelic SNPs, uncorrelatedness is equivalent to independence. For other com-

monly used measures for LD, see Devlin and Risch [1995]. Linkage disequilibrium

brings about both benefits and problems to GWASs. Thanks to LD, we do not

need to genotype every SNP of the human genome to search for the “causal” variants

because a SNP in LD with the truly causal variant would also show association

with the phenotype. This is called indirect association [Collins et al., 1997]. As a

result, whether the susceptibility genes can be detected would largely depend on the

degree of LD between the genotyped variant and the truly causal variant [Ohashi

and Tokunaga, 2001]. On the other hand, the efficiency of GWAS is reduced by

LD since association tests for different SNPs are correlated. We will discuss this

issue in greater detail later.
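As a small illustration of measuring LD, the squared correlation r² between two biallelic SNPs can be computed from haplotype and allele frequencies (the numbers below are made up):

```python
# Squared-correlation measure of LD between two biallelic SNPs, from
# hypothetical haplotype and allele frequencies.
p_a = 0.60    # frequency of allele A at the first SNP
p_b = 0.70    # frequency of allele B at the second SNP
p_ab = 0.50   # frequency of the A-B haplotype

d = p_ab - p_a * p_b                                # LD coefficient D
r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))   # squared correlation
```

Here r² = 1 corresponds to perfect LD and r² = 0 to linkage equilibrium.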

Imputation and Phasing Missing genotypes are common in GWAS datasets.

If the missing rate is very low, one may simply replace each missing value with

the mean or the median of that SNP. But this is clearly not the ideal solution.

Furthermore, sometimes we need to combine datasets from different studies but

they are generated from different genotyping platforms and different SNP arrays.

Because the intersection of the genotyped SNPs is often small, there can be a

large proportion of missing values. One way to make full use of the data in such

situations is to do imputation, which means to infer the missing SNPs using the

neighbouring genotyped SNPs. The key idea behind imputation is linkage

disequilibrium, i.e., the fact that SNPs located close together are not inde-

pendently distributed. Existing software packages include BIMBAM [Guan and

Stephens, 2008], IMPUTE [Howie et al., 2009], MACH [Biernacka et al., 2009] and

BEAGLE [Browning and Browning].
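The naive mean fill-in mentioned at the start of this paragraph (as opposed to LD-based imputation) is simply:

```python
import numpy as np

# Toy sketch, not a real imputer: replace missing genotypes (np.nan)
# at one SNP with the mean of the observed genotypes at that SNP.
g = np.array([0.0, 1.0, 2.0, np.nan, 1.0, np.nan])
mean_g = float(np.nanmean(g))                 # mean over observed values
g_filled = np.where(np.isnan(g), mean_g, g)
```

LD-based imputation replaces this per-SNP average with information borrowed from neighbouring SNPs, which is why it performs far better.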

Another very similar concept is called phasing, which refers to the inference


of haplotypes using genotypes. The human genome is diploid and contains two

copies of haploid genomes. Hence the genotype of a SNP takes values in {0, 1, 2},

which represents the counts of minor (or major) alleles. The current genotyping

technology can only report the genotypes but cannot distinguish which alleles

come from the same chromosome. Phasing is then used to separate a sequence

of genotypes into two copies called haplotypes. Just like imputation, phasing

also relies on linkage disequilibrium. Both of them use hidden Markov models

to model the LD and make inferences. Indeed, most of the imputation tools

can perform phasing as well. Software packages designed specifically for phasing

include SHAPEIT [Delaneau et al., 2008] and fastPHASE [Scheet and Stephens,

2006]. The booming next-generation sequencing technology is generating tons of

phased SNP data (sequencing data) and perhaps in the near future association

tests using haplotypes instead of SNPs will become a standard strategy for GWAS.

1.1.2 Some Statistical Concepts

Statistical Significance and Power Consider an association test with a single

SNP. The goal is to figure out whether the SNP has an effect (direct or indirect)

on the phenotype. Regarding the testing result there are four possible scenarios:

true positive, true negative, false positive and false negative. If a SNP that has

no effect is called positive by the test, we call it a false positive or a type I error.

If a causal SNP is not identified by the test, we call it a false negative or a type

II error. Using statistical language for hypothesis testing, we say no effect of the

SNP on the phenotype is the null hypothesis and the probability of incorrectly

rejecting a null hypothesis is then called the type I error rate or the size of the

test. In practice a type I error is usually much more harmful than a type II

error and thus when conducting a hypothesis test, we would like to control the type I error rate under some threshold, which is called the significance level, often denoted by α. To this end, a statistic named the p-value is computed, such

that by rejecting the null when the p-value is smaller than α, the type I error

rate is controlled. In fact the p-value equals the “tail probability” under the null hypothesis and is sometimes referred to as the “observed significance level”. A

smaller p-value indicates a more significant association (it does not imply a larger

effect size). Another important concept in hypothesis testing, power, is defined as

the probability of correctly rejecting the null, i.e., one minus type II error rate. It

is the most critical metric when comparing different testing methods. At a given

significance level, the method with the larger power is deemed better. The power

of a GWAS testing procedure depends on four factors: the testing method, the

significance level, the true effect size of the causal variant, and the information

we have in the data. In general, when we have a larger sample size or the causal

variant has a larger MAF, the test would have a greater power. Since in most

cases the causal SNPs have only small effects on the trait, how to construct a more powerful test is a central question for statisticians working on GWASs.

For more discussion on the power in GWAS, see de Bakker et al. [2005] among

others.

Single SNP Test The single SNP test, in which every SNP is tested separately for association with the phenotype, is the most common statistical strategy for detecting causal variants in GWAS. Here by “causal” we mean the variant has

either a direct or indirect effect, i.e., the “causal variant” may not be the exact

genetic cause of the disease but is truly correlated with the phenotype and could

be used for predicting disease risk. See Morris and Kaplan [2002] and Martin

et al. [2000] for both theoretical and empirical reasons why the single SNP test is

preferred to the multi-locus test. From a practical point of view, the multi-locus test faces many computational difficulties since the number of possible models from a GWAS dataset is enormous. Even if we only want to test all the possible two-SNP or three-SNP models, the total number is already far beyond what modern computers can handle.

For a binary trait (case/control status), traditional single SNP test methods include logistic regression and contingency-table tests. The Cochran-Armitage trend test [Agresti and Kateri, 2011, Chap. 5.3] is deemed the best choice in many settings [McCarthy et al., 2008]. For a quantitative trait, linear regression, generalized linear regression and ANOVA are often used. The genotype is often coded as 0, 1 or 2 to represent the number of copies of the minor allele, but when genotypes are imputed they may take any value within [0, 2]. The effect of the minor allele can be modelled in several ways: dominant, recessive, multiplicative and additive. A simple linear regression model with the 0, 1, 2 coding corresponds to the additive model.
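The additive single-SNP test just described amounts to a simple linear regression of the phenotype on the 0/1/2 genotype code. A minimal sketch (function and variable names are ours; the two-sided p-value uses a normal approximation to the t statistic, which is adequate at GWAS sample sizes):

```python
import math
import numpy as np

def single_snp_test(genotypes, phenotypes):
    """Simple linear regression of phenotype on a 0/1/2-coded genotype.
    Returns the estimated additive effect and a two-sided p-value based on
    the normal approximation to the t statistic."""
    x = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotypes, dtype=float)
    x = x - x.mean()          # centering absorbs the intercept
    y = y - y.mean()
    beta = (x @ y) / (x @ x)  # least-squares slope (additive effect)
    resid = y - beta * x
    n = len(y)
    sigma2 = resid @ resid / (n - 2)           # residual variance
    se = math.sqrt(sigma2 / (x @ x))           # standard error of beta
    z = beta / se
    pval = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return beta, pval

# A null SNP on simulated data should give a non-significant p-value.
rng = np.random.default_rng(0)
g = rng.integers(0, 3, size=2000)   # genotypes in {0, 1, 2}
y = rng.normal(size=2000)           # phenotype independent of g
beta, p = single_snp_test(g, y)
print(round(beta, 3), round(p, 3))
```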

Bayesian approaches to single SNP testing are less popular than non-Bayesian methods, mainly due to the difficulty of producing p-values, though the Bayesian

substitute, the Bayes factor, has its own advantages. The theoretical work in

Chap. 2 will offer a solution. Nevertheless, there are some successful attempts of

Bayesian methods for single SNP testing. See, for example, Marchini et al. [2007]

and Servin and Stephens [2007].

Confounding Covariates Apart from genetic variants, there can be other fac-

tors influencing the phenotype, for example age and sex. Theoretically speaking,

we always need to control for these confounding variables in order to avoid spurious associations and increase testing power. For a somewhat contrived example, consider a sex-linked trait like red-green color blindness. If we do not control for sex when the study subjects contain equally many males and females, a large proportion of the SNPs located on the X chromosome would probably be tested positive.

The confounding factor of greatest practical concern is population stratification. Many complex diseases are known to have different prevalence rates in different populations (e.g., Asian, African, European). If the population stratification is not appropriately accounted for when the dataset consists of subjects from different populations, ethnicity-specific SNPs are very likely to be tested positive. In such cases, “inflated” p-values can often be observed. To

explain this, recall that under the null hypothesis, the p-value is uniformly dis-

tributed on (0, 1) [Casella and Berger, 2002, Chap. 8.3]. Since in GWAS as many

as 1 million SNPs may be tested for association, most of the tests are expected

to be under the null and consequently, the p-values should exhibit a uniform dis-

tribution on (0, 1) except at the tail. But if some confounding factor fails to be

controlled for, the p-values (after ordering) can display a clear overall tendency

to be smaller than their expected values, which is called inflation. See Gamazon

et al. [2015] for a figure of inflated p-values. It seems that population stratifica-

tion was not controlled for in that study. A simple method for correcting inflation

is genomic control [Devlin and Roeder, 1999, Devlin et al., 2001]. Today people

usually prefer to use principal component analysis [Price et al., 2006]. One can perform the eigendecomposition directly and add the first three to ten principal component scores as covariates, or use software such as STRUCTURE [Pritchard et al., 2000] and EIGENSTRAT [Price et al., 2006]. The International HapMap project [The

International HapMap Consortium, 2010] provides samples from different ethnic

groups that have been very densely genotyped and are often used as the reference

panel in the principal component analysis.
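The genomic control correction mentioned above can be sketched in a few lines: the inflation factor λ is the median of the observed one-degree-of-freedom chi-squared association statistics divided by the null median of χ²₁ (≈ 0.455), and the statistics are deflated by λ. A sketch assuming the chi-squared statistics are already in hand:

```python
import numpy as np

CHI2_1_MEDIAN = 0.4549364  # median of the chi-squared distribution with 1 d.f.

def genomic_control_lambda(chi2_stats):
    """Inflation factor lambda_GC: the observed median association statistic
    divided by the theoretical null median. Values well above 1 suggest
    confounding such as population stratification."""
    return float(np.median(chi2_stats)) / CHI2_1_MEDIAN

rng = np.random.default_rng(1)
null_stats = rng.chisquare(df=1, size=100_000)
lam_null = genomic_control_lambda(null_stats)   # close to 1 under the null

inflated = 1.3 * null_stats                     # uniformly inflated statistics
lam = genomic_control_lambda(inflated)          # recovers roughly 1.3
corrected = inflated / lam                      # genomic-control deflation
print(round(lam_null, 2), round(lam, 2))
```

After deflation, the median of `corrected` matches the null median again, so recomputed p-values are no longer systematically too small.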


Correction for Multiple Testing For a single test, one simply compares the

p-value with the significance threshold α. However, when one performs multiple tests and still wants to keep the probability of making one or more type I errors below α, a more stringent cutoff for the p-values is needed. For instance, if two

independent tests both have type I error rate equal to 0.05, the probability of mak-

ing at least one type I error would be as much as 0.098. The most widely used, and

probably the most convenient, correction method is Bonferroni correction, which

is an approximation to Sidak’s formula [Sidak, 1968, 1971]. Unfortunately, both

Bonferroni and Sidak’s correction methods are derived assuming the independence

of the tests and turn out to be too exacting in GWAS so that some true signals

might have to be discarded. Thus many substitutes for Bonferroni correction have

been proposed, for example Nyholt [2004] and Conneely and Boehnke [2007]. A

non-parametric method for calculating the necessary p-value threshold is permu-

tation, which is implemented in software like PLINK [Purcell et al., 2007] and

PERMORY [Pahl and Schafer, 2010].
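The numbers above are easy to reproduce: for m independent tests each at level α, the family-wise error rate is 1 − (1 − α)^m, Šidák's exact per-test level is 1 − (1 − α)^{1/m}, and Bonferroni's α/m is a first-order approximation to it:

```python
def fwer(alpha_per_test, m):
    """Probability of at least one type I error among m independent tests."""
    return 1 - (1 - alpha_per_test) ** m

def sidak(alpha, m):
    """Exact per-test level keeping the family-wise error rate at alpha
    for m independent tests (Sidak's formula)."""
    return 1 - (1 - alpha) ** (1 / m)

def bonferroni(alpha, m):
    """First-order approximation to the Sidak level."""
    return alpha / m

print(f"{fwer(0.05, 2):.4f}")            # 0.0975, the two-test example above
print(f"{sidak(0.05, 2):.4f}")           # 0.0253
print(f"{bonferroni(0.05, 2):.4f}")      # 0.0250
print(f"{bonferroni(0.05, 10**6):.0e}")  # 5e-08, the genome-wide threshold
```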

A simple rule of thumb for determining whether the p-value is significant in

GWAS is the so-called genome-wide significance threshold. Due to the LD between the SNPs, even if all the SNPs across the whole genome are genotyped, the effective number of independent tests is much smaller. It is estimated that most of the SNPs in the human genome can be expressed as linear combinations of 500,000 to 1,000,000

SNPs. Using the dataset from Wellcome Trust Case-Control Consortium [Burton

et al., 2007], Dudbridge and Gusnanto [2008] estimated the genome-wide signif-

icance threshold to be 7.2 × 10−8, which corresponds to a 0.05 family-wise type

I error rate, for GWASs with subjects of European descent. A more widely used

threshold that can be applied to any GWAS is 5 × 10−8 [Barsh et al., 2012, Panagiotou and Ioannidis, 2012, Jannot et al., 2015]. It can be thought of as the Bonferroni correction to α = 0.05 assuming 1 million independent SNPs, and thus it should only be used when the total number of tests is greater than 1 million.

Another approach is to control the false discovery rate (FDR) [Benjamini and

Hochberg, 1995] instead of the type I error rate. In effect it allows a larger p-value

cutoff and thus more SNPs would be declared significant. With the rapid devel-

opment of biological technologies, the validation of a causal variant by molecular

experiment becomes easier and thus scientists are willing to increase the test power

at the cost of more type I errors. See Storey and Tibshirani [2003] for a discussion

on the use of FDR in GWAS. See Sun and Cai [2009] for a review on different

methods for controlling FDR.

1.2 Bayesian Linear Regression

The following Bayesian linear regression model is the main object of this work:

y | β, τ ∼ MVN(Xβ, τ^{-1}I),
β | τ, V ∼ MVN(0, τ^{-1}V),
τ | κ1, κ2 ∼ Gamma(κ1/2, κ2/2),   κ1, κ2 → 0.   (1.1)

y = (y1, . . . , yn) is the response vector, X is an n × p design matrix and β is a p-vector of regression coefficients. I denotes the identity matrix and MVN stands for the multivariate normal distribution. The first statement in (1.1) is

equivalent to

y = Xβ + ε,   ε | τ ∼ MVN(0, τ^{-1}I),


and thus implicitly the errors ε1, . . . , εn are assumed to be i.i.d. normal random

variables with mean 0 and variance τ−1. The second and the third lines of (1.1) are

called the normal-inverse-gamma prior, which is conjugate for the normal linear

model. The only prior parameter that needs to be specified is the covariance matrix V. The other parameters, κ1 and κ2, are sent to 0 to represent a noninformative setting. See Chap. 7.2 for more details and variations of this model. Here is a

summary of important points.

• All of our major results hold for the full regression model y = Wa + Xb + ε, where W represents the confounding covariates to be controlled for and X represents the variables of interest. The Bayes factor for this full model is

equivalent to the Bayes factor for model (1.1) once y and X are replaced

with their residuals after regressing out W . See Chap. 7.2.3 for proof and

discussion.

• The intercept term does not need to be included in model (1.1) due to the

reason explained in the last remark. It is equivalent to centering both X and y. However, there is a slight difference between the following two statements: (a) y | β, τ ∼ MVN(Xβ, τ^{-1}I); (b) y | β, τ, µ ∼ MVN(Xβ + µ, τ^{-1}I).

Because µ is unknown, when it is integrated out the errors “lose” one degree

of freedom. This difference, nevertheless, has very little effect, as we will see

in Chap. 2.1. The same rationale applies to regressing out W .

• The prior for τ is equivalent to the well-known Jeffreys prior, which is most

commonly used in the literature. It is the standard choice under a noninformative setting. The posteriors for β and τ are still proper.

• In rare applications, it might be more desirable to assume τ−1 is known.

The inferences with known error variance are also discussed in Chap. 7.2.


Note that as n goes to infinity, τ can be estimated precisely in the sense

that its posterior contracts to the true value. Therefore, the case of known

error variance is again a special case, or rather, the limiting case as n→∞,

of model (1.1). This intuition is very important in deriving the asymptotic

distribution of the Bayes factor (Chap. 2.1).

The conditional posterior of β given τ is (see Chap. 7.2.1)

β | y, τ, V ∼ MVN((X^tX + V^{-1})^{-1}X^ty, τ^{-1}(X^tX + V^{-1})^{-1}).

Hence, the maximum a posteriori (MAP) estimator for β is

β = (X^tX + V^{-1})^{-1}X^ty.   (1.2)

The null-based Bayes factor for model (1.1) is given by (see Chap. 7.2.2)

BFnull = |I + X^tXV|^{-1/2} ((y^ty − y^tX(X^tX + V^{-1})^{-1}X^ty) / y^ty)^{-n/2}.   (1.3)

It is straightforward to check the BFnull defined in (1.3) is invariant to the scaling

of y. Throughout this study, whenever we refer to a Bayes factor, the null model is assumed to be the reference unless otherwise stated. For the covariance matrix

V , we consider two choices.

Independent normal prior: V = σ^2 I;   (1.4)
Zellner's g-prior: V = g(X^tX)^{-1}.   (1.5)
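Equations (1.2) and (1.3) translate directly into a few lines of linear algebra. A sketch on simulated data with the independent normal prior (1.4) (function names are ours, not from any package); it also checks the scale invariance of BFnull noted after (1.3):

```python
import numpy as np

def map_estimator(X, y, V):
    """MAP estimator, equation (1.2): (X^tX + V^{-1})^{-1} X^ty."""
    return np.linalg.solve(X.T @ X + np.linalg.inv(V), X.T @ y)

def log_bf_null(X, y, V):
    """Log of the null-based Bayes factor, equation (1.3)."""
    n, p = X.shape
    quad = y @ X @ np.linalg.solve(X.T @ X + np.linalg.inv(V), X.T @ y)
    logdet = np.linalg.slogdet(np.eye(p) + X.T @ X @ V)[1]
    return -0.5 * logdet - 0.5 * n * np.log(1.0 - quad / (y @ y))

# Simulated data; X and y are centred, as discussed in Section 1.2.
rng = np.random.default_rng(2)
n, p = 500, 3
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
beta_true = np.array([0.5, 0.0, -0.3])
y = X @ beta_true + rng.normal(size=n)
y -= y.mean()

V = np.eye(p)                   # independent normal prior (1.4), sigma^2 = 1
b_map = map_estimator(X, y, V)  # slightly shrunk relative to least squares
lbf = log_bf_null(X, y, V)
print(b_map.round(2), round(lbf, 1))

# The Bayes factor is invariant to rescaling y, as noted after (1.3):
print(np.isclose(log_bf_null(X, 10 * y, V), lbf))  # True
```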


1.3 Applications of Bayesian Linear Regression

to GWAS

There is probably little doubt that linear regression is the most widely used sta-

tistical model. In genome-wide association studies (GWAS), Bayesian linear regression, though much less popular than its non-Bayesian counterpart, has been applied for various purposes in an extensive literature. For a review,

see Balding [2006] and Stephens and Balding [2009].

1.3.1 Association Testing

Although the regression model appears to imply a prospective study design (y is

random and X is fixed), it could be applied to retrospective studies as well, as

justified by Seaman and Richardson [2004]. For a quantitative trait, the single

SNP test could be performed using model (1.1) where X only has one column.

Servin and Stephens [2007] proposed such a model and discussed how to choose a

noninformative and improper prior that admits a proper Bayes factor, which will

be used in the derivation of the distribution of Bayes factors in Chap. 2. For a

binary trait, the Bayesian logistic regression model is the appropriate choice [Mar-

chini et al., 2007]. Wakefield [2009] proposed an asymptotic method which is very

efficient compared with most inference methods for the logistic regression. On a

side note, both Servin and Stephens [2007] and Marchini et al. [2007] discussed

how to perform association testing using imputed genotypes.

For many non-statisticians, an uneasy feature of the Bayesian association test,

or more generally Bayesian hypothesis testing, is that it produces a Bayes factor

instead of a p-value. To be fair, each statistic has its merits and demerits. See Kass and Raftery [1995], Lavine and Schervish [1999], Katki [2008]


and Goodman [1999] among many others for comparisons between p-values and

Bayes factors. Bayesians would probably prefer the Bayes factor since it measures

the evidence of the alternative hypothesis while the p-value does not. Another

practical advantage of the Bayes factor is its convenience in combining multiple

tests. The Bayes factor comparing a model M against the null model M0 is

defined by

BFnull(M) def= p(y | M) / p(y | M0),

where p(y | ·) denotes the marginal likelihood of the model (see Chap. 7.2.2 for

more details). Suppose we have a small candidate genomic region that contains

K SNPs and we want to average the association testing over these K SNPs. Then

the Bayes factor for this SNP set, or for this region, is simply

BFnull(M) = (1 / p(y | M0)) Σ_{i=1}^{K} p(y | x_i, M) p(x_i | M).

Similarly we can also average over the four genetic models: dominant, recessive,

additive and multiplicative. This method was implemented in Marchini et al.

[2007].

The multi-locus association test with model (1.1) is often used to model the

joint effect of the SNPs within a restricted region, for example a candidate gene.

Servin and Stephens [2007] tested all the possible K-QTN (QTN: quantitative trait nucleotide) models within the given region for K = 1, 2, 3, 4. Another potential application of multiple linear regression is rare variant studies. The sequence kernel association test (SKAT) proposed by Wu et al. [2011] and Ionita-Laza et al.

[2013] uses the variance component model, a non-Bayesian method, to identify the

rare variants associated with the phenotype. Using the Bayesian linear regression


model (1.1) with the independent normal prior (1.4), the idea of SKAT can be

interpreted as testing the null hypothesis σ2 = 0 against the alternative σ2 > 0.

1.3.2 Variable Selection, Heritability Estimation and

Prediction

A typical variable selection procedure with the regression model assumes

y = Σ_{i=1}^{N} β_i x_i + ε,   ε ∼ MVN(0, τ^{-1}I),

where N is the total number of SNPs in the dataset and most of the β's are equal to 0. Variable selection aims to simultaneously analyze all the SNPs and identify which β's are not zero. Unlike frequentist methods that aim to find a single optimal model, Bayesian variable selection tries to estimate the probability P(βi ≠ 0) for every SNP. Variable selection is one of the central topics of applied Bayesian

analysis. Chap. 4.1 gives an extensive review on the generic methods for Bayesian

variable selection based on regression.

Early attempts at Bayesian variable selection in genetic studies usually involved up to a few hundred covariates. The most typical application was the mapping

of quantitative trait loci (QTLs) [Uimari and Hoeschele, 1997, Sillanpaa and Ar-

jas, 1998, Broman and Speed, 2002, Kilpikari and Sillanpaa, 2003, Meuwissen and

Goddard, 2004]. Besides, Yi et al. [2005] and Hoti and Sillanpaa [2006] studied

the epistatic interactions and the genotype-expression interactions respectively.

Although the computational methods used in those studies were not designed for

whole-genome datasets, some key ideas and techniques were employed later in

GWASs. For instance, the Jeffreys’ shrinkage prior used by Xu [2003] and Wang

et al. [2005] (see Ter Braak et al. [2005] for a discussion on the propriety of the


posterior), the Laplace shrinkage prior used by Yi and Xu [2008], the reversible

jump MCMC approach taken by Lunn et al. [2006], the stochastic search vari-

able selection algorithm used by Yi et al. [2003] and the composite model space

search of Yi [2004]. Recent years have witnessed an increasing number of reward-

ing applications of Bayesian variable selection to GWASs. Li et al. [2011] used

Bayesian LASSO to detect the genes associated with body mass index; Ishwaran

and Rao [2011] applied a Gibbs sampling scheme proposed in Ishwaran and Rao

[2000] to microarray data analysis for colon cancer; Stahl et al. [2012] proposed an approximate Bayesian computation method and studied a GWAS dataset for

rheumatoid arthritis.

One particular application that needs emphasis is the heritability estimation.

The narrow-sense heritability refers to the proportion of the phenotypic variation

that is due to additive genetic effects. It can be reliably estimated from close rela-

tive data, especially twin studies [Gielen et al., 2008], using the statistical methods

that can be traced back to Galton [Galton, 1894] and Fisher [Fisher, 1919]. For example, the heritability of height was estimated to be between 0.77 and 0.91 [Macgregor et al., 2006]. In stark contrast, early GWASs on tens of thousands of

individuals detected around 50 variants statistically associated with height, but

in total they could only explain about 5% of the phenotypic variance [Yang et al.,

2010]. A later study increased the proportion of variance explained to 10% after

identifying 180 associated loci from 183, 727 individuals [Allen et al., 2010]. This

huge gap is referred to as the “missing heritability”. Two reasons immediately stood out. First, many causal variants may be neither genotyped nor in complete linkage disequilibrium with the genotyped ones. Second, most causal variants may

only contribute a very small amount of variation and thus fail to reach the sig-

nificance thresholds. Other theories like rare variants and epistatic interactions

cannot explain the fact that most heritability is missing.


There were few methodological studies on the heritability estimation before

the advent of GWAS. But one of them, Meuwissen et al. [2001], compared by simulation the performance of Bayesian models and the linear mixed model, which represent the current mainstream approaches to heritability estimation. The first sensible heritability estimate from real GWAS datasets

was attained by GCTA [Yang et al., 2011], a package that implemented the re-

stricted maximum likelihood inference for linear mixed model (see Chap. 7.5.1).

It estimated the heritability of height to be about 45% from 3, 925 subjects [Yang

et al., 2010]. The rationale of GCTA is in fact the same as that of the classical methods using relative data: if two individuals are genetically similar, they should have similar phenotypes. GCTA uses all the SNPs in the dataset to calculate the genetic relatedness between individuals and infers the heritability from that genetic relationship matrix. A Bayesian approach was later proposed by Guan and

Stephens [2011] where the heritability was modelled as a hyperparameter and the

regression coefficients were treated with the standard spike-and-slab prior. The

model was then generalized by Zhou et al. [2013] where the prior for the regression

coefficients became a mixture of two normal distributions. Compared with GCTA,

the Bayesian methods have the following advantages. First, GCTA assumes every SNP makes a small i.i.d. contribution to the phenotype, an assumption that could be seriously violated for some phenotypes. In contrast, Bayesian methods do not rely on this assumption and can pick out the SNPs with larger effects by variable selection. Second, the heritability estimator of GCTA usually has a larger variance than the Bayesian estimates. Last but not least, the prediction performance of GCTA is poor, which deserves further comment.

Prediction is one of the ultimate goals of both GWAS and variable selection.

A model that perfectly fits the observed data is practically useless if it has no


predictive power. This is why in variable selection, given so many potential pre-

dictors we only want to select a small number of them and shrink their regression

coefficients. Although GCTA is computationally very fast and could provide un-

biased estimators for the regression coefficients when the model assumptions hold,

its prediction performance, shown in [Zhou et al., 2013, Fig. 2], is much worse than

the Bayesian approaches. For more general purposes, Guan and Stephens [2011]

showed that BVSR (Bayesian variable selection regression) outperformed LASSO (least absolute shrinkage and selection operator) [Tibshirani, 1996], one of the most widely used non-Bayesian variable selection procedures. This advantage is partly due to model averaging [Raftery et al., 1997].


Chapter 2

Distribution and P-value of the

Bayes Factor

2.1 Distribution of Bayes Factors in Linear

Regression

The null-based Bayes factor for the linear regression model introduced in Sec-

tion 1.2 is given by

BFnull(X, V) = |I + X^tXV|^{-1/2} (1 − y^tX(X^tX + V^{-1})^{-1}X^ty / y^ty)^{-n/2},   (2.1)

where V is the normalized prior variance matrix for β (see Section 1.2 for the

notation). In order to calculate the p-value for this Bayes factor, it is necessary

to characterize its distribution under the null model, that is, y ∼ MVN(0, τ−1I).

To this end, the distribution of the quadratic form ztX(X tX + V −1)−1X tz is

first identified, where z is a multivariate normal variable with covariance matrix I.

Then it is used to derive the asymptotic distribution of log BFnull. The distribution


of log BFnull under the alternative model (β ≠ 0) is also discussed.

2.1.1 Distributions of Quadratic Forms

Throughout this chapter, we define

H def= X(X^tX + V^{-1})^{-1}X^t   (2.2)

which is symmetric and thus admits a spectral decomposition,

H = UΛU^t = Σ_{i=1}^{p} λ_i u_i u_i^t,   (2.3)

with eigenvalues (diagonals of Λ) λ1 ≥ · · · ≥ λp ≥ 0 = λp+1 = · · · = λn and

eigenvectors u1, . . . , un. We first prove a proposition about these eigenvalues. See

Chap. 7.1.3 for the definition and properties of eigenvalues.

Proposition 2.1. Let the spectral decomposition of H be given by (2.3). Then,

(a) 0 ≤ λi < 1 for 1 ≤ i ≤ p;

(b) H and X have the same (left) null space;

(c) log(|I + X^tXV|) = −Σ_{i=1}^{p} log(1 − λ_i).

Proof. Suppose λi ≠ 0. Since we may write

H = XV^{1/2}(V^{1/2}X^tXV^{1/2} + I)^{-1}V^{1/2}X^t,

by Lemma 7.8, λi is also an eigenvalue of (V^{1/2}X^tXV^{1/2} + I)^{-1}V^{1/2}X^tXV^{1/2}; λi/(1 − λi) is then an eigenvalue of V^{1/2}X^tXV^{1/2} by spectral decomposition. Next

we claim V 1/2X tXV 1/2 and H have the same number of nonzero eigenvalues.


This is because

rank(V^{1/2}X^tXV^{1/2}) = rank(X^tX) = rank(X) = rank(H).

The first two equalities follow from the properties of rank. The last equality needs

more explanation. If Hz = 0 for some vector z, then z^tHz = 0. Since (X^tX + V^{-1})^{-1} is positive definite, this implies that X^tz = 0. Hence H and X have the same (left) null space (the other direction is trivial), which implies rank(X) = rank(H). Since V^{1/2}X^tXV^{1/2} is positive semi-definite, its eigenvalues must be nonnegative, and thus λi ∈ [0, 1). Lastly, by Sylvester's determinant formula, log(|I + X^tXV|) = log |I + V^{1/2}X^tXV^{1/2}|, which finishes the proof.

Define the second term in the expression of BFnull in (2.1), which depends on y, by R, i.e.,

2 log R def= −n log(1 − y^tX(X^tX + V^{-1})^{-1}X^ty / y^ty) = −n log(1 − y^tHy/y^ty).

By Proposition 2.1 (c),

2 log BFnull = 2 log R + Σ_{i=1}^{p} log(1 − λ_i).

Since λ1, . . . , λp are constants that depend only on the data X and prior V , to

characterize the distribution of 2 log BFnull, we only need to figure out the distribu-

tion of 2 logR. Instead of considering the null and the alternative separately, let’s

assume y ∼ MVN(Xβ, τ^{-1}I). Then the alternative model corresponds to the cases where β ≠ 0 (or β follows a nondegenerate distribution), and the null model is


a special case with β = 0. Define the standardized response variable z by

z def= τ^{1/2}y ∼ MVN(τ^{1/2}Xβ, I),

and rewrite the expression for 2 logR by

2 log R = −n log(1 − z^tHz/z^tz).

The statistic 2 logR is closely related to the likelihood ratio test (LRT) statistic,

which can be written as (see Chap. 2.2.1)

2 log LR = −n log(1 − z^tH0z/z^tz), where H0 def= X(X^tX)^+X^t.   (2.4)

H0 is the hat matrix in traditional linear regression by least squares (Chap. 7.1.4).

Clearly the distributions of the two statistics are determined by the distributions

of the quadratic forms z^tHz, z^tH0z and z^tz.

Definition 2.1. A random variable Q is said to have a noncentral chi-squared

distribution with 1 degree of freedom and noncentrality parameter ρ ≥ 0, if it has

the same distribution as (Z + √ρ)^2 where Z ∼ N(0, 1). The distribution of Q is denoted by Q ∼ χ^2_1(ρ). When ρ = 0, the distribution of Q reduces to the central chi-squared distribution and is denoted by Q ∼ χ^2_1.

Proposition 2.2. Let the spectral decomposition of H be given by (2.3) and

assume rank(X) = p′. Let z ∼ MVN(τ^{1/2}Xβ, I). Then,

(a) z^tH0z = Σ_{i=1}^{p′} (u_i^tz)^2 and z^tHz = Σ_{i=1}^{p′} λ_i(u_i^tz)^2;

(b) z^tz = z^tH0z + Σ_{i=p′+1}^{n} (u_i^tz)^2;

(c) For 1 ≤ i ≤ p′, the (u_i^tz)^2 are independently distributed as χ^2_1(τ(u_i^tXβ)^2);

(d) Σ_{i=p′+1}^{n} (u_i^tz)^2 def= Q0 ∼ χ^2_{n−p′} and is independent of (u_i^tz)^2 for 1 ≤ i ≤ p′.

Proof. By the spectral decomposition of H, z^tHz = (U^tz)^tΛ(U^tz). Since U is orthogonal, U^tz ∼ MVN(τ^{1/2}U^tXβ, I_n). The covariance matrix is diagonal and thus the (u_i^tz)^2 are mutually independent by Bernstein's Theorem [Lukacs and King, 1954]. Note that this result is not trivial at all (uncorrelatedness is not equivalent to independence), and it will be used frequently in the following proofs. Part (c) is then self-evident. Recall that H and X have the same rank and the same left null space by Proposition 2.1 (b). Thus for i > p′, λi = 0 and u_i^tX = 0. Part (d) is then proved. Part (b) is immediate from part (a) since z^tz = z^tUU^tz.

So the only remaining task is to work out the spectral decomposition of H0. By

Proposition 7.15, H0 has p′ eigenvalues equal to 1 and the rest are zero. We claim

its spectral decomposition can be written as

H0 = UΛ0U^t = Σ_{i=1}^{p′} u_i u_i^t,   (2.5)

where Λ0 = diag(1, . . . , 1, 0, . . . , 0) and U are the eigenvectors of H as defined

in (2.3). To prove this, let the singular value decomposition (Chap. 7.1.2) of X be X = U0 D0 V0^t. Then H0 = U0 D0 D0^+ U0^t, which implies that the null space of

H0 is the same as the left null space of X, and thus the same as the null space of

H from Proposition 2.1 (b). Hence H0 and H also have the same column space

and there exist orthogonal matrices E1, E2 such that U = U0 diag(E1,E2).

Immediately we have the following corollary under the null.

Corollary 2.3. If $z \sim \mathrm{MVN}(0, I_n)$, then $z^tH_0z \sim \chi^2_p$ and $z^tHz = \sum_{i=1}^{p}\lambda_i(u_i^tz)^2$, where the $(u_i^tz)^2$ are independent $\chi^2_1$ random variables.


Both H and H₀ are positive semi-definite matrices. By their spectral decompositions given in Proposition 2.2 (a), we have the following lemma.

Lemma 2.4. For any vector z ∈ ℝⁿ, $0 \le z^tHz \le z^tH_0z$.

2.1.2 Asymptotic Distributions of log BFnull

In this section it will be shown that, loosely speaking, when the sample size n is sufficiently large, 2 log R has approximately the same distribution as $z^tHz$, provided $z^tHz$ does not grow or grows slowly. To explain the ideas, consider a sequence of datasets with sample size n tending to infinity. For simplicity (and to avoid confusion), we drop the subscript n from z, X, H, H₀, but keep in mind that they always depend on n. Our interest is the limiting distribution of 2 log Rₙ. It might not exist, since $z^tHz$ may grow very quickly. For example, consider the simplest (though unrealistic) case with p = 1, βₙ = 1 and X = 1. Then $(u_1^tz)^2$ grows at rate n, and thus 2 log Rₙ would eventually blow up. This is expected, since the evidence supporting the alternative model accumulates as n → ∞. Thus, in order to discuss the asymptotic distribution of 2 log Rₙ, we need some constraint on the growth rate of $z^tHz$. For this purpose, we assume $z^tH_0z = O_p(1)$, where $O_p(1)$ denotes stochastic boundedness. Later we will see that this assumption is indeed very convenient. For more explanation of the stochastic big-O and little-o notations, see Chap. 7.3. By Lemma 2.4, we have the following.

Lemma 2.5. If $z^tH_0z = O_p(1)$, then $z^tHz = O_p(1)$.

The main result is now given below. Without loss of generality, we assume X always has full rank. The results can easily be extended to the rank-deficient case, but that case is of little practical interest.


Proposition 2.6. If $z^tH_0z = O_p(1)$, then
$$2\log R_n = -n\log(1 - z^tHz/z^tz) = z^tHz + o_p(1).$$

Proof. Assuming X is full rank, by Proposition 2.2, $z^tz = Q_0 + z^tH_0z = Q_0 + O_p(1)$, where $Q_0 = \sum_{i=p+1}^{n}(u_i^tz)^2$ follows a chi-squared distribution with n − p degrees of freedom. By direct calculation, $E(Q_0/n) \to 1$ and $\mathrm{Var}(Q_0/n) \to 0$. Hence,
$$z^tz/n = Q_0/n + O_p(1)/n = Q_0/n + o_p(1) \stackrel{P}{\to} 1.$$
By the continuous mapping theorem [van der Vaart, 2000, Chap. 2.1], $n/z^tz = 1 + o_p(1)$. Although $z^tz$ and $z^tHz$ are correlated, by Slutsky's theorem [van der Vaart, 2000, Chap. 2.1] we can write
$$z^tHz/z^tz = z^tHz\,(1 + o_p(1))/n = O_p(1)/n = O_p(1/n),$$
$$1 - z^tHz/z^tz = 1 - \frac{z^tHz}{n}(1 + o_p(1)) = 1 - \frac{z^tHz}{n} + o_p\!\left(\frac{1}{n}\right),$$
since $z^tHz = O_p(1)$. Another way to verify this is to use the fact that $z^tHz/z^tz < z^tHz/Q_0$; since $z^tHz$ and $Q_0$ are independent, $z^tHz/Q_0$ can be shown to be $O_p(1/n)$. By Taylor expansion with Peano's form of the remainder,
$$-n\log(1 - z^tHz/z^tz) = z^tHz + o_p(1) + n\,o_p(1/n) = z^tHz + o_p(1).$$

Piecing together Proposition 2.1, Proposition 2.2, Proposition 2.6, we arrive

at the main result of this section.


Theorem 2.7 (Asymptotic distribution of BFnull). Let $Q_i \stackrel{\text{def.}}{=} (u_i^tz)^2$. If $z^tH_0z = O_p(1)$, then
$$2\log\mathrm{BF}_{\text{null}} = \sum_{i=1}^{p}\left(\lambda_iQ_i + \log(1-\lambda_i)\right) + o_p(1), \qquad Q_i \stackrel{\text{ind.}}{\sim} \chi^2_1\!\left(\tau(u_i^tX\beta)^2\right).$$

Finally, we need to figure out when the condition $z^tH_0z = O_p(1)$ holds. It clearly holds under the null, since $z^tH_0z \sim \chi^2_p$. Thus the following corollary is immediate.

Corollary 2.8. Let $Q_i = (u_i^tz)^2$. Under the null,
$$2\log\mathrm{BF}_{\text{null}} = \sum_{i=1}^{p}\left(\lambda_iQ_i + \log(1-\lambda_i)\right) + o_p(1), \qquad Q_i \stackrel{\text{i.i.d.}}{\sim} \chi^2_1.$$
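As a concrete aside (not part of the thesis derivation), the null tail probability of the weighted sum $\sum_i \lambda_i\chi^2_1$ can be approximated by plain Monte Carlo when no specialized method is at hand; the weights and observed statistic below are made-up values.

```python
import numpy as np

def p_bf_mc(lams, t_obs, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(sum_i lam_i * chi2_1 > t_obs), the null
    tail probability used to compute P_BF from Corollary 2.8."""
    rng = np.random.default_rng(seed)
    lams = np.asarray(lams, dtype=float)
    # each row: one draw of the p independent chi2_1 variables
    q = rng.chisquare(df=1, size=(n_draws, lams.size))
    return float(np.mean(q @ lams > t_obs))

# sanity check: with a single unit weight this is the chi2_1 tail, ~0.05 at 3.841
print(p_bf_mc([1.0], 3.841))
```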

This result can be further generalized to the case of non-normal errors.

Corollary 2.9. Consider the null model y = ε, where ε₁, …, εₙ are i.i.d. random variables with $E[\varepsilon_i] = 0$, $\mathrm{Var}(\varepsilon_i) = \tau^{-1}$, and $E[\varepsilon_i^4] < \infty$. Let $Q_i = (u_i^tz)^2 = \tau(u_i^t\varepsilon)^2$. Then
$$2\log\mathrm{BF}_{\text{null}} = \sum_{i=1}^{p}\left(\lambda_iQ_i + \log(1-\lambda_i)\right) + o_p(1).$$

Proof. Clearly $z^tHz$ is still $O_p(1)$. The proof of Proposition 2.6 shows that we only need to check $Q_0/n \stackrel{P}{\to} 1$, where $Q_0 = \sum_{i=p+1}^{n}(u_i^tz)^2$, since the rest follows from Slutsky's theorem and Taylor expansion. Note that we do not need the independence between $z^tHz$ and $Q_0$, which fails when the errors are not normally distributed. Since $E[\tau^{1/2}u_i^t\varepsilon] = 0$, we have
$$E[Q_0] = \sum_{i=p+1}^{n}\mathrm{Var}(\tau^{1/2}u_i^t\varepsilon) = n - p, \qquad \mathrm{Var}(Q_0) = \sum_{i=p+1}^{n} E[(\tau^{1/2}u_i^t\varepsilon)^4] = O(n).$$
Hence we conclude $Q_0/n \stackrel{P}{\to} 1$, completing the proof.

Under the alternative model, we have to limit the growth rate of the noncentrality parameter of $Q_i$, i.e., $\tau(u_i^tX\beta)^2$, since
$$z^tH_0z \sim \chi^2_p\!\left(\tau\sum_{i=1}^{p}(u_i^tX\beta)^2\right),$$
where $\chi^2_p(\rho)$ denotes a noncentral chi-squared random variable with p degrees of freedom and noncentrality parameter ρ. We borrow the idea of local alternatives from frequentist theory and assume $\beta_n = \beta_0/\sqrt{n\tau}$. Then,
$$\tau\sum_{i=1}^{p}(u_i^tX\beta)^2 = \frac{1}{n}\sum_{i=1}^{p}(u_i^tX\beta_0)^2 = \frac{1}{n}\sum_{i=1}^{n}(u_i^tX\beta_0)^2 = \frac{1}{n}(U^tX\beta_0)^t(U^tX\beta_0) = \frac{1}{n}\beta_0^tX^tX\beta_0.$$

β₀ is now fixed, but $\beta_0^tX^tX\beta_0/n$ may still blow up if the entries of X grow with n. To rule this out, we need an additional constraint. Two practical choices are that $X^tX/n$ converges or that X is bounded entrywise. The second condition is easy to check. To see that the first condition works, recall that a weakly convergent sequence of random variables is always stochastically bounded [Shao, 2003, Chap. 1.6 (127)]. To summarize,

Corollary 2.10. Let $Q_i = (u_i^tz)^2$. Under a sequence of local alternatives with $\beta = \beta_0/\sqrt{n\tau}$, if either $X^tX/n$ converges or X is bounded entrywise, then
$$2\log\mathrm{BF}_{\text{null}} = \sum_{i=1}^{p}\left(\lambda_iQ_i + \log(1-\lambda_i)\right) + o_p(1), \qquad Q_i \stackrel{\text{ind.}}{\sim} \chi^2_1\!\left(\tfrac{1}{n}(u_i^tX\beta_0)^2\right).$$

2.1.3 Asymptotic Results in the Presence of Confounding Covariates

According to Chap. 7.2.3, if an n × q matrix W, representing confounding covariates, needs to be controlled for, it suffices to regress it out from both y and X and compute BFnull with the residuals. The resulting Bayes factor is exactly the Bayes factor for the full model (see (7.13)). It is tempting to conclude that the asymptotic distribution of BFnull remains the same; this is true, but it requires additional arguments since the distribution of z (or y) has changed.

Some notation needs to be redefined. Let $P = I - W(W^tW)^{-1}W^t$ be the projection matrix that maps a vector to its residuals after regressing out W. Let L be the matrix of the covariates of interest and redefine $X \stackrel{\text{def.}}{=} PL$. Let H and H₀ still be as defined in (2.2) and (2.4). If we compare the full model $y = Wa + L\beta + \varepsilon$ against the null model $y = Wa + \varepsilon$, then the expression for 2 log R becomes (see Chap. 7.2.3 for more details)
$$2\log R = -n\log(1 - z^tHz/z^tPz), \tag{2.6}$$
where $z = \tau^{1/2}y$. We may replace z by $z = \tau^{1/2}Py$, and 2 log R would have exactly the same expression as before since P is idempotent; however, we use form (2.6) as it is more convenient. Recall the spectral decomposition $H = \sum_{i=1}^{p}\lambda_iu_iu_i^t$ defined in (2.3). The distributions of $z$, $z^tPz$, $z^tHz$ and $z^tH_0z$ are given in the following lemma.


Lemma 2.11. Assume $y \sim \mathrm{MVN}(Wa + L\beta, \tau^{-1}I)$. Then

(a) $z \sim \mathrm{MVN}(\tau^{1/2}X\beta, P)$;

(b) $z^tHz = \sum_{i=1}^{p}\lambda_i(u_i^tz)^2$ and $z^tH_0z = \sum_{i=1}^{p}(u_i^tz)^2$, where $(u_i^tz)^2 \sim \chi^2_1(\tau(u_i^tX\beta)^2)$;

(c) $z^tPz = \sum_{i=1}^{p}(u_i^tz)^2 + Q_0$, where $Q_0$ is a random variable that follows $\chi^2_{n-p-q}$.

Proof. Parts (a) and (b) follow from the fact that $PL = X$ and $PW = 0$. To prove part (c), we need to work out the spectral decomposition of P. Since P is a projection matrix, it has n − q unit eigenvalues and q zero eigenvalues. For 1 ≤ i ≤ p, $u_i$ satisfies $Hu_i = \lambda_iu_i$, which implies $Pu_i = u_i$ since $PX = X$. Hence $u_i$ is also an eigenvector of P corresponding to eigenvalue 1. Similarly, if $Pv = 0$ for some vector v, then v is also an eigenvector of H with zero eigenvalue. As a result, we may reorder and rotate $u_{p+1}, \dots, u_n$ so that
$$P = \sum_{i=1}^{n-q}u_iu_i^t,$$
and thus $z^tPz$ has the given decomposition with $Q_0 = \sum_{i=p+1}^{n-q}(u_i^tz)^2 \sim \chi^2_{n-p-q}$.

Inspection shows that Proposition 2.6 still holds because $Q_0/n \stackrel{P}{\to} 1$. Corollary 2.8 and Corollary 2.10 follow immediately since the distribution of $z^tH_0z$ has not changed. Since the results in the presence of confounding covariates are consistent with those for the simpler model $y = X\beta + \varepsilon$, in the remaining sections of this chapter I focus on the latter model unless otherwise stated. Because the intercept term, 1, is usually treated as a confounding covariate, by using the simpler model we are actually assuming that both X and y are centered.


2.2 Properties of the Bayes Factor and Its P-value

Using the asymptotic distribution of log BFnull, we can calculate an asymptotic p-value associated with BFnull, which we denote by PBF, and study the properties of BFnull and PBF theoretically. In Chaps. 2.2.3, 2.2.4 and 2.2.5 we simplify the discussion by omitting the $o_p(1)$ term in Theorem 2.7. We may safely do so because when the error variance $\tau^{-1}$ is known, we have exactly (see Chap. 7.2)
$$2\log\mathrm{BF}_{\text{null}} = \sum_{i=1}^{p}\left(\lambda_iQ_i + \log(1-\lambda_i)\right), \qquad Q_i \stackrel{\text{ind.}}{\sim} \chi^2_1\!\left(\tau(u_i^tX\beta)^2\right).$$
For a sufficiently large sample size, τ can be estimated very accurately, in the sense that its posterior distribution contracts to the true value.

2.2.1 Comparison with the P-values of the Frequentists’

Tests

We compare PBF with two non-Bayesian p-values, the p-value of the F test, de-

noted by PF, and the p-value of the LRT, denoted by PLR. We still let z = τ 1/2y

denote the standardized response variable. In frequentists’ language, usually yty

is denoted by SST (total sum of squares), yty − ytH0y is denoted by SSE (sum

of squares of errors) and ytH0y is denoted by SSReg (sum of squares due to re-

gression). Assuming both y and X have been centered, the F test statistic, by

definition is

F =SSReg/p

SSE/(n− p− 1)=

ztH0z/p

zt(I −H0)z/(n− p− 1)∼ F(p,n−p−1). (2.7)


The test statistic of the LRT is derived as follows:
$$2\log\mathrm{LR} = 2\log\frac{\sup_{\tau,\beta}f(y\mid\tau,\beta)}{\sup_{\tau}f(y\mid\tau,\beta=0)} = -n\log(1 - y^tH_0y/y^ty) = -n\log(1 - z^tH_0z/z^tz) = \sum_{i=1}^{p}Q_i + o_p(1) \stackrel{D}{\to} \chi^2_p. \tag{2.8}$$
This is a special case of Wilks' [1938] theorem. Hence the F test, which is exact, and the LRT are asymptotically equivalent. The Bayes factor, as an averaged (or penalized) likelihood ratio, has a form very similar to the LRT statistic.

A special case is simple linear regression. Recall from the results of Chap. 2.1.2 that when p = 1,
$$2\log\mathrm{BF}_{\text{null}} = -n\log\!\left(1 - \frac{\lambda_1Q_1}{Q_1 + Q_0}\right) + \log(1-\lambda_1) \approx \lambda_1Q_1 + \log(1-\lambda_1),$$
where $Q_0 = \sum_{i=2}^{n-1}(u_i^tz)^2 \sim \chi^2_{n-2}$ and $Q_1 = (u_1^tz)^2$. Under the null, $Q_1 \sim \chi^2_1$. We make two observations.

Proposition 2.12. Let PBF be the asymptotic p-value calculated by Corollary 2.8, PF the p-value of the F test, and PLR the p-value of the likelihood ratio test. If p = 1, then

(a) PBF is asymptotically equivalent to PLR;

(b) PF is the true p-value for the Bayes factor.

Proof. PBF is asymptotically equal to PLR since
$$2\log\mathrm{BF}_{\text{null}} = \lambda_1(2\log\mathrm{LR}) + \log(1-\lambda_1) + o_p(1).$$


(BFnull is asymptotically a monotone function of LR.) The second statement holds because BFnull is a monotone function of F:
$$2\log\mathrm{BF}_{\text{null}} = -n\log\!\left(1 - \frac{\lambda_1}{1 + (n-2)/F}\right) + \log(1-\lambda_1).$$

To compare the three p-values, we fix λ₁ = 0.8, n = 100, $Q_0 = E[\chi^2_{98}] = 98$, and try different values of Q₁. The result is shown in Figure 2.1. For a limited sample size of n = 100, PLR and PBF are very close to each other and deviate from the truth only when the p-value is extremely small. Later, in Chap. 2.3.1, we will see how to correct the test statistics so that the asymptotic p-values are better calibrated.
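The curves in Figure 2.1 are easy to reproduce numerically. A minimal sketch (using SciPy, with the p = 1 formulas above and the fixed values λ₁ = 0.8, n = 100, Q₀ = 98):

```python
import numpy as np
from scipy import stats

def three_pvalues(q1, q0=98.0, n=100, lam1=0.8):
    """P_F, P_LR and P_BF for simple linear regression (p = 1)."""
    # exact F test: F ~ F(1, n-2) under the null
    p_f = stats.f.sf(q1 / (q0 / (n - 2)), 1, n - 2)
    # LRT: 2*log(LR) is asymptotically chi2_1
    p_lr = stats.chi2.sf(-n * np.log(1 - q1 / (q1 + q0)), 1)
    # Bayes-factor statistic; its null distribution is lam1 * chi2_1
    # (the log(1 - lam1) terms cancel), so rescale and use chi2_1
    bf_stat = -n * np.log(1 - lam1 * q1 / (q1 + q0))
    p_bf = stats.chi2.sf(bf_stat / lam1, 1)
    return p_f, p_lr, p_bf

print(three_pvalues(4.0))   # the three p-values are close for moderate Q1
```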

2.2.2 Independent Normal Prior and Zellner’s g-prior

Now consider two special cases of the prior covariance matrix for β, V . The first

is called the independent normal prior, which assumes βi’s to be i.i.d. normal

variables a priori. The second choice is the well-known Zellner’s g-prior [Zellner,

1986].

Proposition 2.13. Consider the independent normal prior, $V = \sigma^2I$, and Zellner's g-prior, $V = g(X^tX)^{-1}$ with g > 0.

(a) Under both the independent normal prior and the g-prior, the columns of U (the eigenvectors of H) are the left-singular vectors of X.

(b) Under the independent normal prior, $\lambda_i = d_i^2/(d_i^2 + \sigma^{-2})$, where $d_i$ is the i-th singular value of X ($|d_1| \ge \dots \ge |d_p| \ge 0$); under the g-prior, $\lambda_i = g/(g+1)$ for 1 ≤ i ≤ p.


Figure 2.1: Comparisons between PBF, PF and PLR for p = 1, on both the raw and the log₁₀ scale. We use n = 100, fix SSE = 98 (SSE: sum of squares of errors) and try different values for SSReg (sum of squares due to regression).


(c) Under both the independent normal prior and the g-prior, $Q_i = (u_i^tz)^2 \stackrel{\text{ind.}}{\sim} \chi^2_1(\tau d_i^2(v_i^t\beta)^2)$, where $v_i$ is the i-th right-singular vector of X.

Proof. Let the singular value decomposition of X be $X = U_0D_0V_0^t$. Then under the independent normal prior, $H = U_0D_0(D_0^2 + \sigma^{-2}I)^{-1}D_0U_0^t$; under the g-prior, $H = \frac{g}{g+1}U_0U_0^t$. Parts (a) and (b) then follow. To prove part (c), notice that $u_i^tX\beta = u_i^tU_0D_0V_0^t\beta = d_iv_i^t\beta$.

Under the independent normal prior, PBF differs from PLR since it assigns different weights to different directions of β. The direction of the first (principal component) loading vector of the data matrix X, which corresponds to the singular value d₁, receives the largest weight. In contrast, under the g-prior, PBF behaves just like PLR since it treats every direction equally.

Under the g-prior, the expression for BFnull simplifies to
$$\mathrm{BF}_{\text{null}} = (g+1)^{-p/2}\left(1 - \frac{g}{g+1}\frac{y^tH_0y}{y^ty}\right)^{-n/2} = (g+1)^{(n-p)/2}\left(1 + g(1-r^2)\right)^{-n/2}, \tag{2.9}$$
where r² is the coefficient of determination from classical linear regression.

2.2.3 Behaviour of the Bayes Factor and Three Paradoxes

Using the asymptotic approximation
$$2\log\mathrm{BF}_{\text{null}} \approx \sum_{i=1}^{p}\left(\lambda_iQ_i + \log(1-\lambda_i)\right), \qquad Q_i = (u_i^tz)^2, \tag{2.10}$$


it is relatively easy to understand the genesis of three famous paradoxes of the

Bayes factor.

Jeffreys-Lindley's paradox For any fixed significance level α, one can always find a sample size such that an effect is statistically significant (i.e., the p-value is less than α) while the posterior probability of the null model is greater than 1 − α [Lindley, 1957, Naaman, 2016]. Consider our model with the independent normal prior. We may fix the values of the Qᵢ so that PLR is a constant less than α and then let n go to infinity. By Proposition 2.13, λ₁, …, λ_p all go to 1 and thus BFnull ↓ 0. Hence a Bayesian would accept the null model.

Bartlett's paradox When lacking prior information, people prefer to use a noninformative prior, which assumes a flat shape for the parameter's prior distribution. However, using a noninformative prior can unintentionally favor the null model [Bartlett, 1957, Liang et al., 2008]. In our model, a noninformative prior should let the prior variance of β go to infinity: for the independent normal prior, this means letting σ² ↑ ∞; for the g-prior, letting g ↑ ∞. In both cases, with n fixed, λ₁, …, λ_p all go to 1 and thus BFnull ↓ 0.

Information paradox Under the g-prior, BFnull can be expressed using only g, n, p and r², as in (2.9). Suppose g, n, p are fixed and let r² ↑ 1. We then have $\mathrm{BF}_{\text{null}} \uparrow (g+1)^{(n-p)/2}$. This is undesirable: as r² ↑ 1, the evidence for the alternative model becomes overwhelming and decisive, yet BFnull, which is supposed to measure that evidence, converges to a finite constant. Since r² ↑ 1 represents the accumulation of information, this is called the information paradox [Liang et al., 2008].
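The information paradox can be seen directly from (2.9); a small numerical sketch (with arbitrary g, n, p):

```python
def bf_null_g(g, n, p, r2):
    """BF_null under Zellner's g-prior, equation (2.9)."""
    return (g + 1) ** ((n - p) / 2) * (1 + g * (1 - r2)) ** (-n / 2)

g, n, p = 10.0, 50, 2
for r2 in (0.9, 0.99, 0.999, 1.0):
    print(r2, bf_null_g(g, n, p, r2))
# BF_null increases in r2 but plateaus at (g+1)**((n-p)/2) as r2 -> 1
```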


In fact, none of the three paradoxes is a true paradox, but by investigating them we may better understand the nature of the Bayes factor. I first explain why these paradoxes are indeed expected and how to resolve them, and then perform a quantitative analysis to investigate whether they are cause for concern at finite sample sizes.

Jeffreys-Lindley’s paradox neglects the fact that as n ↑ ∞, the p-value of

the LRT or the F test would also go to zero. It is impractical to assume that

the observed significance level remains unchanged as n grows. Nevertheless, it is

likely that when n is large, frequentists reject the null while Bayesians accept it.

But this is simply because the evidence is not strong enough, considering the large

sample size. This is not paradoxical at all. It just reveals the different properties

of the p-value and the Bayes factor.

The phenomenon of Bartlett's paradox is expected as well. By letting σ² or g go to infinity, we are actually assuming that the effect tends to be very large, which of course cannot be supported by the data. Hence the marginal likelihood of the alternative model, which is averaged over this noninformative prior, decreases and eventually becomes less than the marginal likelihood of the null model. However, Bartlett's paradox is a truly important observation: it reveals that there is no appropriate choice for σ² when we have no prior information! To overcome this problem, people often put a hyperprior on σ², which is referred to as the Bayesian random effect model (or Bayesian linear mixed model [Hobert and Casella, 1996]). This is also the standard approach in variable selection, which will be discussed in the next chapter.

The information paradox, which is truly undesirable, arises from the nature of the g-prior. Bayesians usually prefer a prior that is independent of the data, but using the g-prior with a fixed g violates this rule. When n is fixed, the g-prior is well-motivated: it is sometimes reasonable to assume that the prior precision matrix of β is proportional to the covariance matrix of the data, $X^tX$, since a larger variance can be thought of as "more information", and the g-prior is also computationally convenient. However, when n grows and g is fixed, the prior covariance of β becomes smaller and smaller, and eventually vanishes! Just as with Bartlett's paradox, one solution is to put a hyperprior on g (see Liang et al. [2008]).

Next, let us quantitatively characterize the behaviour of the Bayes factor, focusing on the independent normal prior.

Proposition 2.14. Assume BFnull has the expression given in (2.10) and let $\rho_i \stackrel{\text{def.}}{=} \tau(u_i^tX\beta)^2$, so that $Q_i \sim \chi^2_1(\rho_i)$. Then,

(a) under the null, $Q_i \stackrel{\text{i.i.d.}}{\sim} \chi^2_1$ and
$$E[\log\mathrm{BF}_{\text{null}}\mid\beta=0] = \frac{1}{2}\sum_{i=1}^{p}\left(\lambda_i + \log(1-\lambda_i)\right), \qquad E[\mathrm{BF}_{\text{null}}\mid\beta=0] = 1;$$

(b) under a fixed alternative, $Q_i \stackrel{\text{ind.}}{\sim} \chi^2_1(\rho_i)$ and
$$E[\log\mathrm{BF}_{\text{null}}\mid\beta] = \frac{1}{2}\sum_{i=1}^{p}\left(\lambda_i(1+\rho_i) + \log(1-\lambda_i)\right), \qquad E[\mathrm{BF}_{\text{null}}\mid\beta] = \prod_{i=1}^{p}\exp\!\left(\frac{\lambda_i\rho_i}{2(1-\lambda_i)}\right);$$


(c) under the alternative $\beta \sim \mathrm{MVN}(0, \tau^{-1}\sigma^2I)$, $(1-\lambda_i)Q_i \stackrel{\text{i.i.d.}}{\sim} \chi^2_1$ and
$$E[\log\mathrm{BF}_{\text{null}}\mid\sigma^2] = \frac{1}{2}\sum_{i=1}^{p}\left(\frac{\lambda_i}{1-\lambda_i} + \log(1-\lambda_i)\right), \qquad E[\mathrm{BF}_{\text{null}}\mid\sigma^2] = \begin{cases}\infty & \text{if } \lambda_1 \ge 0.5,\\[4pt] \displaystyle\prod_{i=1}^{p}\frac{1-\lambda_i}{\sqrt{1-2\lambda_i}} & \text{if } \lambda_1 < 0.5.\end{cases}$$

Proof. By the independence of $Q_1, \dots, Q_p$, it suffices to calculate the expectation of a single component, $\frac{1}{2}(\lambda_iQ_i + \log(1-\lambda_i))$ or $\exp[\frac{1}{2}(\lambda_iQ_i + \log(1-\lambda_i))]$. Parts (a) and (b) then follow from routine calculations. For part (c), notice that by Proposition 2.13 (c), $Q_i/(1+\sigma^2d_i^2) \sim \chi^2_1$ and $1 + \sigma^2d_i^2 = (1-\lambda_i)^{-1}$.

Remark 2.14.1. In fact, the expectation of the null-based Bayes factor (with proper priors) under the null model is always 1, regardless of the model at hand. This can be checked directly from the definition of the Bayes factor.

Remark 2.14.2. By Jensen's inequality, we always have $e^{E[\log\mathrm{BF}_{\text{null}}]} \le E[\mathrm{BF}_{\text{null}}]$. Since BFnull has a heavy-tailed distribution, the expected value of log BFnull actually provides much more insight into the behaviour of BFnull than the expected value of BFnull.

Remark 2.14.3. As λ₁ ↑ 1 (either the sample size n ↑ ∞ or σ² ↑ ∞), the expected value of log BFnull goes to −∞ under setting (a) and to ∞ under setting (c). Under setting (b), the limit depends on the growth rate of ρ₁. Suppose σ² is fixed, ρ₁ = O(n) and d₁² = O(n); then the expected value of log BFnull still goes to ∞, since log(1−λ₁) = O(log n).
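Remark 2.14.1 and Proposition 2.14 (a) can be checked by simulation. In the sketch below the weights are arbitrary but kept below 1/2, so that the Monte Carlo average of BFnull has finite variance (cf. part (c)):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.4, 0.3, 0.2])     # arbitrary weights, all < 1/2

q = rng.chisquare(df=1, size=(500_000, lam.size))          # null draws of Q_i
log_bf = 0.5 * (q @ lam + np.log(1 - lam).sum())           # (2.10)
print(np.exp(log_bf).mean())                               # ~1, Remark 2.14.1
print(log_bf.mean(), 0.5 * (lam + np.log(1 - lam)).sum())  # E[log BF] < 0
```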

A key message of the last remark is that the term log(1−λ₁) decreases quite slowly. In the usual case where the observations are assumed homogeneous, Q₁ grows at a faster rate as n increases, and thus the Bayes factor is consistent (the posterior probability of the alternative model goes to 1). In fact, log(1−λ₁) also decreases very slowly as σ increases. It suffices to consider p = 1, for which $2\log\mathrm{BF}_{\text{null}} = \lambda_1Q_1 + \log(1-\lambda_1)$ and
$$\frac{\partial\, 2\log\mathrm{BF}_{\text{null}}}{\partial\sigma^2} = \frac{Q_1d_1^2}{(\sigma^2d_1^2+1)^2} - \frac{d_1^2}{\sigma^2d_1^2+1}.$$
Therefore, when σ² ↓ 0, $2\log\mathrm{BF}_{\text{null}}$ grows at rate $(Q_1-1)d_1^2$; when σ² ↑ ∞, it decreases at a vanishing rate $O(\sigma^{-2})$.

2.2.4 Behaviour of the P-value Associated with the Bayes Factor

Now consider the behaviour of PBF. Under the null it asymptotically follows a uniform distribution, just like any other valid p-value. Its calibration for a finite sample size will be examined later by simulation (Chap. 2.3.4). More interesting is the power of PBF, and in particular how it differs from that of PLR.

The power of the test given a fixed β is
$$\mathrm{Power}(P_{\mathrm{BF}}) = P\!\left(\sum_{i=1}^{p}\lambda_iQ_i > C_\alpha\right), \qquad Q_i \sim \chi^2_1(\tau(u_i^tX\beta)^2), \tag{2.11}$$
where $C_\alpha$ is the critical value calculated from the null distribution, $\sum_{i=1}^{p}\lambda_i\chi^2_1$. For comparison, the power of PLR is given by
$$\mathrm{Power}(P_{\mathrm{LR}}) = P\!\left(\sum_{i=1}^{p}Q_i > \chi^2_{p,1-\alpha}\right), \qquad Q_i \sim \chi^2_1(\tau(u_i^tX\beta)^2), \tag{2.12}$$
where $\chi^2_{p,1-\alpha}$ denotes the (1−α) quantile of the $\chi^2_p$ distribution. It should be noted that in (2.11), λ₁, …, λ_p appear on both sides of the inequality inside the probability term; hence rescaling λ₁, …, λ_p does not affect the power. As a result, when p = 1, λ₁ can be dropped from both sides, and thus PBF and PLR have the same power.

Proposition 2.15. For simple linear regression, the power of PBF is equal to the

power of PLR.

When p = 2 and λ₁ ≠ λ₂, the two p-values (PBF and PLR) have different power. This difference is illustrated by the following example. Choose τ = 1 and let the thin singular value decomposition of X be
$$X = U\begin{pmatrix}10 & 0\\ 0 & 2\end{pmatrix}\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}.$$
Denote $\beta = (\beta_1, \beta_2)^t$. By Proposition 2.13, $Q_1 \sim \chi^2_1(100\beta_1^2)$ and $Q_2 \sim \chi^2_1(4\beta_2^2)$. Consider the independent normal prior with σ = 0.2, for which λ₁ = 0.8 and λ₂ ≈ 0.14. We then use Monte Carlo sampling (10,000 samples per β) to compute the power at α = 0.05. The result is shown in Figure 2.2. In the horizontal direction PBF is better, since to achieve a given power it requires a smaller value of |β₁| than PLR; in the vertical direction PLR is better. Note that both PLR and PBF have the largest power in direction (1, 0) (the first right-singular vector of X), because the data are most informative in that direction. However, this bias is exaggerated for PBF due to the weights λ₁, λ₂.
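The Monte Carlo power computation for this example can be sketched as follows (d = (10, 2), σ = 0.2, τ = 1, α = 0.05; the critical value $C_\alpha$ is itself obtained by simulation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d = np.array([10.0, 2.0])                  # singular values of X
lam = d**2 / (d**2 + 1 / 0.2**2)           # lambda_i, Prop. 2.13(b): (0.8, ~0.14)

# critical value of the null distribution sum_i lam_i * chi2_1 at alpha = 0.05
null_draws = rng.chisquare(df=1, size=(200_000, 2)) @ lam
c_alpha = np.quantile(null_draws, 0.95)

def power(beta, n_draws=10_000):
    """Power of P_BF and P_LR at a fixed beta, via (2.11) and (2.12)."""
    nc = (d * np.asarray(beta, dtype=float))**2   # noncentralities tau*d_i^2*(v_i^t beta)^2
    q = rng.noncentral_chisquare(df=1, nonc=nc, size=(n_draws, 2))
    pow_bf = np.mean(q @ lam > c_alpha)
    pow_lr = np.mean(q.sum(axis=1) > stats.chi2.ppf(0.95, df=2))
    return pow_bf, pow_lr
```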


Figure 2.2: Power comparison between PBF and PLR for p = 2. The singular values of the design matrix X are set to 10 and 2, and we use σ = 0.2. The red contours represent the power of PLR and the blue contours the power of PBF. We draw the contours at power = 0.1, 0.5, 0.8, 0.99.

2.2.5 More about Simple Linear Regression

For simple linear regression, it is possible to derive more quantitative results about the behaviour of BFnull and PBF. Recall that when p = 1,
$$2\log\mathrm{BF}_{\text{null}} \approx \lambda_1Q_1 + \log(1-\lambda_1),$$
where $Q_1 \approx 2\log\mathrm{LR}$. If Q₁ is fixed, we observe that BFnull is maximized at some value of λ₁, and the corresponding prior can be calculated analytically. Since the prior parameter V is now a 1 × 1 matrix, we may write $V = [\sigma^2]$.


Figure 2.3: How BFnull changes (on the log₁₀ scale) as σ ranges from 0.05 to 2, for n = 200, 500 and 1000. We assume X has unit variance and thus $X^tX = n$. Q₁ is fixed at 24, which corresponds to a p-value equal to 10⁻⁶.

Proposition 2.16. Consider simple linear regression. Assuming Q₁ is given, then
$$\max_{\sigma^2} 2\log\mathrm{BF}_{\text{null}} = Q_1 - 1 - \log Q_1, \qquad \arg\max_{\sigma^2}\mathrm{BF}_{\text{null}} = \frac{Q_1-1}{\sum_{i=1}^{n}x_i^2}.$$

Proof. By differentiating $\log\mathrm{BF}_{\text{null}}$ with respect to λ₁, we get
$$\arg\max_{\lambda_1}\mathrm{BF}_{\text{null}} = \frac{Q_1-1}{Q_1}.$$
Since $\lambda_1 = \sum x_i^2/(\sum x_i^2 + \sigma^{-2})$, we obtain the result.
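A quick numerical check of Proposition 2.16 (with an arbitrary Q₁ and Σxᵢ²):

```python
import numpy as np
from scipy.optimize import minimize_scalar

q1, sxx = 24.0, 500.0                # arbitrary Q_1 and sum of x_i^2

def neg_2log_bf(sigma2):
    lam1 = sxx / (sxx + 1.0 / sigma2)
    return -(lam1 * q1 + np.log(1 - lam1))

res = minimize_scalar(neg_2log_bf, bounds=(1e-6, 10.0), method="bounded")
print(res.x, (q1 - 1) / sxx)               # numeric vs closed-form argmax
print(-res.fun, q1 - 1 - np.log(q1))       # max of 2*log(BF_null)
```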

Figure 2.3 shows how BFnull changes as σ ranges from 0.05 to 2 with Q₁ = 24 (PBF ≈ 10⁻⁶).

Using this result, we can also quantify the numerical relationship between PBF (which equals PLR by Proposition 2.12) and BFnull. Statisticians have long been interested in this relationship because the Bayes factor can be used to calculate the posterior probability of the alternative model, which is often compared with the p-value. Certainly this numerical relationship depends on the model and the particular hypothesis testing method, but in most cases the p-value is numerically "more significant" than the Bayes factor; see, for example, Good [1992], Berger and Sellke [1987] and Sellke et al. [2001], among others.

Proposition 2.17. For simple linear regression, let P = PBF = PLR. For sufficiently large Q₁,
$$-\log P \approx \max_{\sigma^2}\log\mathrm{BF}_{\text{null}} + \log Q_1 + 0.73.$$

Proof. From the last proposition, we have $\max 2\log\mathrm{BF}_{\text{null}} = Q_1 - 1 - \log Q_1$, so we only need to establish the relationship between Q₁ and P. Since Q₁ is equal to the test statistic of the LRT, integration by parts gives
$$P = \int_{Q_1}^{\infty}\frac{1}{\sqrt{2\pi}}x^{-1/2}e^{-x/2}\,dx = \frac{2}{\sqrt{2\pi}}Q_1^{-1/2}e^{-Q_1/2} - \int_{Q_1}^{\infty}\frac{x^{-1}}{\sqrt{2\pi}}x^{-1/2}e^{-x/2}\,dx.$$
Since $1/x \le 1/Q_1$ for $x \ge Q_1$, we have
$$\frac{2}{\sqrt{2\pi}}Q_1^{-1/2}e^{-Q_1/2} - \frac{1}{Q_1}P \le P \le \frac{2}{\sqrt{2\pi}}Q_1^{-1/2}e^{-Q_1/2}.$$
Let $f_{\chi^2_1}$ be the density function of $\chi^2_1$. After rearrangement we obtain
$$\frac{2Q_1}{Q_1+1}f_{\chi^2_1}(Q_1) \le P \le 2f_{\chi^2_1}(Q_1).$$
Clearly, for sufficiently large Q₁, $P \approx 2f_{\chi^2_1}(Q_1)$. The result then follows by some algebra; the constant in the formula corresponds to
$$\frac{1}{2}(1 + \log\pi - \log 2) \approx 0.73.$$

Using only $P \le 2f_{\chi^2_1}(Q_1)$, we can derive the inequality
$$-\log P \ge \max_{\sigma^2}\log\mathrm{BF}_{\text{null}} + \log Q_1 + 0.73.$$
This implies that $P^{-1}$ is always greater than BFnull. Besides, in practice the value of σ², which should be chosen before testing, is usually not the "optimal" value that maximizes the Bayes factor. Hence a p-value of 10⁻⁷ often corresponds to a Bayes factor of around 10⁵.
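For Q₁ = 24, the approximation in Proposition 2.17 is easy to check numerically:

```python
import numpy as np
from scipy import stats

q1 = 24.0                                     # LRT statistic; P is about 1e-6
p_val = stats.chi2.sf(q1, df=1)
max_log_bf = 0.5 * (q1 - 1 - np.log(q1))      # Proposition 2.16
approx = max_log_bf + np.log(q1) + 0.73       # Proposition 2.17
print(-np.log(p_val), approx)                 # the two sides nearly agree
```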

2.3 Computation of the P-values Associated with Bayes Factors

By Theorem 2.7, we can calculate the asymptotic p-value for the Bayes factor. However, there are two challenges. First, since this p-value, PBF, is asymptotic, can we still trust it for a finite sample size? In Chap. 2.3.1 a correction method is introduced to improve the calibration of PBF when n is only moderate. Second, the numerical computation requires evaluating the distribution function of a linear combination of $\chi^2_1$ random variables, which has long been a difficult problem. We have implemented a new method, proposed by Bausch [2013], that has only polynomial complexity in the number of $\chi^2_1$ random variables.


2.3.1 Bartlett-type Correction

Denote by 2 log R the test statistic used to calculate PBF, which asymptotically has the same distribution as a weighted sum of independent $\chi^2_1$ random variables with weights λ₁, …, λ_p. For a moderate sample size, a correction to 2 log R can be developed to improve the calibration of PBF. We borrow the idea of the Bartlett-type correction to the LRT statistic, first noticed by Bartlett [1937] and later generalized by Box [1949] and Lawley [1956]. By Wilks' theorem, the likelihood ratio test statistic, denoted by Λ, converges weakly to a $\chi^2_p$-distributed random variable at rate o(1) [Wilks, 1938]. For a small sample size, the calibration of this p-value can be very poor. Suppose we have an estimator, $E_0[\Lambda]$, that estimates the expected value of Λ under the null with error as small as $O_p(n^{-3/2})$. The corrected test statistic, $p\Lambda/E_0[\Lambda]$, converges weakly to a $\chi^2_p$-distributed random variable at rate $O(n^{-2})$ under very general conditions [Bickel and Ghosh, 1990]. This strategy can be used to introduce a heuristic correction for 2 log R.

Consider the general model with confounding covariates; we work with this model because $q$ (the number of confounding covariates) has to appear in the correction term.

By Lemma 2.11, under the null, $2\log R$ can be expressed using independent chi-squared random variables:

$$2\log R = -n\log\Bigg(1-\frac{\sum_{i=1}^p \lambda_i Q_i}{Q_0+\sum_{i=1}^p Q_i}\Bigg) = n\log\Bigg(1+\frac{\sum_{i=1}^p \lambda_i Q_i}{Q_0+\sum_{i=1}^p (1-\lambda_i) Q_i}\Bigg),$$

where $Q_0 \sim \chi^2_{n-p-q}$ and $Q_i \sim \chi^2_1$ for $1 \le i \le p$.

Asymptotically, $2\log R$ has the same distribution as $\sum_i \lambda_i Q_i$, whose expectation is $\sum_i \lambda_i$. To apply a Bartlett-type correction, we need to find a higher-order


approximation for $E_0[2\log R]$. Define

$$A \stackrel{\text{def.}}{=} \sum_{i=1}^p \lambda_i Q_i, \qquad B \stackrel{\text{def.}}{=} \sum_{i=1}^p (1-\lambda_i) Q_i.$$

Note that $A$ and $B$ are not independent. By Taylor expansion,

$$2\log R = \frac{nA}{Q_0+B} - \frac{nA^2}{2(Q_0+B)^2} + \frac{nA^3}{3(Q_0+B)^3} + o_P(n^{-2}).$$

Since $B/Q_0 \xrightarrow{P} 0$, we can apply Taylor expansion again:

$$\frac{nA}{Q_0+B} = \frac{nA}{Q_0}\Big(1-\frac{B}{Q_0}+\frac{B^2}{Q_0^2}\Big) + o_P(n^{-2}),$$

$$\frac{nA^2}{2(Q_0+B)^2} = \frac{nA^2}{2Q_0^2}\Big(1-\frac{2B}{Q_0}\Big) + o_P(n^{-2}),$$

$$\frac{nA^3}{3(Q_0+B)^3} = \frac{nA^3}{3Q_0^3} + o_P(n^{-2}).$$

We group the terms according to their orders, and direct calculation gives

$$\gamma_1 \stackrel{\text{def.}}{=} E\Big[\frac{nA}{Q_0}\Big] = \frac{n\alpha_1}{\beta_1},$$

$$\gamma_2 \stackrel{\text{def.}}{=} E\Big[\frac{n}{Q_0^2}\Big(-AB-\frac{A^2}{2}\Big)\Big] = \frac{-n(2\alpha_3+\alpha_1\alpha_2)}{\beta_1\beta_2},$$

$$\gamma_3 \stackrel{\text{def.}}{=} E\Big[\frac{n}{Q_0^3}\Big(AB^2+A^2B+\frac{A^3}{3}\Big)\Big] = \frac{n}{\beta_1\beta_2\beta_3}\Big[\frac{8}{3}\alpha_5 + 2\alpha_1\alpha_4 + \frac{1}{3}\alpha_1^3 + (4+p)\alpha_1\alpha_6 + (8+2p)\alpha_7\Big],$$

where, by abuse of notation,

$$\beta_i = n-p-q-2i, \quad \alpha_1=\sum_{i=1}^p \lambda_i, \quad \alpha_2=\sum_{i=1}^p \Big(1-\frac{\lambda_i}{2}\Big), \quad \alpha_3=\sum_{i=1}^p \lambda_i\Big(1-\frac{\lambda_i}{2}\Big),$$

$$\alpha_4=\sum_{i=1}^p \lambda_i^2, \quad \alpha_5=\sum_{i=1}^p \lambda_i^3, \quad \alpha_6=\sum_{i=1}^p (1-\lambda_i), \quad \alpha_7=\sum_{i=1}^p \lambda_i(1-\lambda_i). \qquad (2.13)$$


Combining them, we have, for $k = 1, 2, 3$,

$$E_0[2\log R] = \sum_{i=1}^k \gamma_i + o(n^{-k+1}). \qquad (2.14)$$

Hence, $E_0[2\log R]$ can be estimated by

$$E^{(k)}[2\log R] \stackrel{\text{def.}}{=} \sum_{i=1}^k \gamma_i.$$

In addition to $2\log R$, we have now obtained three corrected test statistics, $\alpha_1(2\log R)/E^{(k)}[2\log R]$ for $k = 1, 2, 3$. Similarly, the LRT statistic could be corrected using

$$E[2\log \mathrm{LR}] = \frac{np}{\beta_1} - \frac{np(p+2)}{2\beta_1\beta_2} + \frac{np(p+2)(p+4)}{3\beta_1\beta_2\beta_3} + o(n^{-2}), \qquad (2.15)$$

where $\beta_1, \beta_2, \beta_3$ are as defined in (2.13).
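The correction terms in (2.13)-(2.14) are simple arithmetic in the weights, so they can be sanity-checked numerically. Below is a minimal sketch (the function names `bartlett_gammas` and `corrected_statistic` are ours for illustration, not from the BACH software); note that setting all $\lambda_i = 1$ should reduce (2.14) to the LRT expansion (2.15), which gives a convenient consistency check.

```python
import numpy as np

def bartlett_gammas(lam, n, q):
    """Correction terms gamma_1..gamma_3 from Eq. (2.13)-(2.14), as a sketch.

    lam : weights lambda_1..lambda_p of the chi^2_1 variables
    n   : sample size; q : number of confounding covariates
    """
    lam = np.asarray(lam, dtype=float)
    p = lam.size
    beta = np.array([n - p - q - 2 * i for i in (1, 2, 3)], dtype=float)
    a1 = lam.sum()
    a2 = (1 - lam / 2).sum()
    a3 = (lam * (1 - lam / 2)).sum()
    a4 = (lam ** 2).sum()
    a5 = (lam ** 3).sum()
    a6 = (1 - lam).sum()
    a7 = (lam * (1 - lam)).sum()
    g1 = n * a1 / beta[0]
    g2 = -n * (2 * a3 + a1 * a2) / (beta[0] * beta[1])
    g3 = n * (8 / 3 * a5 + 2 * a1 * a4 + a1 ** 3 / 3
              + (4 + p) * a1 * a6 + (8 + 2 * p) * a7) / beta.prod()
    return g1, g2, g3

def corrected_statistic(two_log_R, lam, n, q, k=3):
    """alpha_1 * (2 log R) / E^{(k)}[2 log R], cf. Chap. 2.3.1."""
    g = bartlett_gammas(lam, n, q)
    return np.sum(lam) * two_log_R / sum(g[:k])
```

With $\lambda_i \equiv 1$, the three returned terms coincide with the three terms of (2.15), as expected, since the weighted sum then reduces to a $\chi^2_p$ variable.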

2.3.2 Bausch’s Method

The currently most popular method is Davies' method, which relies on numerical inversion of the characteristic function; see Chap. 7.4.1 for a brief introduction. It is convenient and not difficult to implement, but it involves the numerical integration of a highly oscillatory integrand, which might produce only limited accuracy [Bausch, 2013].

Bausch [2013] proposed to calculate the distribution function of a linear combination of independent $\chi^2_1$ random variables by Taylor expansion. His method has only (at most) polynomial complexity in the number of $\chi^2_1$ random variables. Furthermore, the error bound can be explicitly evaluated and arbitrary accuracy can be obtained. This section gives a brief introduction to his algorithm; the next section describes in detail how we implemented it in C++.


Let $X_1, \dots, X_p$ be i.i.d. $\chi^2_1$ random variables. We are interested in the distribution function of the weighted sum

$$Q = \sum_{i=1}^p \lambda_i X_i, \qquad (2.16)$$

where the weights satisfy $\lambda_1 \ge \cdots \ge \lambda_p > 0$. For the time being, assume $p$ is even, so that we may rewrite (2.16) as

$$Q = \sum_{k=1}^{p/2} Y_k, \qquad Y_k \stackrel{\text{def.}}{=} \lambda_{2k-1}X_{2k-1} + \lambda_{2k}X_{2k}.$$

Bausch noticed that, by Kummer's second transformation [Abramowitz and Stegun, 1964, Chap. 13],

$$f_{Y_1}(y) = \frac{1}{(4\lambda_1\lambda_2)^{1/2}} \exp\Big(-\frac{\lambda_1+\lambda_2}{4\lambda_1\lambda_2}\,y\Big)\, I_0\Big(\frac{\lambda_1-\lambda_2}{4\lambda_1\lambda_2}\,y\Big),$$

where $I_0$ is the modified Bessel function of the first kind of order 0 [Abramowitz and Stegun, 1964, Chap. 9]. The function $I_0$ can be computed via Taylor expansion:

$$I_0(x) = \sum_{i=0}^\infty \frac{(x/2)^{2i}}{(i!)^2}.$$

Hence, we may write

$$I_0\Big(\frac{\lambda_{2k-1}-\lambda_{2k}}{4\lambda_{2k-1}\lambda_{2k}}\,y\Big) = T_k(y) + R_k(y),$$

where $T_k(y)$ is a finite power series and $R_k(y)$ is the corresponding remainder term. Multiplying both $T_k$ and $R_k$ by the constant and the exponential term, we express $f_{Y_k}$ as

$$f_{Y_k}(y) = \tilde{T}_k(y) + \tilde{R}_k(y),$$

$$\tilde{T}_k(y) = \frac{1}{(4\lambda_{2k-1}\lambda_{2k})^{1/2}} \exp\Big(-\frac{\lambda_{2k-1}+\lambda_{2k}}{4\lambda_{2k-1}\lambda_{2k}}\,y\Big)\, T_k(y),$$

$$\tilde{R}_k(y) = \frac{1}{(4\lambda_{2k-1}\lambda_{2k})^{1/2}} \exp\Big(-\frac{\lambda_{2k-1}+\lambda_{2k}}{4\lambda_{2k-1}\lambda_{2k}}\,y\Big)\, R_k(y).$$

A key observation made by Bausch is that the convolution of functions of the form $e^{-\beta x}x^n$ is a "closed" operation:

$$\int_0^c e^{-\beta x}x^n\,dx = \frac{n!}{\beta^{n+1}}\Big[1 - e^{-\beta c}\sum_{i=0}^n \frac{(\beta c)^i}{i!}\Big].$$

The integrand and the integral have the same form and the same highest degree. Therefore, we can perform the convolution of $\tilde{T}_1(y), \dots, \tilde{T}_{p/2}(y)$ algebraically. Such algebraic operations can be coded without much difficulty using a computer algebra system such as Mathematica or Maple; however, we coded in C++ and hence had to define our own math objects. The complexity of this algorithm is polynomial in $p$.
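The closed-form integral above is easy to verify against numerical quadrature; a small sketch with assumed values of $\beta$, $n$ and $c$:

```python
import math
from scipy.integrate import quad

def expoly_integral(beta, n, c):
    """Closed form of the convolution building block: int_0^c e^{-bx} x^n dx."""
    partial = sum((beta * c) ** i / math.factorial(i) for i in range(n + 1))
    return math.factorial(n) / beta ** (n + 1) * (1.0 - math.exp(-beta * c) * partial)

beta, n, c = 0.8, 5, 3.0  # assumed example values
numeric, _ = quad(lambda x: math.exp(-beta * x) * x ** n, 0.0, c)
closed = expoly_integral(beta, n, c)
```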

To make the algorithm useful in practice, we need to find an error bound. Let $*$ denote convolution. By the associativity of convolution,

$$P(Q \le c) = \int_0^c (\tilde{T}_1+\tilde{R}_1) * \cdots * (\tilde{T}_{p/2}+\tilde{R}_{p/2}) = \int_0^c \sum_{n_k \in \{0,1\}} \big(\tilde{T}_1^{\,n_1}\tilde{R}_1^{\,1-n_1}\big) * \cdots * \big(\tilde{T}_{p/2}^{\,n_{p/2}}\tilde{R}_{p/2}^{\,1-n_{p/2}}\big).$$

By Young’s inequality for convolutions [Hardy et al., 1952, Chap. 4],

||(T n1

1 · R1−n11

)∗ · · · ∗

(Tnp/2

1 · R1−np/2

1

)||1 ≤

p/2∏k=1

||T nkk · R

1−nkk ||1,


where $\|\cdot\|_1$ denotes the $\ell_1$-norm. If $p = 4$, then we have

$$P(Q \le c) = \int_0^c (\tilde{T}_1+\tilde{R}_1)*(\tilde{T}_2+\tilde{R}_2) = \int_0^c \big(\tilde{T}_1*\tilde{T}_2 + \tilde{T}_1*\tilde{R}_2 + \tilde{R}_1*\tilde{T}_2 + \tilde{R}_1*\tilde{R}_2\big)$$

$$\le \Big(\int_0^c \tilde{T}_1*\tilde{T}_2\Big) + \int_0^c \tilde{T}_1 \int_0^c \tilde{R}_2 + \int_0^c \tilde{R}_1 \int_0^c \tilde{T}_2 + \int_0^c \tilde{R}_1 \int_0^c \tilde{R}_2$$

$$\le \Big(\int_0^c \tilde{T}_1*\tilde{T}_2\Big) + \int_0^c \tilde{R}_2 + \int_0^c \tilde{R}_1 + \int_0^c \tilde{R}_1 \int_0^c \tilde{R}_2$$

$$\le \Big(\int_0^c \tilde{T}_1*\tilde{T}_2\Big) + \Big(1+\int_0^c \tilde{R}_1\Big)\Big(1+\int_0^c \tilde{R}_2\Big) - 1.$$

The second-to-last inequality follows from the fact that $\|\tilde{T}_k\|_1 < 1$, since $f_{Y_k}$ is a probability density function. This result is easily generalized to any even $p$:

$$P(Q \le c) \le \Big(\int_0^c \tilde{T}_1 * \cdots * \tilde{T}_{p/2}\Big) + \prod_{k=1}^{p/2}\Big(1+\int_0^c \tilde{R}_k(y)\,dy\Big) - 1. \qquad (2.17)$$

Hence we may calculate $P(Q \le c)$ by

$$P(Q \le c) \approx \int_0^c \tilde{T}_1 * \cdots * \tilde{T}_{p/2},$$

with error bound

$$\prod_{k=1}^{p/2}\Big(1+\int_0^c \tilde{R}_k(y)\,dy\Big) - 1.$$

Note that in the derivation of (2.17) we actually do not have to use the fact that $\|\tilde{T}_k\|_1 < 1$, in which case the error bound would be

$$\prod_{k=1}^{p/2}\Big(\int_0^c \tilde{T}_k(y)\,dy + \int_0^c \tilde{R}_k(y)\,dy\Big) - \prod_{k=1}^{p/2}\int_0^c \tilde{T}_k(y)\,dy.$$


We might also use the fact that $\int_0^c \tilde{T}_k < F_{Y_k}(c)$ to derive another error bound,

$$\prod_{k=1}^{p/2}\Big(F_{Y_k}(c) + \int_0^c \tilde{R}_k(y)\,dy\Big) - \prod_{k=1}^{p/2} F_{Y_k}(c). \qquad (2.18)$$

When the number of $\chi^2_1$ random variables is odd, we only need to first calculate the distribution function of the weighted sum of $p-1$ of the $\chi^2_1$ random variables and then perform one numerical integration.

2.3.3 Implementation of Bausch’s Method

We implemented Bausch’s method in C++ to gain maximum speed. The source

code and executables of our program BACH (Bausch’s Algorithm for CHi-square

weighted sum) are freely available at http://haplotype.org. We used the GNU

Multiple Precision Arithmetic Library (GMP) so that we can use arbitrary-precision

floating-point numbers to accurately calculate extremely small p-values. By de-

fault, the floating-point numbers in our program have 76 effective digits.

We now lay out the implementation details. Our goal is to use as few Taylor expansion terms as possible to achieve a desired precision. First, we sort the weights so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. If $p$ is odd, the smallest weight is held out for the final numerical integration, which is simply done by a weighted sampling scheme. The numerical integration almost never introduces additional noticeable error to the p-value, owing to its high precision and the ordering of the weights. The main reason for this ordering, nevertheless, is that we use Taylor expansion to approximate

$$I_0\Big(\frac{\lambda_{2k-1}-\lambda_{2k}}{4\lambda_{2k-1}\lambda_{2k}}\,y\Big)$$


and thus we want to make $(\lambda_{2k-1}-\lambda_{2k})/(4\lambda_{2k-1}\lambda_{2k})$ as small as possible ($I_0(x)$ grows fast as $x$ increases). In fact, if we do not order the weights, we are more likely to encounter the integration of $e^{-\beta x}x^n$ with extremely small $\beta$ in the convolution of $\tilde{T}_1, \dots, \tilde{T}_{p/2}$, which would also make the algorithm numerically unstable. Now we explain how to determine the required Taylor expansion degrees. By Taylor's theorem, we can control

$$\tilde{R}_k(y) \le \delta\big(\tilde{T}_k(y)+\tilde{R}_k(y)\big), \qquad \forall\, 0 < y < c.$$

(The method will be given later.) Then,

$$F_{Y_k}(c) = P(Y_k < c) = \int_0^c \big[\tilde{T}_k(y)+\tilde{R}_k(y)\big]\,dy \ge \frac{1}{\delta}\int_0^c \tilde{R}_k(y)\,dy.$$

By (2.18), the error bound is given by

$$\mathrm{Err}(c) \stackrel{\text{def.}}{=} P(Q \le c) - \Big(\int_0^c \tilde{T}_1 * \cdots * \tilde{T}_{p/2}\Big) \le \prod_{k=1}^{p/2}\Big(F_{Y_k}(c) + \int_0^c \tilde{R}_k(y)\,dy\Big) - \prod_{k=1}^{p/2} F_{Y_k}(c)$$

$$\le \prod_{k=1}^{p/2}\big(F_{Y_k}(c) + \delta F_{Y_k}(c)\big) - \prod_{k=1}^{p/2} F_{Y_k}(c) = \big[(1+\delta)^{p/2}-1\big]\prod_{k=1}^{p/2} F_{Y_k}(c) \approx \frac{p\delta}{2}\prod_{k=1}^{p/2} F_{Y_k}(c).$$

Hence, to control the error, we only need to choose $\delta$ such that

$$\delta \ge \frac{2\,\mathrm{Err}(c)}{p}\Bigg(\prod_{k=1}^{p/2} F_{Y_k}(c)\Bigg)^{-1}.$$


Since we usually care more about the relative error than the absolute error, we first calculate a lower bound for the p-value by the method described in Chap. 7.4.2 (Eq. (7.24)) and then use it to determine the value of $\mathrm{Err}(c)$.

In practice, small p-values are of most interest. For example, in GWAS the significance threshold is typically $5\times 10^{-8}$ after correction for multiple comparisons. Here we describe a trick that makes the calculation of extremely small p-values more efficient. Note that the tail probability of $Q$ should have the form

$$P(Q > c) = \sum_{k=1}^{p/2} \exp(-A_k c)\,G_k(c),$$

where $A_k \in \mathbb{R}$ and $G_k(c)$ is an infinite power series. However, as we omit $\tilde{R}_k(y)$ in the Taylor expansion of the modified Bessel functions, we end up with

$$P(Q > c) \approx \varepsilon + \sum_{k=1}^{p/2} \exp(-A_k c)\,G'_k(c),$$

where $G'_k(c)$ is a finite power series and $\varepsilon \ne 0$. We might call $\varepsilon$ the limiting error since, as $c \to \infty$ (when the p-value should vanish), it is the error of the p-value computed using our algorithm. In our implementation, we simply neglect $\varepsilon$ when $c$ is large. Note that we cannot always neglect $\varepsilon$, since its existence offsets the error introduced by the Taylor expansion. From our tests, the extremely small p-values obtained by omitting $\varepsilon$ appear to be very accurate. To see the reason, note that when $c$ is large, the amount we have omitted from the power series, $G_k(c) - G'_k(c)$, has very little influence on the p-value, since the order of magnitude of $\exp(-A_k c)G_k(c)$ is dominated by $e^{-A_k c}$.

In theory, our algorithm may fail to produce a correct or highly accurate p-value. For instance, the maximum degree of the Taylor expansion of $I_0$ is set to 160 in our software (it can be modified by the user), but when $c$ is large, it


might not be enough to produce a desired relative error bound. Another possible scenario is that the p-value is so small that we have to discard $\varepsilon$, as described in the last paragraph. In such cases, we calculate the lower and the upper bounds of the p-value by (7.25) in Chap. 7.4.2. These bounds can always be quickly and exactly evaluated. When $p$ is large or the p-value is very small, these bounds already meet most practical needs. If our p-value fails to be within the bounds, which never occurred in our simulations or tests, we use the bounds to estimate the p-value. Otherwise we trust our p-value but report the error bound obtained by comparing the p-value with the bounds.

2.3.4 Calibration of the P-values

Using our asymptotic results, we can evaluate extremely small p-values for Bayes factors, which is an important advantage in applications such as GWAS, compared with the permutation method described in Servin and Stephens [2007]. However, our $P_{BF}$ is an asymptotic p-value, and hence its calibration for moderate sample sizes needs to be examined. Since the LRT is one of the most widely used asymptotic tests, the calibration of $P_{BF}$ is compared with that of $P_{LR}$.

A GWAS dataset (IOP) is used for simulation; the details of the IOP dataset are given in Chap. 7.7.1. The sample size $n$ is chosen to be 100, 300 or 1000, and the number of covariates $p$ is set to 10 or 20. For every combination of $n$ and $p$, a subset of genotypes of $n$ individuals and $p$ SNPs is randomly sampled, and $y$ is simulated under the null model, $y \sim \mathrm{MVN}(0, I_n)$. Then $P_{LR}$ and $P_{BF}$ are computed ($\sigma = 0.2$). This step is repeated $10^7$ times. Fig. 2.4 and Fig. 2.5 show that $P_{BF}$ is well calibrated, and its calibration is usually better than that of $P_{LR}$ in the tail. The performance of the corrected test statistics is also investigated. For both tests, a third-order approximation to the expected value of the test statistic under


Figure 2.4: Calibration of $P_{BF}$ and $P_{LR}$ for $p = 10$. The red dots represent $P_{LR}$ from the likelihood ratio test and the blue dots represent $P_{BF}$. The grey region indicates a 95% confidence band, calculated using the fact that an order statistic from a uniform distribution follows a Beta distribution.

the null (see Eq. (2.14) and (2.15)) is used. Simulations show that the Bartlett-type correction improves the calibration of the p-values substantially on the linear scale but not much on the logarithmic scale. Overall, it can be concluded that $P_{BF}$ is well calibrated when the sample size is more than a few hundred. Furthermore, as far as tail calibration is concerned, the uncorrected test statistic can simply be used for computing $P_{BF}$, while for $P_{LR}$ the correction seems important for small sample sizes.


Figure 2.5: Calibration of $P_{BF}$ and $P_{LR}$ for $p = 20$. The red dots represent $P_{LR}$ from the likelihood ratio test and the blue dots represent $P_{BF}$. The grey region indicates a 95% confidence band, calculated using the fact that an order statistic from a uniform distribution follows a Beta distribution.


Chapter 3

A Novel Algorithm for Computing Ridge Estimators

This chapter introduces a novel algorithm for computing ridge regression estimators, which is a critical step in the calculation of the Bayes factors given in Chap. 2. This algorithm can be applied to Bayesian variable selection and substantially speeds up MCMC sampling.

3.1 Background

Consider the linear regression model with response variable $y = (y_1, \dots, y_n)$ and an $n\times p$ design matrix $X$,

$$y = X\beta + \epsilon.$$

The errors $\epsilon_1, \dots, \epsilon_n$ are i.i.d. with expectation 0 and variance $\tau^{-1}$. The ridge regression estimator for the coefficient vector $\beta$ is obtained via a type of


Tikhonov regularization [Tikhonov and Arsenin, 1977],

$$\beta_R(c) = \arg\min_\beta \|y - X\beta\|^2 + c\|\beta\|^2 = (X^tX + cI)^{-1}X^ty, \qquad (3.1)$$

where $\|\cdot\|$ denotes the $\ell_2$ norm. Compared with the ordinary least squares estimator (denoted by $\beta_{LS}$) given in Proposition 7.14, the only difference in the objective function is the penalty term $c\|\beta\|^2$. The constant $c \ge 0$ is usually referred to as the regularization or shrinkage parameter, since it forces $\beta$ to be closer to 0. If $c = 0$, the problem reduces to least-squares fitting. Initially, ridge regression was proposed for cases where $X^tX$ is ill-conditioned and $\beta_{LS}$ is unstable due to large variance, or even cannot be computed numerically [Hoerl and Kennard, 1970a,b]. The most important advantage of the ridge estimator can be explained by the following decomposition of the mean squared error (MSE):

$$E\big[(\hat{y}-y)^2\big] = (E[\hat{y}]-y)^2 + \mathrm{Var}(\hat{y}) + \tau^{-1}.$$

The term, E[y] − y, is called the bias. When βLS is used, the bias is clearly

zero since βLS is the best linear unbiased estimator (BLUE) by Gauss Markov

theorem [Plackett, 1950]. However, the MSE for βLS is not necessarily small due

to the variance term. The ridge estimator, assuming c > 0, on the contrary is

always biased but the variance might be small if c is chosen appropriately. In fact,

there always exists a c such that the ridge estimator βR(c) attains a smaller mean

squared error than βLS [Hoerl and Kennard, 1970a], which is sometimes known

as the bias-variance tradeoff.
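The bias-variance tradeoff can be made concrete with the exact MSE of the coefficient estimates: writing $W = (X^tX + cI)^{-1}$, the variance term is $\tau^{-1}\mathrm{tr}(WX^tXW)$ and the squared bias is $\|(WX^tX - I)\beta\|^2$, with $c = 0$ recovering OLS. A sketch on a deliberately ill-conditioned design (all sizes and values are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
base = rng.standard_normal((n, 1))
X = base + 0.01 * rng.standard_normal((n, p))  # nearly collinear columns
beta = np.ones(p)          # true coefficients (assumed)
tau_inv = 1.0              # error variance

G = X.T @ X

def coef_mse(c):
    """Exact MSE of the ridge estimator beta_R(c); c = 0 gives OLS."""
    W = np.linalg.inv(G + c * np.eye(p))
    variance = tau_inv * np.trace(W @ G @ W)
    bias = (W @ G - np.eye(p)) @ beta
    return variance + bias @ bias

mse_ols = coef_mse(0.0)    # huge variance, zero bias
mse_ridge = coef_mse(1.0)  # small bias, far smaller variance
```

Here the near-collinearity makes $X^tX$ ill-conditioned, so the OLS variance dominates and the ridge estimator with $c = 1$ attains a much smaller MSE.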

Though ridge regression is a non-Bayesian method, the ridge estimator plays a fundamental role in Bayesian linear regression. Consider the Bayesian linear


regression model defined in (1.1) with the independent normal prior $V = \sigma^2 I$. First, $\beta_R$ is the maximum a posteriori (MAP) estimator (see Eq. (1.2)) and the posterior mean, with shrinkage parameter $c = \sigma^{-2}$. Second, the calculation of $\beta_R$ is the "rate-determining" step in computing the Bayes factor (see Eq. (1.3)), which is central to a Bayesian variable selection procedure. When the sample space is extremely large, the efficiency of an MCMC sampling procedure for Bayesian variable selection hinges largely on whether the ridge estimators can be computed quickly. See Ishwaran and Rao [2005] for a discussion of the relationship between ridge regression and Bayesian variable selection.

3.2 Direct Methods for Computing Ridge Estimators

The generalized ridge estimator [Draper and Van Nostrand, 1979], denoted by $\beta$ henceforth in this chapter (the subscript R dropped for simplicity), is obtained by solving

$$(X^tX + \Sigma)\beta = z, \qquad (3.2)$$

where $z = X^ty$ and $\Sigma$ is a diagonal matrix with nonnegative diagonal entries. Clearly the ridge estimator defined in (3.1) is a special case with $\Sigma = \sigma^{-2}I$. The length of $\beta$ is still denoted by $p$. Define

$$A \stackrel{\text{def.}}{=} X^tX + \Sigma, \qquad (3.3)$$


which is assumed to be invertible henceforth. The methods for solving (3.2) can be divided into two groups: direct methods and iterative methods. In this section, I first introduce the direct methods. Each of them has advantages under certain circumstances: some exploit the structure of $A$, while others are simply standard methods for solving systems of linear equations. Since in our main application (variable selection for GWAS) the matrix $X$ is usually an $n\times p$ matrix with $n \gg p$, methods designed for rank-deficient $X^tX$ are not discussed.

3.2.1 Spectral Decomposition of $X^tX$

Since $X^tX$ is positive semi-definite, it admits the spectral decomposition (Chap. 7.1.3)

$$X^tX = U\Lambda U^t.$$

If $\Sigma = \sigma^{-2}I$, we then have

$$X^tX + \Sigma = U(\Lambda + \sigma^{-2}I)U^t.$$

Therefore, we can obtain the inverse of $A$ and compute $\beta$ by

$$\beta = U(\Lambda + \sigma^{-2}I)^{-1}U^tz.$$

One advantage of this method is that, if we want to evaluate $\beta$ for many different values of $\sigma$, we only need to perform the spectral decomposition once, and every new evaluation of $\beta$ costs $\sim 4p^2$ flops (floating-point operations). But later we will see there is a better method for this purpose. Second, if we are computing the Bayes factor defined in (1.3) with the independent normal prior, we can also


analytically evaluate its null distribution, since we have actually obtained the singular values of $X$. Moreover, if we obtain this spectral decomposition via the singular value decomposition (Chap. 7.1.2) of $X$, the behaviour of the Bayes factor under the alternatives can be quantified too. This is a unique advantage of this method.

Unfortunately, this method is probably the slowest. The standard approach to computing the spectral decomposition of $X^tX$ is to perform the SVD of either $X$ or $X^tX$. For a square matrix, the SVD has time complexity cubic in $p$. The exact flop count cannot be pinned down, since it depends on both the algorithm and the desired accuracy; for example, for an $n\times p$ matrix, the Golub-Kahan algorithm needs about $4np^2 - 4p^3/3 + O(p^2)$ flops [Trefethen and Bau III, 1997, Lec. 31].
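A sketch of the spectral-decomposition approach (illustrative sizes assumed): the eigendecomposition of $X^tX$ is computed once, after which each new value of $\sigma$ costs only a few matrix-vector products.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
z = X.T @ y

# one-time spectral decomposition of X'X (eigh, since X'X is symmetric PSD)
evals, U = np.linalg.eigh(X.T @ X)
Uz = U.T @ z

def ridge_spectral(sigma):
    """beta = U (Lambda + sigma^{-2} I)^{-1} U' z; ~4p^2 flops per new sigma."""
    return U @ (Uz / (evals + sigma ** -2))

beta = ridge_spectral(0.2)
direct = np.linalg.solve(X.T @ X + 0.2 ** -2 * np.eye(p), z)
```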

3.2.2 Cholesky Decomposition of $X^tX + \Sigma$

The Cholesky decomposition [Trefethen and Bau III, 1997, Chap. IV] is another standard way to solve systems of linear equations. Since $A$ is positive definite, it can be decomposed into

$$A = LL^t,$$

where $L$ is a lower triangular real matrix. Once we obtain $L$, we can quickly compute $L^t\beta$ by forward substitution and then $\beta$ by backward substitution; both substitutions require $\sim p^2$ flops. The Cholesky decomposition, though it has cubic time complexity, is much faster than the SVD and needs only $\sim p^3/3$ flops [Trefethen and Bau III, 1997, Lec. 23]. It is the fastest of the four methods introduced in this section.
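A minimal sketch of the Cholesky route using SciPy (illustrative sizes assumed; `cho_factor`/`cho_solve` perform the factorization and the two triangular substitutions):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(2)
n, p = 200, 20
X = rng.standard_normal((n, p))
z = X.T @ rng.standard_normal(n)
A = X.T @ X + 25.0 * np.eye(p)   # Sigma = sigma^{-2} I with sigma = 0.2

factor = cho_factor(A)           # A = L L', ~p^3/3 flops
beta = cho_solve(factor, z)      # forward + backward substitution, ~2p^2 flops
```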


3.2.3 QR Decomposition of the Block Matrix $[X^t\ \ \Sigma^{1/2}]^t$

Any $n\times p$ real matrix can be factorized into an $n\times n$ orthogonal matrix times an $n\times p$ upper triangular matrix; this is called the QR decomposition. The QR decomposition is slower than the Cholesky decomposition but faster than the SVD. Consider the QR decomposition of the block matrix $[X^t\ \ \Sigma^{1/2}]^t$,

$$\begin{bmatrix} X \\ \Sigma^{1/2} \end{bmatrix} = QR, \qquad (3.4)$$

where $Q$ is $(n+p)\times p$ and $R$ is $p\times p$. Notice that

$$\begin{bmatrix} X^t & \Sigma^{1/2} \end{bmatrix}\begin{bmatrix} X \\ \Sigma^{1/2} \end{bmatrix} = R^tQ^tQR = R^tR = X^tX + \Sigma = A.$$

Since $R$ is upper triangular, we can now compute $\beta$ by one forward substitution and one backward substitution, just as with the Cholesky decomposition. The flop count for this QR decomposition is $\sim 2np^2 - 4p^3/3$ by Householder transformations [Trefethen and Bau III, 1997, Lec. 10]. This is acceptable even compared with the Cholesky decomposition, since we do not need to compute $X^tX$, which itself requires $\sim np^2$ flops. In practice, however, $X^tX$ (or part of it) can usually be precomputed and stored in memory, and the update of $X^tX$ can be performed very efficiently.

In some situations, this method is the most advantageous. Consider a variable selection procedure with a fixed regularization parameter, i.e., $\Sigma$ is a diagonal matrix with a constant scaling factor. At every step, we may add or delete a column of $X$, and $\beta$ needs to be recomputed. There is no easy way to update the Cholesky decomposition of $A$, so it must be recomputed at every step.


However, the QR decomposition of the block matrix in (3.4) is easy to update. For example, the removal of the $k$-th column from $X$ corresponds to the removal of the $k$-th column and the $(n+k)$-th row of the matrix $[X^t\ \ \Sigma^{1/2}]^t$. The new QR decomposition can be computed very efficiently using Givens rotations to introduce zeros. See Golub and Van Loan [2012, Chap. 12.5] for more details.
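A sketch of the block-matrix QR route (illustrative sizes and $\Sigma$ assumed): the R factor of the stacked matrix satisfies $R^tR = X^tX + \Sigma$, so $\beta$ is obtained from two triangular solves without ever forming the Gram matrix.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(3)
n, p = 100, 8
X = rng.standard_normal((n, p))
z = X.T @ rng.standard_normal(n)
Sigma_half = np.sqrt(0.5) * np.eye(p)     # Sigma = 0.5 I (assumed)

# QR of the (n+p) x p stacked matrix; its R factor satisfies R'R = X'X + Sigma
_, R = np.linalg.qr(np.vstack([X, Sigma_half]), mode="reduced")
w = solve_triangular(R.T, z, lower=True)  # forward substitution
beta = solve_triangular(R, w)             # backward substitution
```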

3.2.4 Bidiagonalization Methods

Numerically, the singular value decomposition is often computed via a two-stage algorithm. The first stage is called bidiagonalization, which is very similar to the SVD except that the middle factor of the decomposition is an upper bidiagonal matrix instead of a diagonal one. This stage can be completed in a finite number of operations. The second stage is an iterative procedure that finds all the singular values. Though each iteration is fast, a very large number of iterations may be needed if the distribution of the singular values is extreme or very high accuracy is required. Elden [1977] noticed that the second, iterative stage can be avoided. He developed an algorithm for computing $\beta$ using only the bidiagonalization, whose operation count is of the same order of magnitude as that of the SVD-based algorithm (the first method introduced in this section). Therefore, this method might be preferred when we have a fixed $X$ but want to compute $\beta$ for many different choices of $\Sigma$. Nevertheless, the bidiagonalization is still very slow: it requires $\sim 4np^2 - 4p^3/3$ flops by the Golub-Kahan algorithm and $\sim 2np^2 + 2p^3$ flops by the Lawson-Hanson-Chan algorithm [Trefethen and Bau III, 1997, Lec. 31].


3.3 Iterative Methods for Computing Ridge Estimators

Iterative methods, in contrast to direct methods, produce a sequence of approximate solutions $\beta^{(1)}, \dots, \beta^{(k)}, \dots$ such that $\lim_{k\to\infty}\beta^{(k)} = \beta$ under some conditions. If the convergence is quick, iterative methods are much more efficient than direct methods, since each iteration usually has time complexity $O(p^2)$.

3.3.1 Jacobi, Gauss-Seidel and Successive Over-Relaxation

Recall that our objective is to solve $A\beta = z$. Suppose we have the following decomposition of $A$:

$$A = M + N,$$

where $M$ is invertible. Then,

$$M\beta = -N\beta + z \iff \beta = M^{-1}(-N\beta + z). \qquad (3.5)$$

This relationship inspires us to compute $\beta$ by an iterative procedure. We start from an initial guess $\beta^{(0)}$ and in each iteration improve the guess by

$$\beta^{(k+1)} = M^{-1}(-N\beta^{(k)} + z). \qquad (3.6)$$

Of course, this approach does not necessarily work. It is possible that $\beta^{(k)}$ becomes worse and worse and eventually diverges to infinity. To study its convergence


properties, define the error vector at the $k$-th iteration by

$$e^{(k)} = \beta^{(k)} - \beta. \qquad (3.7)$$

Combining (3.5), (3.6) and (3.7) yields

$$e^{(k+1)} = (-M^{-1}N)e^{(k)}.$$

Since $\lim_{k\to\infty}\beta^{(k)} = \beta$ if and only if $\lim_{k\to\infty}e^{(k)} = 0$, we have the following theorem (see Golub and Van Loan [2012, Chap. 10.1.2] for a rigorous proof).

Theorem 3.1. (Convergence of standard iterations) Suppose both $A$ and $M$ are invertible. Denote the spectral radius of $-M^{-1}N$ by

$$\rho(-M^{-1}N) \stackrel{\text{def.}}{=} \max\{|\lambda| : \lambda \text{ is an eigenvalue of } -M^{-1}N\}.$$

If $\rho(-M^{-1}N) < 1$, then for any initial guess $\beta^{(0)}$, the sequence $\beta^{(k)}$ defined by (3.6) converges to the true solution $\beta$.

Furthermore, the smaller the spectral radius of $M^{-1}N$, the faster the error vanishes. Now we are ready to introduce three standard iterative methods. We split the matrix $A$ into three parts,

$$A = L + D + U,$$

where $L$ is the strictly lower triangular component, $U$ is the strictly upper triangular component, and $D$ contains only the diagonal. Then we can define three


iterative procedures:

$$\text{Jacobi method: } \beta^{(k+1)} = D^{-1}\big[-(L+U)\beta^{(k)} + z\big];$$

$$\text{Gauss-Seidel method: } \beta^{(k+1)} = (D+L)^{-1}\big(-U\beta^{(k)} + z\big);$$

$$\text{successive over-relaxation: } \beta^{(k+1)} = (D+\omega L)^{-1}\big[-(\omega U - (1-\omega)D)\beta^{(k)} + \omega z\big].$$

The Jacobi method does not necessarily converge. One sufficient condition for its convergence is strict diagonal dominance, i.e.,

$$|a_{ii}| > \sum_{j\ne i}|a_{ij}|, \qquad \forall\, i = 1, \dots, p.$$

For the other two methods, we have a more convenient result.

Proposition 3.2. Suppose $A$ is symmetric and positive definite. Then,

(a) the Gauss-Seidel method always converges to the true solution;

(b) the successive over-relaxation method converges for $\omega \in (0, 2)$.

See Golub and Van Loan [2012, Chap. 10.1.2] and Allaire et al. [2008, Chap. 8.2.3] for proofs. In fact, successive over-relaxation can be seen as a generalization of the Gauss-Seidel method; it may be derived from the equality $\omega A\beta = \omega z$. In each iteration of successive over-relaxation, we update our estimate by a weighted average of the last guess and the Gauss-Seidel update, with the relaxation parameter $\omega$ acting as the weight. When $\omega$ is chosen appropriately, successive over-relaxation can achieve a faster convergence rate than the Gauss-Seidel method. However, unless the matrix $A$ has a very nice structure, it is usually very difficult to find the optimal value of $\omega$ (clearly we do not want to compute all the eigenvalues of $A$).

When implementing these methods, we should take advantage of the sparsity


of the matrices $L$, $U$ and $D$. The updating equations for the three methods can be rewritten as

$$\text{Jacobi method: } \beta_i^{(k+1)} = \frac{1}{a_{ii}}\Big(z_i - \sum_{j=1}^{i-1}a_{ij}\beta_j^{(k)} - \sum_{j=i+1}^{p}a_{ij}\beta_j^{(k)}\Big);$$

$$\text{Gauss-Seidel method: } \beta_i^{(k+1)} = \frac{1}{a_{ii}}\Big(z_i - \sum_{j=1}^{i-1}a_{ij}\beta_j^{(k+1)} - \sum_{j=i+1}^{p}a_{ij}\beta_j^{(k)}\Big);$$

$$\text{successive over-relaxation: } \beta_i^{(k+1)} = \frac{\omega}{a_{ii}}\Big(z_i - \sum_{j=1}^{i-1}a_{ij}\beta_j^{(k+1)} - \sum_{j=i+1}^{p}a_{ij}\beta_j^{(k)}\Big) + (1-\omega)\beta_i^{(k)}. \qquad (3.8)$$

Hence, for all three methods, each iteration costs about $2p^2$ flops.
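The component-wise updates in (3.8) translate directly into code. A sketch (the function name is ours; the example matrix $A$ is symmetric positive definite, so Gauss-Seidel and SOR are guaranteed to converge by Proposition 3.2):

```python
import numpy as np

def splitting_solve(A, z, method="gauss-seidel", omega=1.0, iters=200):
    """Jacobi / Gauss-Seidel / SOR via the component updates in (3.8)."""
    p = len(z)
    beta = np.zeros(p)
    for _ in range(iters):
        prev = beta.copy()
        for i in range(p):
            # j < i: new values for Gauss-Seidel/SOR, old values for Jacobi
            head = prev[:i] if method == "jacobi" else beta[:i]
            s = z[i] - A[i, :i] @ head - A[i, i + 1:] @ prev[i + 1:]
            if method == "sor":
                beta[i] = omega * s / A[i, i] + (1 - omega) * prev[i]
            else:
                beta[i] = s / A[i, i]
    return beta

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 6))
A = X.T @ X + 25.0 * np.eye(6)    # symmetric positive definite
z = X.T @ rng.standard_normal(100)
beta_gs = splitting_solve(A, z)   # Gauss-Seidel, guaranteed to converge here
```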

3.3.2 Steepest Descent and Conjugate Gradient

Another important class of iterative methods is the Krylov subspace methods [Trefethen and Bau III, 1997, Lec. 38]. The idea is to find an approximate solution, $\beta^{(k)}$, in the subspace

$$\mathrm{span}\{z, Az, A^2z, \dots, A^{k-1}z\}. \qquad (3.9)$$

Steepest descent is such an algorithm. Let

$$r^{(k)} = z - A\beta^{(k)}$$

be the "residual" at the $k$-th iteration. We update $\beta^{(k)}$ and $r^{(k)}$ together by

$$\beta^{(k+1)} = \beta^{(k)} + \frac{r^{(k)t}r^{(k)}}{r^{(k)t}Ar^{(k)}}\,r^{(k)}, \qquad r^{(k+1)} = z - A\beta^{(k+1)}.$$


Just like the name suggests, steepest descent searches for the next estimate along

the current gradient. In practice, it is usually used as an optimization algorithm to

find the local maximum (minimum). However, it is rarely used for solving systems

of linear equations due to its bad convergence properties. Instead, conjugate

gradient, which is another Krylov subspace method, is often preferred. Conjugate

gradient relies on a conjugate sequence of vectors t(1), . . . , and updates β(k), r(k)

and t(k) by

t(k+1) = r(k) +rt(k)r(k)

rt(k−1)r(k−1)

t(k);

β(k+1) = β(k) +rt(k)r(k)

tt(k+1)At(k+1)

t(k+1);

r(k+1) = r(k) −rt(k)r(k)

tt(k+1)At(k+1)

At(k+1).

The initial values for r and t are given by

r^{(0)} = t^{(1)} = z - A\beta^{(0)}.

Conjugate gradient usually performs well when the matrix A is large and sparse. Theoretically speaking, it can also be viewed as a direct method, since it always converges to the true solution within p iterations up to rounding error. See Trefethen and Bau III [1997, Lec. 38] and Golub and Van Loan [2012, Chap. 10.2] for more information.
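A compact pure-Python sketch of the conjugate-gradient recursion above, assuming A is symmetric positive definite (the function name and example are illustrative, not the thesis implementation):

```python
def conjugate_gradient(A, z, beta=None, tol=1e-12):
    """Sketch of CG for A beta = z; in exact arithmetic it terminates
    within p iterations, matching the direct-method view in the text."""
    p = len(z)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(p)) for i in range(p)]
    beta = list(beta) if beta is not None else [0.0] * p
    r = [zi - ai for zi, ai in zip(z, matvec(beta))]   # r(0) = z - A beta(0)
    t = list(r)                                        # t(1) = r(0)
    rs_old = sum(ri * ri for ri in r)
    if rs_old == 0.0:
        return beta                                    # beta(0) already solves it
    for _ in range(p):
        At = matvec(t)
        alpha = rs_old / sum(ti * ai for ti, ai in zip(t, At))
        beta = [bi + alpha * ti for bi, ti in zip(beta, t)]
        r = [ri - alpha * ai for ri, ai in zip(r, At)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        # New conjugate direction: t <- r + (rs_new / rs_old) t.
        t = [ri + (rs_new / rs_old) * ti for ri, ti in zip(r, t)]
        rs_old = rs_new
    return beta
```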


3.4 A Novel Iterative Method Using Complex

Factorization

In this section, a new method for solving (3.7) is proposed. It is iterative, just like the Gauss-Seidel method, and can be generalized by introducing a relaxation parameter. But unlike all the iterative methods discussed in the last section, this method relies on the Cholesky decomposition of X^t X and makes use of the special structure of the matrix A. In some applications like variable selection, the Cholesky decomposition of X^t X can be updated very efficiently, whereas there is no easy way to update the Cholesky decomposition of A when the diagonal of Σ changes. Our idea is based on a "complex factorization" of the matrix A, and thus we call our method ICF (Iterative solutions using Complex Factorization).

3.4.1 ICF and Its Convergence Properties

Assume the Cholesky decomposition of the Gram matrix X^t X is available and given by

X^t X = R^t R,

where R is an upper triangular matrix. Then we have A = R^t R + \Sigma. Define

D = R^t \Sigma^{1/2} - \Sigma^{1/2} R,
\qquad H = (R^t - i\Sigma^{1/2})(R + i\Sigma^{1/2}). \quad (3.10)


One can check that A = H - iD. According to (3.6), we may iteratively compute β by

\beta^{(k+1)} = H^{-1}(iD\beta^{(k)} + z).

Two important observations are made. First, since H is a product of triangular matrices, calculating the right-hand side is quick by forward and backward substitutions. Second, since β is real, we may discard the imaginary part of the right-hand side in each iteration. Thus, the estimate for β is updated by

\beta^{(k+1)} = \mathrm{Re}\big[H^{-1}(iD\beta^{(k)} + z)\big]. \quad (3.11)

Discarding the imaginary part turns out to substantially expedite the convergence. Just like successive over-relaxation, we can define a more general iterative procedure by introducing a relaxation parameter ω ∈ (0, 1],

\beta^{(k+1)} = \mathrm{Re}\big[(1-\omega)\beta^{(k)} + \omega H^{-1}(iD\beta^{(k)} + z)\big]. \quad (3.12)

When ω = 1, (3.12) reduces to (3.11). The iterative method defined by (3.12) is referred to as ICF (Iterative solutions using Complex Factorization). Each iteration requires about 6p^2 flops (a matrix-vector multiplication plus two complex backward/forward substitutions). We still use e^{(k)} to denote the error at the k-th iteration. It can be shown that e^{(k+1)} = \Psi(\omega)e^{(k)}, where

\Psi(\omega) = \mathrm{Re}\big[(1-\omega)I + i\omega H^{-1}D\big].

By Golub and Van Loan [2012, Theorem 10.1.1], we have the following proposition.

Proposition 3.3. The ICF method defined by (3.12) converges if and only if


ρ(Ψ(ω)) < 1 where ρ denotes the spectral radius.
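Because H factors into complex triangular matrices, each ICF step in (3.12) costs one matrix-vector product plus one complex forward and one complex backward substitution. A minimal pure-Python sketch using Python's built-in complex type (illustrative only, not the thesis's C++/GSL implementation):

```python
def icf_solve(R, sigma_diag, z, omega=1.0, n_iter=50):
    """Sketch of the ICF iteration (3.12) for (R^t R + Sigma) beta = z.

    R is the upper-triangular Cholesky factor of X^t X and Sigma is
    diagonal, so H = (R^t - i Sigma^{1/2})(R + i Sigma^{1/2}) is applied
    via one complex forward and one complex backward substitution.
    """
    p = len(z)
    s = [x ** 0.5 for x in sigma_diag]
    # D = R^t Sigma^{1/2} - Sigma^{1/2} R (skew-symmetric, eq. 3.10).
    D = [[(R[j][i] * s[j] if j <= i else 0.0) -
          (s[i] * R[i][j] if i <= j else 0.0)
          for j in range(p)] for i in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        rhs = [1j * sum(D[i][j] * beta[j] for j in range(p)) + z[i]
               for i in range(p)]
        # Forward substitution with the lower factor R^t - i Sigma^{1/2}.
        y = [0j] * p
        for i in range(p):
            acc = rhs[i] - sum(R[j][i] * y[j] for j in range(i))
            y[i] = acc / (R[i][i] - 1j * s[i])
        # Backward substitution with the upper factor R + i Sigma^{1/2}.
        v = [0j] * p
        for i in range(p - 1, -1, -1):
            acc = y[i] - sum(R[i][j] * v[j] for j in range(i + 1, p))
            v[i] = acc / (R[i][i] + 1j * s[i])
        # Update (3.12), keeping only the real part.
        beta = [(1 - omega) * beta[i] + omega * v[i].real for i in range(p)]
    return beta
```

For a well-conditioned toy problem (e.g. R = [[2, 1], [0, √2]] with Σ = I, so A = [[5, 2], [2, 4]]), the iterates converge rapidly to A^{-1}z, as the spectral-radius analysis below predicts.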

The next theorem provides the theoretical guarantee of the convergence of ICF.

Theorem 3.4. The convergence of ICF can always be obtained by choosing an

appropriate relaxation parameter ω ∈ (0, 1].

Proof. First, since H = A + iD, the imaginary part of H^{-1} can be solved. By Lemma 7.5,

\mathrm{Im}(H^{-1}) = -A^{-1}D(A + DA^{-1}D)^{-1}.

Then, by the fact that both A and D are real matrices and Lemma 7.2 (the Woodbury matrix identity),

\Psi(\omega) = I - \omega(I + A^{-1}DA^{-1}D)^{-1}. \quad (3.13)

Inspection reveals that, given ω, the spectrum of \Psi(\omega) is fully determined by that of A^{-1}D. Because A^{-1/2}DA^{-1/2} is skew-symmetric, the eigenvalues of the matrix A^{-1}D must be conjugate pairs of pure imaginaries or zero by Proposition 7.11. Let \pm\eta i be such a pair with η ≥ 0 and let u be the eigenvector corresponding to the eigenvalue \eta i. We have

A^{-1}Du = i\eta u,

which can be rearranged to get iu^* Du = -\eta u^* Au, where u^* denotes the conjugate transpose of u. From the definition in (3.10), H is a Hermitian positive definite matrix, which yields

u^* H u = u^*(A + iD)u = (1-\eta)u^* Au > 0.


Since A is also positive definite, we must have η ∈ [0, 1). (This also implies that I + A^{-1}DA^{-1}D is invertible.) Using (3.13), we can show that \Psi(\omega) must have two eigenvalues equal to (1-\eta^2-\omega)/(1-\eta^2). Hence, to make \rho(\Psi(\omega)) smaller than 1, we just need

\left|\frac{1-\eta^2-\omega}{1-\eta^2}\right| < 1 \iff \eta < \sqrt{1-\omega/2} \quad (3.14)

to hold for every possible η. Since η is always strictly smaller than 1, there must exist a positive ω that satisfies (3.14).

Consider the special case ω = 1. Then, from the proof of the last theorem, the spectral radius of Ψ can be computed from that of A^{-1}D.

Corollary 3.5. If ω = 1, then \rho(\Psi) = \rho^2(A^{-1}D)/(1 - \rho^2(A^{-1}D)).

Proof. By definition \eta_{\max} = \rho(A^{-1}D). The claim is then proved by noticing that all the eigenvalues of \Psi(1) must be non-positive.

When ω = 1, we can also formulate a sufficient condition for the convergence of ICF.

Corollary 3.6. If \Sigma = \sigma^{-2}I and ω = 1, a sufficient condition for the convergence of ICF is \max_{i>j} |R_{ij}| < 1/(\sigma\sqrt{2}).

Proof. By the submultiplicativity of the matrix norm,

\rho(A^{-1}D) \le \rho(A^{-1})\rho(D) \le \sigma^2\rho(D) \le \sigma^2\|D\|_{\max} = \sigma \max_{i>j} |R_{ij}|.

By (3.14), \max_{i>j} |R_{ij}| < 1/(\sigma\sqrt{2}) is then a sufficient condition for \rho(\Psi(1)) < 1.


3.4.2 Tuning the Relaxation Parameter for ICF

To make ICF generally applicable, it is necessary to work out a convenient way to choose the relaxation parameter ω. From the proof of Theorem 3.4, it can be seen that there is a close relationship between \rho(\Psi) and the spectrum of A^{-1}D. Indeed, if the latter is known, the optimal value for ω can be evaluated analytically.

Corollary 3.7. Let \eta_{\min} and \eta_{\max} denote the smallest and largest absolute values of the eigenvalues of A^{-1}D. Then the optimal value for ω is

\omega^* = 2\left(\frac{1}{1-\eta_{\min}^2} + \frac{1}{1-\eta_{\max}^2}\right)^{-1}.

If \eta_{\min} = 0, we have \omega^* = (2 - 2\eta_{\max}^2)/(2 - \eta_{\max}^2) and

\rho(\Psi(\omega^*)) = \eta_{\max}^2/(2 - \eta_{\max}^2) = 1 - \omega^*.

Proof. From the proof of Theorem 3.4, we know that the smallest and largest eigenvalues of \Psi(\omega) are 1 - \omega/(1-\eta_{\max}^2) and 1 - \omega/(1-\eta_{\min}^2). They can be positive or negative. Recall η ∈ [0, 1). Since the optimal value of ω must minimize the spectral radius of Ψ, we have

\omega^* = \arg\min_{\omega\in(0,1]} \rho(\Psi(\omega)) = \arg\min_{\omega\in(0,1]} \max\left\{1 - \frac{\omega}{1-\eta_{\min}^2},\ \frac{\omega}{1-\eta_{\max}^2} - 1\right\}.

\omega^* is attained when the two quantities in the braces on the right-hand side are equal.

By the properties of skew-symmetric matrices, \eta_{\min} = 0 when p is odd. When p is even, \eta_{\min} is still extremely small for a moderate sample size, and is therefore treated as 0 in our discussion. Simulation shows that for a large design matrix that is not ill-conditioned, as in GWAS, \eta_{\max} is usually close to zero. As a result, by simply


choosing ω = 1, which is in fact near optimal, ICF converges strikingly fast. In Figure 3.1, both \rho(\Psi(1)) and \rho(\Psi(\omega^*)) from simulated data are plotted against sample size n (p = 500). When n < p, \rho(\Psi(1)) \gg 1 and thus ICF fails. Even if the optimal \omega^* is used, the spectral radius is still very close to 1, which means convergence cannot be attained within a reasonable number of iterations. But once n grows greater than p, \rho(\Psi(1)) plummets. This phenomenon persists for other choices of p.

[Figure: both panels plot log10(ρ) against n for Ψ(1) and Ψ(ω∗).]

Figure 3.1: The relationship between ρ(Ψ(1)), ρ(Ψ(ω∗)) and n with p = 500 and Σ = I. For each n we simulate 100 datasets with independent predictors. The dots denote the mean and the grey bars indicate the 2.5% and 97.5% quantiles. X_{ij} is sampled from Bin(2, f_j) with f_j ∼ U(0.01, 0.99) in the left panel and from N(0, 1) in the right.

If the design matrix X has severe multicollinearity, even if n ≫ p, it is likely that \eta_{\max}^2 > 1/2, so that ICF does not converge for ω = 1. To show this, a sub-dataset containing the first 20,000 SNPs on chromosome 1 from the IOP dataset (see Chap. 7.7.1) is constructed. For given n and p, X is sampled from this sub-dataset and its corresponding \eta_{\max} is computed. Then, by Corollary 3.7, assuming \eta_{\min} = 0, the optimal spectral radius \rho(\Psi(\omega^*)) can be computed. This is repeated 1,000 times. Since neighboring SNPs often have a high correlation due to linkage disequilibrium, when p is large, say several hundred, the design


matrix X is susceptible to severe collinearity. Figure 3.2 displays the distribution of \rho(\Psi(\omega^*)). Recall that when \eta_{\max} is very small, \omega^* is close to 1 and thus \rho(\Psi(\omega^*)) is close to zero. But in Fig. 3.2, a large proportion of the \rho(\Psi(\omega^*)) values are away from zero, and for those cases even the choice ω = 1 cannot achieve convergence (this can be shown using Corollaries 3.5 and 3.7). Fortunately, \rho(\Psi(\omega^*)) is still away from 1, which implies that the optimal convergence rate is always acceptable. An extremely interesting observation is the peak at \rho(\Psi(\omega^*)) \approx 1/3 in Fig. 3.2, which corresponds to \omega^* = 2/3 and \eta_{\max}^2 = 1/2, the boundary value for the convergence of ICF with ω = 1.

Figure 3.2: The distribution of \rho(\Psi(\omega^*)) for the IOP-chr1 dataset. \eta_{\max} is computed using the eigendecomposition of A^{-1}D. The optimal spectral radius is then computed by \rho(\Psi(\omega^*)) = \eta_{\max}^2/(2 - \eta_{\max}^2), which is equal to 1 - \omega^*.


From these simulations, it can be concluded that as long as n ≫ p and an appropriate value is chosen for ω, ICF should converge rapidly. In fact, it is not difficult to adjust ω automatically in the iterations. We start from \omega^{(0)} = 1 and assume \eta_{\min} = 0. The idea is to compute an estimate for the spectral radius of \Psi(\omega) in each iteration, which in turn helps us determine the value of ω for the next iteration. As the number of iterations grows, the choice of ω tends to the optimal value.

Since overall \omega^{(k)} has a decreasing trend, the spectral radius of Ψ at the k-th iteration is

\rho(\Psi(\omega^{(k)})) = \max\left\{1 - \omega^{(k)},\ \frac{\omega^{(k)}}{1-\eta_{\max}^2} - 1\right\} \approx \frac{\omega^{(k)}}{1-\eta_{\max}^2} - 1.

Although sometimes 1 - \omega^{(k)} may be greater, this only occurs when \omega^{(k)} is near \omega^*, and therefore \rho(\Psi(\omega^{(k)})) can be approximated by \omega^{(k)}/(1-\eta_{\max}^2) - 1 by Corollary 3.7.

A heuristic way to estimate the spectral radius of Ψ is given by

\rho^{(k)} = \frac{\|\beta^{(k)} - \beta^{(k-1)}\|_2}{\|\beta^{(k-1)} - \beta^{(k-2)}\|_2},

which leads to updating ω by

\omega^{(k+1)} = \frac{2\omega^{(k)}}{1 + \omega^{(k)} + \rho^{(k)}}.

Theoretically this estimation may break down in some situations. A trivial case is

β(0) = β. A non-trivial situation is that the error vector may be orthogonal to the

eigenvectors of Ψ that correspond to the largest (in absolute value) eigenvalue. In

practice such special situations are very rare and this adaptive procedure works

very well in our simulation studies.
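A minimal sketch of this adaptive scheme (pure Python; the helper names are illustrative, not from the thesis code): estimate ρ from three successive iterates, then update ω by the rule above.

```python
def estimate_rho(beta_k, beta_k1, beta_k2):
    """Heuristic spectral-radius estimate from three successive iterates:
    rho = ||beta(k) - beta(k-1)||_2 / ||beta(k-1) - beta(k-2)||_2."""
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return dist(beta_k, beta_k1) / dist(beta_k1, beta_k2)

def next_omega(omega, rho):
    """Update omega via omega' = 2*omega / (1 + omega + rho). If rho sits
    at its optimum 1 - omega (Corollary 3.7), omega is left unchanged."""
    return 2.0 * omega / (1.0 + omega + rho)
```

For example, starting from ω = 1 with an observed ratio ρ = 0.5 gives ω' = 0.8, and ω = 0.8 with ρ = 0.2 is a fixed point, since there ρ = 1 − ω.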


3.5 Performance Comparison by Simulation

3.5.1 Methods

Simulation is used to compare the performance of seven methods: ICF, Cholesky

decomposition (of matrix A), Jacobi method, Gauss-Seidel method, successive

over-relaxation, steepest descent and conjugate gradient. Two sub-datasets are

constructed using the IOP dataset (see Chap. 7.7.1). The first sub-dataset contains 20,000 SNPs evenly distributed across the whole genome, which henceforth is referred to as IOP-ind. These SNPs can be regarded as independent from each other due to their distant genomic locations. The second contains the first 20,000 SNPs on chromosome 1, which henceforth is referred to as IOP-chr1. As explained in the last section, severe multicollinearity is very likely to occur when the design

matrix is sampled from this sub-dataset. For given n and p, X is sampled from

each sub-dataset and y is simulated under both the null, y ∼ MVN(0, I), and the

alternative, y ∼ MVN(0, I + 0.52XX t). Then the seven methods are applied to

compute the corresponding ridge estimator with σ = 0.5. The Jacobi method and steepest descent are immediately excluded from the following experiments owing to poor performance. Both methods easily fail to converge for p = 200 and fail every time for p = 500 using the IOP-ind dataset.

Five methods enter the final comparison: ICF, Cholesky decomposition (Chol),

Gauss-Seidel method (GS), successive over-relaxation (SOR), and conjugate gra-

dient (CG). For SOR, we have to choose a value for the relaxation parameter,

which is known to be very difficult. The most famous result concerning this prob-

lem is due to Young [1954] (see also Yang and Matthias [2007]). Unfortunately,

the assumption of Young’s rule is not met when p grows greater than 500. Thus

we did some tests and decided to use 1.2 for the relaxation parameter of SOR,


which appeared to produce a better overall performance than other choices. For

all the iterative methods, we start from \beta^{(0)} = 0 and stop if

\|\beta^{(k)} - \beta^{(k-1)}\|_\infty = \max_i \big|\beta_i^{(k)} - \beta_i^{(k-1)}\big| < 10^{-6} \quad (3.15)

or the number of iterations exceeds M , where M = 50 for ICF and M = 200 for

other methods. Initializing β(0) to simple linear regression estimates was also tried

but all results remained essentially unchanged. The code was written in C++

and the Cholesky decomposition was implemented using GSL (GNU Scientific

Library) [Gough, 2009]. GS and SOR were implemented according to (3.8) to

obtain maximum efficiency.
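The stopping rule (3.15) is simply a max-norm test on successive iterates; a one-line helper (pure Python, illustrative):

```python
def has_converged(beta_new, beta_old, tol=1e-6):
    """Stopping criterion (3.15): the infinity-norm of the difference
    between successive iterates falls below tol."""
    return max(abs(a - b) for a, b in zip(beta_new, beta_old)) < tol
```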

3.5.2 Wall-time Usage, Convergence Rate and Accuracy

Simulation shows that ICF is the best in terms of wall-time usage, convergence

rate and numerical accuracy.

Wall-time Usage Sample size n is fixed at 3000 and p ranges from 10 to 1200. For each p, the simulation of X and y is repeated 1000 times. The total wall time used by the five methods is shown in Fig. 3.3, and part of the exact numerical

values are provided in Table 3.1. It is very clear that as p grows larger, ICF has

an overwhelming advantage over all the other methods. In fact, in the simulation

with IOP-chr1 dataset, all the other iterative methods fail. Gauss-Seidel method

and successive over-relaxation fail to converge within 200 iterations in most cases

when p ≥ 600 (see Table 3.1). If we really want to use these two methods to

achieve convergence, the time usage would be much greater than that of the Cholesky decomposition. Conjugate gradient, although it converges in all cases as ICF does, is

always slower than the Cholesky decomposition. Since the Cholesky decomposition


is an exact direct method, there is no reason to use conjugate gradient either.

Figure 3.3: Comparison of the wall time usage of the five methods for computing 1,000 ridge estimators.


                          Time (in seconds)               Convergence failures
   p   Dataset      Chol    ICF     GS    SOR     CG      ICF    GS   SOR   CG
  50   IOP-chr1    0.035  0.020  0.029  0.031  0.107        0     8     6    0
  50   IOP-ind     0.034  0.019  0.016  0.025  0.104        0     0     0    0
 200   IOP-chr1     1.39   0.45   2.13   1.69   2.66        0   125    93    0
 200   IOP-ind      1.38  0.304  0.339  0.385   2.29        0     1     1    0
 400   IOP-chr1     10.9   2.90   19.1   16.4   16.0        0   427   344    0
 400   IOP-ind      11.0   1.64   2.05   1.76   11.5        0     1     0    0
 600   IOP-chr1     35.6   10.3   58.0   53.6   51.3        0   754   639    0
 600   IOP-ind      35.8   4.53   6.40   4.86   31.2        0     5     4    0
 800   IOP-chr1     82.7   20.5  115.5    112    125        0   933   866    0
 800   IOP-ind      83.0   7.85   15.2   10.9   65.1        0    10     8    0
1000   IOP-chr1      160   35.8    183    180    244        0   979   951    0
1000   IOP-ind       161   14.3   35.9   25.1    136        0    11     7    0
1200   IOP-chr1      270   63.6    290    287    450        0   998   992    0
1200   IOP-ind       269   20.3   62.5   43.1    202        0    26    22    0

Table 3.1: The wall time usage and the number of convergence failures of the five methods for computing 1,000 ridge estimators under the null model. The "Time" columns correspond to the points in Figure 3.3. The "Convergence failures" columns give the number of cases that fail to stop before M iterations, where M = 50 for ICF and M = 200 for the other iterative methods. Simulation under the alternative produces very similar results.

Convergence Rate To further investigate the difference between the itera-

tive methods, I compare the number of iterations used by ICF, successive over-

relaxation and conjugate gradient. Gauss-Seidel is excluded since it is always


poorer than successive over-relaxation. 10,000 pairs of X and y are simulated under the null for p = 500, 1000. The distributions of the number of iterations used to stop are shown in Fig. 3.4. SOR only works for the IOP-ind dataset. It is excluded in the panels for the IOP-chr1 dataset since 50% of the cases use more than 200 iterations for p = 500 and 96% for p = 1000. Conjugate gradient, on the other hand, always uses many more iterations than ICF to converge.

Figure 3.4: Comparison of the number of iterations used by ICF, SOR and CG.

Accuracy In the simulation described in the last paragraph, the maximum absolute error for all the cases that have converged is also calculated. Recall our stopping criterion (3.15), which was designed with the aim of controlling this error below 10^{-6}. Figure 3.5 shows that ICF usually achieves this precision while the other two almost never do so. Since X and y are simulated under the null, the entries of the true β are usually very small. The maximum relative error is found to be usually two orders of magnitude greater than the maximum absolute error.


Figure 3.5: Comparison of the accuracy of ICF, SOR and CG. Each panel shows the distributions of the maximum absolute error on the log10 scale. Red bars stand for ICF, blue for CG and yellow for SOR. The non-convergent cases are excluded, and thus for SOR the total number is much smaller than 10,000 for the IOP-chr1 dataset.


Chapter 4

Bayesian Variable Selection Regression

In genome-wide association studies, one central goal is to identify the causal SNPs,

or rather, the SNPs that are associated with the phenotype, from millions of

genotyped SNPs. This procedure is called variable selection by statisticians. The

classical approach, which is simple but effective, is to use the regression model.

In this chapter, first the methods for Bayesian variable selection based on linear

regression are reviewed in Chap. 4.1. Then a novel MCMC algorithm implementing

the ICF method (introduced in the last chapter) is proposed, which substantially

expedites the model fitting of Bayesian variable selection.

4.1 Background and Literature Review

Consider the regression model,

y = \mu\mathbf{1} + X\beta + \varepsilon = \mu\mathbf{1} + \sum_{i=1}^{N} \beta_i x_i + \varepsilon, \qquad \varepsilon \sim \mathrm{MVN}(0, \tau^{-1}I),


where X is an n×N matrix with N ≫ n and ε represents the errors, which are

usually assumed to be i.i.d. normal random variables. Both y and X are assumed

to be centered so that the intercept term is omitted. If there are other confounding

covariates to be controlled for, they should also be regressed out from y and X.

To avoid overfitting, we have to impose some constraints on β. For example, we

may use ridge regression to shrink the estimates for βi to zero. However, it is

often difficult to interpret such results. A more convenient way is to assume the

sparseness of the model, i.e., to assume that most elements of β are zero. This

assumption is indeed very desirable in most applications. For example, in GWAS,

each column of X represents a SNP and only a small number of SNPs may be

correlated with the phenotype y. The identification of these causal SNPs is of

primary interest to most GWAS studies.

A wide variety of Bayesian methods for variable selection have emerged since the late 1980s. It is difficult to divide them into distinct categories due to the common

features shared between different methods. For a comprehensive review structured

in this way, see O’Hara and Sillanpaa [2009]. I take a different approach. The

review is separated into two parts, the formulation of the model and how to fit the

model. They are the two defining elements of a Bayesian variable selection method.

For more introductory materials or surveys, see Miller [2002, Chap. 7] and Walli

[2010] among others. Innovations concerning the model selection criterion, such

as the deviance information criterion [Spiegelhalter et al., 2002], the fractional

Bayes factor [O’Hagan, 1995] and the intrinsic Bayes factor [Berger and Pericchi,

1996a,b], are beyond the scope of this chapter and thus not to be discussed.


4.1.1 Models for Bayesian Variable Selection

Indicator Variable and “Spike and Slab” Prior Unlike frequentists’ model

selection methods, for example LASSO [Tibshirani, 1996] which outputs a single

best model, Bayesian variable selection procedures usually attempt to compute the

marginal posterior probability of every predictor's being included in the model. This can be achieved by introducing the auxiliary indicator variable \gamma \in \{0, 1\}^N and using the "spike and slab" prior [Ishwaran and Rao, 2005],

\beta_i \mid \gamma_i \overset{\text{i.i.d.}}{\sim} (1-\gamma_i)F_0 + \gamma_i F_1. \quad (4.1)

F_0 is a distribution concentrated at or around zero (the "spike") and F_1 is a flat distribution spread over a wide region (the "slab"). Thus, variable selection has been translated into the parameter estimation of γ. Typically, the prior for γ is

\gamma_i \mid \pi \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\pi).

π may be chosen a priori or treated as a hyperparameter (and thus requires a hyperprior). In most applications, π is chosen to be very small to reflect the prior belief that only a small number of the covariates are truly effective, implicitly inducing a penalty on model complexity in the marginal likelihood. P(\gamma_i = 1 \mid y) is the posterior probability of a predictor being associated with the response variable and is sometimes referred to as the posterior inclusion probability (PIP).

The estimation of the posterior inclusion probabilities is a major goal of Bayesian

variable selection.

The first use of this prior seems to be due to Mitchell and Beauchamp [1988], where

F0 is chosen to be the degenerate distribution with unit probability mass at 0,

denoted by δ0, and F1 is a uniform distribution on (−b, b) for some b > 0. A more


common choice for F1 is a normal distribution with mean 0. For instance see Chen

and Dunson [2003] which also generalizes the “spike and slab” prior via a matrix

factorization. Note that F0 and F1 may be allowed to change with i, for example,

when the covariates are heterogeneous and form several distinct groups. Another

important example of “spike and slab” prior is called stochastic search variable

selection (SSVS), proposed by George and McCulloch [1993]. They considered

\beta_i \mid \gamma_i \sim (1-\gamma_i)N(0, \phi_i) + \gamma_i N(0, c_i\phi_i).

By letting \phi_i be small and c_i be large, this prior also takes the "spike and slab" shape. This method was later generalized to non-conjugate settings in George

and McCulloch [1997]. A practical difficulty in the implementation of SSVS is

the tuning of the parameters ci, φi, which is crucial to the mixing of the Gibbs

sampling chain and thus the accuracy of the posterior inferences.

Shrinkage Priors for β Instead of using the indicator variable γ, the model

sparseness can also be attained by using a shrinkage prior for β. Such a shrinkage

prior should approximate the “spike and slab” shape so that when the data displays

no evidence for association between βi and y, the estimate for βi is shrunk to

zero. A common approach is to let \beta_i \mid \phi_i \overset{\text{i.i.d.}}{\sim} N(0, \phi_i) and then put a shrinkage hyperprior on φ. For example, we may let p(\phi_i) \propto 1/\phi_i, which is sometimes called

Jeffreys’ prior. See Xu [2003] for an application to gene mapping. However, the

use of this model is very controversial because the joint posterior distribution is

improper, though the full conditionals, which are used by Gibbs sampling, are

proper. See Hobert and Casella [1996] for the proof. An alternative choice is to


use

p(\beta_i \mid \phi) = \frac{1}{\sqrt{2\phi}} \exp\left(-\frac{\sqrt{2}}{\sqrt{\phi}}\,|\beta_i|\right),

i.e., to use a double exponential (Laplace) prior for βi. This prior is referred to as

“Bayesian LASSO” [Park and Casella, 2008] since its maximum a posteriori (MAP)

estimation coincides with the LASSO method. φ may be chosen by an empirical

Bayesian method or given a shrinkage hyperprior. To do variable selection with

these models, one can simply set a threshold C and select the covariates with

|βi| > C.

The mixture of g-priors is another important example. Recall Zellner's g-prior

\beta \mid g \sim \mathrm{MVN}\!\left(0,\ g\tau^{-1}(X^t X)^{-1}\right).

Liang et al. [2008] proposed to put the following hyperprior on g,

p(g) = \frac{a-2}{2}(1+g)^{-a/2}, \qquad g > 0.

In their simulation studies, they used a = 3. An alternative is the Zellner-Siow prior [Zellner and Siow, 1980], which they show is actually a special case of mixtures of g-priors with an inverse-gamma prior on g. Remarkably, Liang et al. [2008]

proved the consistency of using this prior for both model selection and prediction.

To obtain the PIPs, for small N we may compute the marginal likelihood of every

possible model (or equivalently, the null-based Bayes factors) and then compute

the PIPs by model averaging.
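For small N, the enumeration-plus-model-averaging computation just described can be sketched as follows (pure Python; `log_ml` is a placeholder for the log marginal likelihood of a model, e.g. computed under the g-prior above, and the toy example at the end is entirely made up):

```python
from itertools import combinations
from math import exp, log

def posterior_inclusion_probs(N, log_ml, pi=0.1):
    """Enumerate all 2^N models gamma, weight each by its log marginal
    likelihood plus the Bernoulli(pi) prior on gamma, and return the
    posterior inclusion probability of each of the N predictors."""
    models, log_w = [], []
    for k in range(N + 1):
        for model in combinations(range(N), k):
            models.append(model)
            log_w.append(log_ml(model) + k * log(pi) + (N - k) * log(1 - pi))
    m = max(log_w)                      # subtract the max for stability
    w = [exp(lw - m) for lw in log_w]
    total = sum(w)
    pip = [0.0] * N
    for model, wt in zip(models, w):
        for i in model:
            pip[i] += wt / total
    return pip

# Toy example: predictor 0 always improves the (made-up) marginal likelihood.
toy_log_ml = lambda model: 1.0 if 0 in model else 0.0
pips = posterior_inclusion_probs(2, toy_log_ml, pi=0.5)
# pips[0] > pips[1]; with pi = 0.5, pips[1] = 1/2 by symmetry.
```

The exponential cost of the enumeration is exactly why MCMC methods, reviewed next, are needed once N is large.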


4.1.2 Methods for the Model Fitting

Markov Chain Monte Carlo Methods There is usually no closed-form ex-

pression for the posterior distribution in Bayesian variable selection. The most

common strategy for posterior inference is to use Markov chain Monte Carlo

(MCMC) methods to generate samples from the posterior. For an introduction to

various MCMC algorithms, see Liu [2008] and Brooks et al. [2011] among others.

In early attempts, Gibbs sampling [Liu, 2008, Chap. 6] was often used to fit the

Bayesian variable selection models with “spike and slab” priors [George and Mc-

Culloch, 1993, Kuo and Mallick, 1998]. However, if F0 and F1 in (4.1) are very

different, a proposal of flipping γi may be very unlikely to be accepted because

a value of βi sampled from F1 may be regarded as very unlikely under F0. As

a result, the mixing of the Markov chain can be very slow and never converge.

Dellaportas et al. [2002] used the idea of Carlin and Chib [1995] and tried to solve

this problem by a “pseudo prior”. In their method, when γi = 0, the covariate xi

is excluded from the model but βi, instead of being set to 0, is sampled from a

prior that is close to its conditional given γi = 1. But this method still relies on

very careful tuning that may be difficult to implement in practice.

Another class of MCMC methods uses the Metropolis-Hastings algorithm (see

Chap. 7.6). An early example was the reversible jump MCMC algorithm [Green,

1995, Sillanpaa and Arjas, 1998], which required the computation of the Jacobian

of the proposal to account for the change in the dimension of the parameter space.

Uimari and Hoeschele [1997] compared the performance of reversible jump MCMC

and other Gibbs-sampling-based approaches. Later it was noticed that the MCMC

could be more efficient if βi’s are integrated out in each iteration. We simply

propose to add or remove a covariate but no longer sample βi’s. The acceptance

ratio is then obtained by computing the null-based Bayes factor or the marginal


likelihood of the model. This method is used in a large body of literature, e.g.,

Godsill [1998], Yi [2004], Guan and Stephens [2011], Zhou et al. [2013]. The model

and the MCMC algorithm of Guan and Stephens [2011] will be discussed in great

detail in the next section.

Approximate and Variational Methods The computational difficulty of Bayesian inference arises from an intractable integration. Hence we may approximate this integral by numerical techniques, e.g. Sen and Churchill [2001], or find a tractable asymptotic estimate. Ball [2001] estimates the marginal

likelihood of a model by a modification of the BIC score.

The variational Bayesian method aims to find an analytical approximation to

the true joint posterior distribution by some known closed-form distribution functions. See Smıdl and Quinn [2006] for a full introduction to the theory. On the one hand, since a variational method produces a distributional approximation, various posterior inferences can be carried out very easily once we obtain the approximating distributions. On the other hand, variational methods are deterministic and thus

very fast compared with MCMC. Carbonetto and Stephens [2012] describes a sim-

ple but very efficient iterative variational algorithm for Bayesian variable selection

with “spike and slab” prior. Though the individual PIP estimate is usually off,

which is a known phenomenon for variational Bayesian methods [Fox and Roberts,

2012], the variational estimates for the posteriors of the hyperparameters turn out

to be very accurate in a wide range of settings. A recent work by Huang et al.

[2016] improved this algorithm and proved its consistency when the total number

of covariates grows exponentially fast with the sample size.

Approximate Bayesian computation (ABC) is another class of methods that

directly approximate the likelihood function by simulation [Sunnaker et al., 2013].


In its simplest form, rejection sampling, ABC decides whether to accept newly proposed parameter values by simulating data under these parameters and comparing them to the observed data. See Stahl et al. [2012] for a recent application in GWAS.

Searching for Maximum a Posteriori Estimates  We can also perform all statistical inferences using the MAP estimates, which can significantly reduce the computational cost. Methods from optimization theory and machine learning can then be applied [Tipping, 2004]. Hoggart et al. [2008] applied this strategy to whole-genome SNP data with both the normal-inverse-gamma prior and the double-exponential prior. In a similar vein, Segura et al. [2012] described a stepwise variable selection procedure using an asymptotic Bayes factor. In general, however, such approaches are less favored, since a paramount advantage of Bayesian methods is the ability to make inferences by model averaging [Raftery et al., 1997, Broman and Speed, 2002].

4.2 The BVSR Model of Guan and Stephens

The Bayesian variable selection regression (BVSR) model proposed by Guan and Stephens [2011] is the focus of this chapter. Their method has the following advantages.

• There is no need for parameter tuning, except in some very special applications. The method is adaptive, and the posterior inclusion probabilities can be accurately estimated under a wide range of settings.

• The proposed MCMC algorithm is very efficient compared with other methods.


• The model is parametrized by a hyperparameter describing the proportion of variance explained by the covariates. The estimation of this hyperparameter often turns out to be very accurate.

Their method was probably motivated by heritability estimation in genome-wide association studies, but it can be applied in many other fields.

4.2.1 Model and Prior

The model starts with a conventional spike-and-slab setting with an indicator vector γ:

    y | γ, β, X, τ ∼ MVN(X_γ β_γ, τ^{-1} I),
    τ ∼ Gamma(κ_1/2, κ_2/2),  κ_1, κ_2 ↓ 0,
    γ_j ∼ Bernoulli(π),                                            (4.2)
    β_j | γ_j = 1, τ ∼ N(0, σ²/τ),
    β_j | γ_j = 0 ∼ δ_0.

Consider the application to GWAS: γ_j = 1 means that the j-th SNP has a causal effect on the phenotype y, and δ_0 denotes a point mass at 0. X_γ denotes the design matrix restricted to the columns with γ_j = 1, and β_γ denotes the corresponding sub-vector of β. By letting the hyperparameters κ_1, κ_2 (shape and rate) go to zero, we are effectively using the prior p(τ) ∝ 1/τ, which may be called Jeffreys' prior (for τ). Since β_j = 0 whenever γ_j = 0, we may also write

    y | γ, β, X, τ ∼ MVN(Xβ, τ^{-1} I).


For the hyperparameters σ and π, BVSR puts the following hyperpriors:

    log π ∼ U(log π_min, log π_max),
    σ² = h / ((1 − h) ∑_j s_j γ_j),                                (4.3)
    h ∼ U(0, 1),

where s_j denotes the variance of the j-th SNP. π_min (π_max) is chosen small (large) enough to ensure that the true value of π is covered. For example, we may use π_min = 1/N, where N is the total number of SNPs in the dataset, and π_max = 1. The uniform prior on log π is critical because it induces the penalty on model complexity needed to obtain sparsity. The introduction of the parameter h is a major novelty of the BVSR model. h has a similar flavor to the R² statistic in traditional linear regression and can be interpreted as the proportion of the variance of y that is due to the additive effects of the covariates. To see this, notice that τ^{-1} is the error variance and

    Var(X_γ β_γ) = ∑_{γ_j=1} Var(x_j β_j) = (σ²/τ) ∑_{γ_j=1} s_j,

assuming independence between x_j (centered) and β_j. Thus,

    h = Var(X_γ β_γ) / (Var(X_γ β_γ) + τ^{-1}).

In GWAS, h corresponds to the narrow-sense heritability, which is defined as the proportion of the phenotypic variance that is due to additive genetic effects. See Chap. 1.3.2 for a short review of heritability estimation. Three comments are in order.

1. Just as in Chapter 2, we assume X and y are centered and thus drop the intercept term, µ, from the model. Note that if we explicitly include µ in the model, the prior would change to p(µ, τ) ∝ τ^{-1/2} (see Chap. 7.2.3). When there are other variables to be controlled for, we also regress them out of X and y, which is equivalent to including them in the model and putting a non-informative prior on their regression coefficients [George and McCulloch, 1993].

2. The full posterior of the BVSR model specified by (4.2) and (4.3) is proper. First, p(y|γ, h) < ∞, since the distribution of τ given y, γ, h is still a gamma distribution. Second, since p(h) is proper, we have p(y|γ) < ∞. Third, p(γ) = ∫ p(γ|π)p(π)dπ < ∞ even if we use an improper prior for π by setting π_min = 0. Lastly, γ can take only a finite number of values, and therefore the full posterior for (β, τ, h, π, γ) is proper.

3. The most important advantage of this model is its flexibility. Both h and π are estimated from the data, and the simulation studies showed that both of them, especially h, can be accurately estimated in various settings. However, this parametrization also gives rise to identifiability issues when the data contain almost no signal. In such cases, the data provide no information for estimating σ and h: the estimates for σ concentrate near 0 and, consequently, the posteriors of both π and h may spread over a wide range. For practical purposes, this problem is of little concern, since it is easily diagnosed from the posterior of σ and the large variability of the posteriors of π and h. If we really want to solve it, we may put a uniform prior on log h instead of h (choosing an h_min > 0 to make it proper) so that the posterior for h shrinks to zero when the data contain no information about h.
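As a concrete illustration, the generative model in (4.2)-(4.3) can be sketched in a few lines of code. This is a minimal simulator I wrote for illustration, not the authors' implementation; the function name and the guard against an empty model are my own.

```python
import numpy as np

def simulate_bvsr(X, pi, h, tau=1.0, rng=None):
    """Draw (gamma, beta, y) from the BVSR model (4.2)-(4.3).

    X is assumed column-centered; the guard against an empty model
    is for illustration only.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n, N = X.shape
    s = X.var(axis=0)                         # per-SNP variances s_j
    gamma = rng.random(N) < pi                # gamma_j ~ Bernoulli(pi)
    if gamma.sum() == 0:
        gamma[rng.integers(N)] = True         # ensure a non-empty model
    sigma2 = h / ((1.0 - h) * np.sum(s * gamma))
    beta = np.where(gamma,
                    rng.normal(0.0, np.sqrt(sigma2 / tau), size=N), 0.0)
    y = X @ beta + rng.normal(0.0, 1.0 / np.sqrt(tau), size=n)
    return gamma, beta, y

rng = np.random.default_rng(9)
X = rng.standard_normal((200, 100))
X -= X.mean(axis=0)
gamma, beta, y = simulate_bvsr(X, pi=0.1, h=0.5, rng=rng)
print(np.all(beta[~gamma] == 0.0))   # spike: excluded SNPs have beta exactly 0
```

Note how σ² is recomputed from h and the sampled γ, which is exactly what makes h interpretable as the proportion of variance explained.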


4.2.2 MCMC Implementation

In the original work of Guan and Stephens [2011], the posterior inference is done by a Metropolis-Hastings algorithm with mixed proposals. In each iteration, a new value for h is proposed by a random walk around the current value; π is proposed from the full conditional distribution p(π|γ); and γ is proposed by adding or deleting a predictor from the current model. The proposal distribution for adding a SNP is a mixture of a uniform and a geometric distribution. Lastly, with probability 0.3, a long-range proposal is made for γ, which compounds the local proposals a random number of times. Below are the details for each step.

Proposal for h and π  By convention, let q(θ∗ | θ) denote the proposal distribution for a new value θ∗ given the current value θ. The proposal for h is simply a random walk,

    h∗ | h ∼ Unif(h − 0.1, h + 0.1).

When an invalid value (outside (0, 1)) is proposed, the proposed value is reflected about the boundary. Hence the proposal ratio for h is always equal to 1.
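The reflected random walk can be sketched as follows; reflection preserves the symmetry q(h | h∗) = q(h∗ | h), which is why the proposal ratio is 1. The function name is mine.

```python
import random

def propose_h(h, step=0.1):
    """Uniform random-walk proposal for h, reflected back into (0, 1)."""
    h_new = h + random.uniform(-step, step)
    if h_new < 0.0:
        h_new = -h_new            # reflect about 0
    elif h_new > 1.0:
        h_new = 2.0 - h_new       # reflect about 1
    return h_new

random.seed(1)
draws = [propose_h(0.05) for _ in range(10000)]
print(all(0.0 <= d <= 1.0 for d in draws))  # True: reflection keeps h valid
```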

For π, first notice that it can actually be integrated out. Let us assume a more general form for the prior of π,

    p(π) ∝ π^{a−1}(1 − π)^{b−1} I_{(π_min, π_max)}(π),

that is, a truncated beta prior with a, b ≥ 0. The prior specified in (4.3) is then the special case a = 0 and b = 1. Define the beta function and the distribution function of the beta distribution by

    B(a, b) := Γ(a)Γ(b)/Γ(a + b),
    F_{B(a,b)}(x_1, x_2) := P(x_1 < X < x_2), where X ∼ Beta(a, b).

Then we may write

    p(π) = I_{(π_min, π_max)}(π) π^{a−1}(1 − π)^{b−1} / [B(a, b) F_{B(a,b)}(π_min, π_max)],

and the marginal prior probability of γ is

    p(γ) = ∫ p(γ | π) p(π) dπ
         = B(a + |γ|, b + N − |γ|) F_{B(a+|γ|, b+N−|γ|)}(π_min, π_max) / [B(a, b) F_{B(a,b)}(π_min, π_max)],    (4.4)

where |γ| = ||γ||_0 = ∑_j γ_j is the number of covariates in the model. Clearly the denominator is a constant independent of γ. Since the beta function and the cumulative distribution function of the beta distribution are both easy to calculate, there is actually no need to sample π in the MCMC. Besides, the posterior of π is not of much interest, or at least of less interest than the posterior of |γ|. However, if one insists on sampling π, it can be proposed from the conditional distribution

    π∗ | γ∗ ∼ Beta(a + |γ∗|, b + N − |γ∗|) truncated to (π_min, π_max),

after we have sampled γ∗. When π_min = 0 and π_max = 1, the sampling is easy, but when (π_min, π_max) is a region of very small probability, it can be time-consuming. In our new implementation, the algorithm of Damien and Walker [2001] is used to tackle this difficulty.
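Equation (4.4) is cheap to evaluate with standard special functions. Here is a small sketch using SciPy, with a small positive `a` standing in for the a → 0 limit for numerical stability; the function name and defaults are illustrative, not part of the original implementation.

```python
import numpy as np
from scipy.special import betaln
from scipy.stats import beta as beta_dist

def log_p_gamma(k, N, a=1e-8, b=1.0, pi_min=None, pi_max=1.0):
    """log of (4.4) for a model with |gamma| = k out of N SNPs.

    a -> 0 recovers the uniform-on-log(pi) prior of (4.3); a small
    positive a is used here to keep the beta function finite.
    """
    if pi_min is None:
        pi_min = 1.0 / N
    def log_trunc_mass(aa, bb):
        # log of B(aa, bb) * F_{B(aa,bb)}(pi_min, pi_max)
        cdf = beta_dist.cdf([pi_min, pi_max], aa, bb)
        return betaln(aa, bb) + np.log(cdf[1] - cdf[0])
    return log_trunc_mass(a + k, b + N - k) - log_trunc_mass(a, b)

N = 1000
# Sparser models receive more prior mass: this is the complexity penalty.
print(log_p_gamma(5, N) > log_p_gamma(50, N))  # True
```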


Proposal for γ  The local proposal for γ either adds a covariate or deletes one; the two types are equally likely. For deletion, the proposal is a uniform distribution over all the covariates in the model. For addition, the proposal is a mixture distribution: with probability 0.7, we choose a SNP uniformly at random from all the SNPs not in the current model; with probability 0.3, the SNP to be added is selected according to a truncated geometric distribution over all the candidate SNPs with γ_j = 0. For this geometric proposal, the SNPs are ordered so that those with larger single-SNP Bayes factors (from Bayesian simple linear regression) are more likely to be proposed. This mixture proposal significantly improves the efficiency of the MCMC chain when the number of SNPs is very large. Without the geometric proposal, an addition may have a very small acceptance probability, since the majority of the SNPs are not correlated with the phenotype.

Great care must be taken when calculating the proposal ratio q(γ | γ∗)/q(γ∗ | γ). Conditioning on the current value of γ, the probability of proposing to add the k-th SNP is

    0.7 · 1/(N − |γ|) + 0.3 · s(1 − s)^{R(k;γ)} / (1 − (1 − s)^{N−|γ|}),

where s is the success probability of the geometric proposal and R(k;γ) is the rank of the single-SNP Bayes factor of the k-th SNP among all the SNPs with γ_j = 0, which has to be recomputed every time.
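The mixture proposal probability can be sketched as follows. The weights, the success probability s, and the 1-based rank convention used here are illustrative assumptions; the thesis's R(k;γ) convention may differ by one.

```python
import numpy as np

def add_proposal_prob(rank, n_out, s=0.01, w_unif=0.7, w_geom=0.3):
    """Probability of proposing to add the excluded SNP whose single-SNP
    Bayes factor has rank `rank` (1 = largest) among the n_out SNPs with
    gamma_j = 0. Mixture of a uniform and a truncated geometric draw."""
    geom = s * (1.0 - s) ** (rank - 1) / (1.0 - (1.0 - s) ** n_out)
    return w_unif / n_out + w_geom * geom

n_out = 10000
probs = [add_proposal_prob(r, n_out) for r in range(1, n_out + 1)]
print(abs(sum(probs) - 1.0) < 1e-9)   # the mixture sums to one
print(probs[0] > probs[-1])           # top-ranked SNPs are favored
```

The truncation factor in the denominator is what makes the geometric component a proper distribution over the finite candidate set, so the two checks above hold for any s in (0, 1).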


Computing the Acceptance Ratio  The marginal likelihood of the current model is given by (see (1.3) and Chap. 7.2)

    p(y | γ, σ²(h,γ)) ∝ 1 / (σ^{|γ|} |X_γ^t X_γ + σ^{-2}I|^{1/2}) · (1 − y^t X_γ(X_γ^t X_γ + σ^{-2}I)^{-1} X_γ^t y / y^t y)^{-n/2}.    (4.5)

Then, for the prior specified in (4.3), the Metropolis-Hastings ratio can be calculated as

    α((h,γ), (h∗,γ∗)) = [p(y | h∗,γ∗) p(h∗,γ∗) q(h,γ | h∗,γ∗)] / [p(y | h,γ) p(h,γ) q(h∗,γ∗ | h,γ)]
                      = [p(y | h∗,γ∗) / p(y | h,γ)] · [q(γ | γ∗) / q(γ∗ | γ)]
                        · [Γ(|γ∗|)Γ(N − |γ∗| + 1) F_{B(|γ∗|, N−|γ∗|+1)}(π_min, π_max)] / [Γ(|γ|)Γ(N − |γ| + 1) F_{B(|γ|, N−|γ|+1)}(π_min, π_max)].
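The log of (4.5) is straightforward to transcribe; here is a minimal sketch (the proportionality constant is dropped, and y is assumed centered). The toy data and variable names are illustrative.

```python
import numpy as np

def log_marginal_likelihood(y, X_gamma, sigma2):
    """Log of (4.5), up to an additive constant: the marginal likelihood
    of the model indexed by gamma, with beta and tau integrated out."""
    n, k = X_gamma.shape
    A = X_gamma.T @ X_gamma + np.eye(k) / sigma2
    sign, logdet = np.linalg.slogdet(A)
    Xty = X_gamma.T @ y
    r2 = (Xty @ np.linalg.solve(A, Xty)) / (y @ y)   # explained fraction
    return -0.5 * k * np.log(sigma2) - 0.5 * logdet - 0.5 * n * np.log1p(-r2)

rng = np.random.default_rng(2)
n = 200
X = rng.standard_normal((n, 5))
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta + rng.standard_normal(n)
y -= y.mean()
# The model holding the true covariates beats one holding only null SNPs.
print(log_marginal_likelihood(y, X[:, :3], 1.0) >
      log_marginal_likelihood(y, X[:, 3:], 1.0))
```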

Compounding the Local Proposals  In every MCMC iteration, with probability 0.3, we make a long-range proposal for γ. To this end, we sample an integer K uniformly from {2, 3, . . . , 10} and then compound K local proposals. The proposal ratio is calculated simply by multiplying the ratios of the individual local proposals. To see that this calculation leaves the posterior invariant, by the detailed balance condition (see Chap. 7.6) we only need to show

    p(y | h,γ) p(h,γ) ∑_{ϱ ∈ P(γ→γ∗)} q_ϱ(γ → γ∗) min{1, [p(y | h∗,γ∗) p(h∗,γ∗) / (p(y | h,γ) p(h,γ))] · q_ϱ̄(γ∗ → γ) / q_ϱ(γ → γ∗)}

    = p(y | h∗,γ∗) p(h∗,γ∗) ∑_{ϱ̄ ∈ P(γ∗→γ)} q_ϱ̄(γ∗ → γ) min{1, [p(y | h,γ) p(h,γ) / (p(y | h∗,γ∗) p(h∗,γ∗))] · q_ϱ(γ → γ∗) / q_ϱ̄(γ∗ → γ)},

where P(γ → γ∗) denotes the set of all possible proposal paths from γ to γ∗. The key observation is that every path ϱ ∈ P(γ → γ∗) can be reversed to give another path, denoted ϱ̄, from γ∗ to γ. Hence there is a one-to-one mapping between the sets P(γ → γ∗) and P(γ∗ → γ), and for each such pair the detailed balance condition holds. Summing over these paths, the detailed balance condition is therefore still satisfied. The mixing of the MCMC chain can be greatly expedited by the long-range proposals: intuitively, they let the chain jump rapidly across the whole sample space and help it escape local traps. See Guan and Krone [2007] for a theoretical argument.

Rao-Blackwellization  Rao-Blackwellization [Blackwell, 1947] is a common technique used in sampling schemes to reduce the variance of the estimates [Casella and Robert, 1996, Douc and Robert, 2011]. Guan and Stephens [2011] also proposed a Rao-Blackwellization procedure to estimate γ and β by computing

    E[γ_j | y, γ_{−j}, β_{−j}, τ, h, π] and E[β_j | γ_j = 1, y, γ_{−j}, β_{−j}, τ, h, π],    (4.6)

where γ_{−j} and β_{−j} denote the corresponding sub-vectors with the j-th element removed. To calculate E[γ_j | y, γ_{−j}, β_{−j}, τ, h, π], we only need to figure out

    p(γ_j = 1 | y, γ_{−j}, β_{−j}, τ, h, π) / p(γ_j = 0 | y, γ_{−j}, β_{−j}, τ, h, π)
    = [p(γ_j = 1 | γ_{−j}, τ, h, π) / p(γ_j = 0 | γ_{−j}, τ, h, π)]
      · [p(β_{−j} | γ_j = 1, γ_{−j}, τ, h, π) / p(β_{−j} | γ_j = 0, γ_{−j}, τ, h, π)]
      · [p(y | γ_j = 1, γ_{−j}, β_{−j}, τ, h) / p(y | γ_j = 0, γ_{−j}, β_{−j}, τ, h)].

The first ratio on the r.h.s. is simply π/(1 − π). The second can be calculated using the fact that

    β_{−j} | γ, τ, h, π ∼ MVN(0, (σ²(h,γ)/τ) I).

Note that the second ratio is not equal to 1, because σ²(h,γ) changes with the value of γ_j. Let X_{γ−j} be the sub-matrix of X consisting of all the columns with γ_i = 1 except the j-th, and let β_{γ−j} be the corresponding sub-vector. Then

    y | γ_j = 1, γ_{−j}, β_{−j}, τ, h ∼ N(X_{γ−j} β_{γ−j}, τ^{-1}(σ² X_j X_j^t + I)).

Therefore, the third ratio can be computed as

    p(y | γ_j = 1, γ_{−j}, β_{−j}, τ, h) / p(y | γ_j = 0, γ_{−j}, β_{−j}, τ, h)
    = σ^{-1}(X_j^t X_j + σ^{-2})^{-1/2} exp{(τ/2) · ỹ^t X_j X_j^t ỹ / (σ^{-2} + X_j^t X_j)},

where ỹ = y − X_{γ−j} β_{γ−j} and σ² is calculated assuming γ_j = 1. Although these quantities are very fast to evaluate, we cannot afford Rao-Blackwellization in every MCMC iteration, because we would have to go through every SNP. In our implementation, we perform Rao-Blackwellization every 1000 iterations by default.
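The closed-form ratio above can be checked numerically against direct evaluation of the two Gaussian densities. In the sketch below, ỹ is just a random vector standing in for y − X_{γ−j}β_{γ−j}, and all numerical values are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
n, tau, sigma2 = 30, 2.0, 0.7
x_j = rng.standard_normal(n)        # the candidate SNP column X_j
y_tilde = rng.standard_normal(n)    # stands in for y - X_{gamma-j} beta_{gamma-j}

# Closed-form ratio p(y | gamma_j = 1, ...) / p(y | gamma_j = 0, ...)
xx = x_j @ x_j
ratio = (sigma2 * xx + 1.0) ** -0.5 * np.exp(
    0.5 * tau * (x_j @ y_tilde) ** 2 / (1.0 / sigma2 + xx))

# Direct evaluation of the two Gaussian densities for comparison
cov1 = (sigma2 * np.outer(x_j, x_j) + np.eye(n)) / tau
cov0 = np.eye(n) / tau
direct = (multivariate_normal.pdf(y_tilde, cov=cov1) /
          multivariate_normal.pdf(y_tilde, cov=cov0))
print(abs(ratio / direct - 1.0) < 1e-8)   # the two agree
```

The closed form avoids any n × n determinant or inverse, via the matrix determinant lemma and Sherman-Morrison, which is why the per-SNP cost is so low.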

4.3 A Fast Novel MCMC Algorithm for BVSR using ICF

For a whole-genome dataset, which typically contains millions of genetic variants, we have to run a very large number of MCMC iterations to obtain good-quality posterior estimates. Filtering the variants before running MCMC is not recommended, in light of the small effect sizes and the collinearity of the data. Bayesian methods then become less attractive, since an MCMC algorithm could take a week to produce sensible results on a state-of-the-art computer.

The main reason is that, when |γ| is large, evaluating (4.5) is very time-consuming due to the calculation of

1. the determinant of a |γ| × |γ| matrix, |X_γ^t X_γ + σ^{-2}I|;

2. the inverse of a |γ| × |γ| matrix, (X_γ^t X_γ + σ^{-2}I)^{-1}.


In this section, I describe a novel MCMC algorithm that bypasses the calculation of the determinant using the exchange algorithm, and efficiently evaluates the inverse using the ICF algorithm introduced in Chap. 3.

4.3.1 The Exchange Algorithm

The exchange algorithm of Murray et al. [2012] is a variation of the auxiliary variable method of Møller et al. [2006]. It was devised for sampling from "doubly intractable" posterior distributions, where the marginal likelihood itself has an intractable normalizing constant; in our problem this corresponds to the determinant term that we want to avoid computing. To illustrate the idea, define

    Z(γ, σ²(h,γ)) = σ^{-|γ|} |X_γ^t X_γ + σ^{-2}I|^{-1/2},
    f(y | γ, σ²(h,γ)) = (y^t y − y^t X_γ(X_γ^t X_γ + σ^{-2}I)^{-1} X_γ^t y)^{-n/2}.

Thus, by (4.5), we have p(y | γ, σ²) ∝ Z(γ, σ²) f(y | γ, σ²). The exchange algorithm estimates the ratio Z(γ∗, σ²∗)/Z(γ, σ²), which is independent of y, by the unbiased importance-sampling estimate f(y′ | γ, σ²)/f(y′ | γ∗, σ²∗), where y′ is sampled from p(· | γ∗, σ²∗). The Metropolis-Hastings ratio of the exchange algorithm is

    α((h,γ), (h∗,γ∗)) = [f(y′ | γ, σ²) f(y | γ∗, σ²∗)] / [f(y′ | γ∗, σ²∗) f(y | γ, σ²)] · [p(γ∗) q(γ | γ∗)] / [p(γ) q(γ∗ | γ)],   y′ ∼ p(· | γ∗, σ²∗).    (4.7)

(See Chap. 4.2.2 for the calculation of p(γ) and q(γ∗ | γ).) It should be pointed out that when sampling y′ we do not need to sample τ, because the scaling of y′ always cancels out in the calculation of the acceptance ratio.
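The importance-sampling identity behind the exchange algorithm is easy to verify on a toy model where Z is known in closed form. Here f(y | θ) = exp(−θy²/2) is an unnormalized Gaussian with Z(θ) = √(θ/2π); this is a sketch of the identity only, not the BVSR setting.

```python
import numpy as np

# Toy model: f(y | theta) = exp(-theta y^2 / 2) (unnormalized), so that
# p(y | theta) = Z(theta) f(y | theta) with Z(theta) = sqrt(theta / (2 pi)).
def f(y, theta):
    return np.exp(-0.5 * theta * y ** 2)

theta, theta_star = 4.0, 1.0
rng = np.random.default_rng(4)
# y' ~ p(. | theta_star); each weight f(y' | theta) / f(y' | theta_star)
# is an unbiased estimate of Z(theta_star) / Z(theta) = sqrt(theta_star / theta).
y_prime = rng.normal(0.0, 1.0 / np.sqrt(theta_star), size=200_000)
estimate = np.mean(f(y_prime, theta) / f(y_prime, theta_star))
print(abs(estimate - np.sqrt(theta_star / theta)) < 0.01)  # close to 0.5
```

Note the weights here are bounded above by 1 (θ > θ∗), so the Monte Carlo average has finite variance; a single draw, as used in (4.7), is already unbiased.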

By checking the detailed balance condition, this strategy can be shown to leave the posterior distribution invariant. The proof was omitted in Murray et al. [2012], so I give it here. Let θ = (γ, σ²). We only need to check

    [q(θ∗ | θ) ∫ min{α(θ, θ∗), 1} f(y′ | θ∗) Z(θ∗) dy′] / [q(θ | θ∗) ∫ min{α(θ∗, θ), 1} f(y′ | θ) Z(θ) dy′]
    = [Z(θ∗) f(y | θ∗) p(θ∗)] / [Z(θ) f(y | θ) p(θ)],    (4.8)

where

    α(θ, θ∗) = [f(y | θ∗) / f(y | θ)] · [f(y′ | θ) / f(y′ | θ∗)] · [p(θ∗) q(θ | θ∗)] / [p(θ) q(θ∗ | θ)].

Let Y = {y′ : α(θ, θ∗) > 1}. We obtain

    q(θ∗ | θ) ∫ min{α(θ, θ∗), 1} f(y′ | θ∗) Z(θ∗) dy′
    = Z(θ∗) ∫_Y q(θ∗ | θ) f(y′ | θ∗) dy′ + Z(θ∗) [f(y | θ∗) p(θ∗) / (f(y | θ) p(θ))] ∫_{Y^c} q(θ | θ∗) f(y′ | θ) dy′,

    q(θ | θ∗) ∫ min{α(θ∗, θ), 1} f(y′ | θ) Z(θ) dy′
    = Z(θ) ∫_{Y^c} q(θ | θ∗) f(y′ | θ) dy′ + Z(θ) [f(y | θ) p(θ) / (f(y | θ∗) p(θ∗))] ∫_Y q(θ∗ | θ) f(y′ | θ∗) dy′.

Then (4.8) can be confirmed by straightforward calculation.

4.3.2 Updating of the Cholesky Decomposition

In our MCMC algorithm, a new value for σ is proposed in every iteration, since σ is determined by h and γ. Hence the matrix inverse in (4.5) should never be computed directly; we only need the ridge estimator,

    (X_γ^t X_γ + σ^{-2}I)^{-1} X_γ^t y.

Our novel ICF algorithm, introduced in Chap. 3.4, was shown to be better than all the other common methods. To implement ICF, we need the Cholesky decomposition of X_γ^t X_γ, which at first glance is very undesirable, since the Cholesky decomposition can take much more time than ICF itself. However, in our MCMC algorithm, most of the time we propose to add or remove only one column of X_γ. Even when a long-range proposal is made, the number of columns changed is often very small compared with |γ|. The Cholesky decomposition can then be obtained by updating, which is very fast. The details are laid out below; in principle the procedure is very similar to the updating of a QR factorization [Golub and Van Loan, 2012, Chap. 12.5].

Let the current Cholesky decomposition be

    X_γ^t X_γ = R_γ^t R_γ,

where R_γ is always upper-triangular. When a new SNP x_j is added to the model, we attach it as the last column of X_γ and compute the corresponding entries of R_γ by forward substitution according to

    R_γ^t r_{|γ|+1} = X_γ^t x_j.

The forward substitution needs only ∼|γ|² flops. Calculating the r.h.s. in fact requires many more operations (2n|γ| flops), but at least part of it can be precomputed and kept in memory.
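The add-one-column update can be sketched with NumPy/SciPy; `solve_triangular(..., trans='T')` performs the forward substitution R_γ^t r = X_γ^t x_j, and the new diagonal entry follows from the norm of the appended column. The toy sizes are arbitrary.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(5)
n, k = 100, 6
X = rng.standard_normal((n, k))
x_new = rng.standard_normal(n)

R = cholesky(X.T @ X, lower=False)          # current upper-triangular factor

# Append x_new as the last column: forward substitution (~|gamma|^2 flops)
r = solve_triangular(R, X.T @ x_new, trans='T')
d = np.sqrt(x_new @ x_new - r @ r)          # new diagonal entry
R_new = np.block([[R, r[:, None]], [np.zeros((1, k)), d]])

X_aug = np.column_stack([X, x_new])
print(np.allclose(R_new.T @ R_new, X_aug.T @ X_aug))  # valid factor of X_aug^t X_aug
```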

When a SNP is deleted, we use Givens rotations [Golub and Van Loan, 2012, Chap. 5.1] to introduce zeros. For example, suppose |γ| = 4 and we want to remove the second column of X_γ; the new matrix is denoted by X_γ∗. We first remove the second column of R_γ, which results in

    R̃ = [ r_11  r_13  r_14
          0     r_23  r_24
          0     r_33  r_34
          0     0     r_44 ].

Though R̃^t R̃ = X_γ∗^t X_γ∗, R̃ is not upper-triangular. We now construct a Givens rotation matrix

    G_1 = [ 1   0        0        0
           0   r_23/ρ   r_33/ρ   0
           0  −r_33/ρ   r_23/ρ   0
           0   0        0        1 ],

where ρ = √(r_23² + r_33²). As a rotation matrix, G_1 clearly satisfies G_1^t G_1 = I. Moreover, G_1 R̃ sets r_33 to zero, as we want. Similarly, we can then define a matrix G_2 to set r_44 to zero. Note that the order cannot be interchanged (we must eliminate r_33 first). For a general upper-triangular matrix R_γ, using Givens rotations to remove the (|γ| − k)-th column requires only 3k(k + 1) flops.
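The deletion update can be sketched as follows: after dropping column j of the factor, successive Givens rotations eliminate the stray subdiagonal entries, in order. The helper name is mine, and the check at the end verifies the downdated factor against a fresh decomposition.

```python
import numpy as np
from scipy.linalg import cholesky

def remove_column(R, j):
    """Re-triangularize the Cholesky factor after deleting column j,
    using Givens rotations to zero the stray subdiagonal entries."""
    R = np.delete(R, j, axis=1)
    m = R.shape[1]
    for c in range(j, m):                  # eliminate R[c+1, c], in order
        a, b = R[c, c], R[c + 1, c]
        rho = np.hypot(a, b)
        cs, sn = a / rho, b / rho
        G = np.array([[cs, sn], [-sn, cs]])  # 2x2 Givens rotation
        R[c:c + 2, :] = G @ R[c:c + 2, :]
    return R[:m, :]                        # drop the now-zero last row

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 5))
R = cholesky(X.T @ X, lower=False)
R2 = remove_column(R, 1)                   # delete the second column
X2 = np.delete(X, 1, axis=1)
print(np.allclose(R2.T @ R2, X2.T @ X2))   # matches a fresh factorization
```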

4.3.3 Summary of the fastBVSR Algorithm

The new MCMC algorithm is called fastBVSR; it is summarized below.

Algorithm fastBVSR
  Initialize X_γ^{(0)} and calculate the corresponding Cholesky decomposition
  for i = 1 to N_mcmc do
    Propose h∗, γ∗ using h^{(i)}, γ^{(i)} and compute σ∗ by (4.3)
    Compute the Cholesky decomposition of X_γ∗^t X_γ∗ by updating
    Draw y′ ∼ p(· | γ∗, σ∗) using τ = 1
    Calculate α((h,γ), (h∗,γ∗)) by (4.7) using the ICF algorithm
    Set h^{(i+1)} = h∗ and γ^{(i+1)} = γ∗ with probability min{1, α}; otherwise keep the current values
    if (i + 1) mod 1000 = 0 then
      Sample π^{(i+1)} and τ^{(i+1)} and do Rao-Blackwellization
    end if
  end for

4.4 GWAS Simulation

The Height dataset, which contains 3,925 subjects and nearly 300K SNPs, is used for simulation; it is the dataset used in Yang et al. [2010] (see Chap. 7.7.2 for more details). In practice, it is usually not recommended to run an MCMC algorithm on as many as 300K variants, since the chain would have to be run for a huge number of iterations just to explore all the SNPs, let alone achieve convergence. Besides, for most purposes, especially with our linear regression model, we can safely partition the data by chromosome. Therefore, to reduce the number of SNPs, the following three sub-datasets are constructed.

1. Height-ind: 10K SNPs evenly sampled from the whole genome.

2. Height-chr6: the first 10K SNPs on chromosome 6.

3. Height-5C: the 97,370 SNPs on chromosomes 1 through 5.


The first two sub-datasets are used to study the effect of multicollinearity: Height-ind represents a dataset with little or no multicollinearity, while Height-chr6 has severe multicollinearity due to linkage disequilibrium. Chromosome 6 is picked because the MHC (major histocompatibility complex) region located on it is well known for its highly complicated LD structure. The last sub-dataset contains around 100K SNPs on five chromosomes and represents the most difficult case. Our algorithm fastBVSR is implemented in C++ and is compared with GCTA (see Chap. 7.5).

4.4.1 Posterior Inference for the Heritability

Recall the definition of h given in (4.3). It is not exactly equal to the proportion of variance explained (PVE), which is defined by

    PVE := [∑_{i=1}^n (∑_{j=1}^N (x_{ij} − x̄_j)β_j)²] / [∑_{i=1}^n (y_i − ȳ)²]

(n is the number of subjects and N is the total number of SNPs). Even the expected value of PVE is slightly different from h, since the expectation of the ratio of two (in fact correlated) random variables is not equal to the ratio of their expectations. However, the parameter h has the same meaning as the ratio of the variance components in the linear mixed model, which is the heritability definition used by GCTA. Hence we refer to the posterior inference for h as the posterior inference for the heritability.

For both the Height-ind and Height-chr6 datasets, phenotypes with heritability h = 0, 0.01, . . . , 0.99 are simulated. For each choice of heritability, 200 causal SNPs are randomly sampled. Figure 4.1 shows that for these two small datasets with 200 causal SNPs, the heritability can be very accurately estimated from the posterior. GCTA also appears unbiased, but on the Height-ind dataset its estimates have a larger mean absolute error than the posterior mean estimates of fastBVSR(a) (Table 4.1). Comparing the two panels in the middle column shows that the collinearity of the Height-chr6 dataset slows down the convergence of fastBVSR.

                  fastBVSR(a)   fastBVSR(b)   GCTA
    Height-ind       0.0217        0.0399     0.0289
    Height-chr6      0.0271        0.0756     0.0209

Table 4.1: Mean absolute error (MAE) of the heritability estimation. The MAE is calculated as the mean of the absolute difference between the true and the estimated heritabilities. See Fig. 4.1 for the simulation settings.

For the Height-5C dataset, the number of causal SNPs is set to 1000, and phenotypes are simulated with heritability h = 0, 0.05, . . . , 0.95. The GCTA estimates of the heritability still appear unbiased, though with relatively large variance, while fastBVSR shows a clear tendency to underestimate the heritability (Fig. 4.2). To explain this behaviour, first recall the rationale for variable selection. Just as with the multiple-comparison correction for p-values, if we have a larger number of candidate SNPs, we could by chance observe more, and stronger, spurious signals; hence a stronger signal is required before a SNP is included in the model. In this simulation, both the number of candidate SNPs and the number of causal SNPs are much larger than in the previous ones (a larger number of causal SNPs implies a smaller σ). Therefore, it becomes much harder for fastBVSR to detect all the signals. A much larger sample size would help fastBVSR overcome this problem.


[Figure 4.1: six panels — Height-ind and Height-chr6, each analyzed by fastBVSR(a), fastBVSR(b), and GCTA.]

Figure 4.1: Heritability estimation with 200 causal SNPs in the Height-ind and Height-chr6 datasets. For both datasets, we run fastBVSR with two settings: (a) 20K burn-in iterations, 100K sampling iterations and Rao-Blackwellization every 1000 iterations; (b) 2K burn-in iterations, 10K sampling iterations and Rao-Blackwellization every 200 iterations. We compare the results with GCTA. The grey bars represent 95% posterior intervals for fastBVSR and ±2 SE for GCTA.


[Figure 4.2: two panels — Height-5C analyzed by fastBVSR and by GCTA.]

Figure 4.2: Heritability estimation with 1000 causal SNPs in the Height-5C dataset. We run fastBVSR with 200K burn-in iterations, 1M sampling iterations and Rao-Blackwellization every 10K iterations. The grey bars represent 95% posterior intervals for fastBVSR and ±2 SE for GCTA.

4.4.2 Calibration of Posterior Inclusion Probabilities

The posterior estimates of the model size (the number of SNPs in the model, |γ|) in the previous simulations are shown in Fig. 4.3. As explained in Chap. 4.2.1, when the heritability is small, the posterior distribution of the model size has a larger variance.

It is not realistic to recover all the true signals and accurately estimate the model size in a real data analysis, since many SNPs with tiny effects cannot be identified confidently by any statistical method. However, the posterior inclusion probability (PIP) serves as a measure of how likely it is that a SNP is truly associated with the trait. From a practical standpoint, the calibration of the posterior inclusion probabilities is therefore of high importance. To study the calibration, the SNPs are divided into bins by PIP. Suppose the k-th bin contains B_k SNPs with PIPs f_1, . . . , f_{B_k}, and let M_k be the predicted number of true positives in this bin. Then

    E[M_k] = ∑_{i=1}^{B_k} f_i,    Var(M_k) = ∑_{i=1}^{B_k} f_i(1 − f_i),

and thus

    E[M_k/B_k] = (1/B_k) ∑_{i=1}^{B_k} f_i,    Var(M_k/B_k) = (1/B_k²) ∑_{i=1}^{B_k} f_i(1 − f_i).    (4.9)

[Figure 4.3: five panels — Height-ind fastBVSR(a) and (b), Height-chr6 fastBVSR(a) and (b), and Height-5C.]

Figure 4.3: The posterior estimates of the model size, |γ|. See Fig. 4.1 and Fig. 4.2 for the simulation settings. The truth is marked by the red lines. The grey bars represent 95% posterior intervals.
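The bin-wise check behind (4.9) can be sketched with synthetic, perfectly calibrated PIPs; everything below is simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
pips = rng.random(5000)                       # PIP estimates, one per SNP
truth = rng.random(5000) < pips               # simulate well-calibrated truth

bins = np.linspace(0.0, 1.0, 21)              # 20 PIP bins
idx = np.digitize(pips, bins) - 1
diffs = []
for k in range(20):
    f = pips[idx == k]
    if f.size == 0:
        continue
    predicted = f.mean()                      # E[M_k / B_k] from (4.9)
    observed = truth[idx == k].mean()         # realized true-positive rate
    diffs.append(abs(observed - predicted))
print(max(diffs) < 0.15)                      # every bin close to the diagonal
```

With genuinely calibrated PIPs, each bin's observed proportion of true positives fluctuates around the bin's mean PIP with the standard deviation given by (4.9), which is what Figs. 4.4 and 4.5 display as error bars.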

M_k/B_k is compared with the observed proportion of true positives in Fig. 4.4. The Rao-Blackwellized PIP estimators, defined in (4.6), exhibit a significant improvement over the crude PIP estimates, especially when the number of MCMC iterations is too small to achieve convergence (Fig. 4.5). In particular, for the Height-ind dataset, the calibration of the Rao-Blackwellized estimates is impressively accurate. For the other two datasets, which exhibit collinearity, the calibration is arguably acceptable.

4.4.3 Prediction Performance

Prediction is one of the ultimate goals of variable selection. Even if a procedure cannot produce good estimates of the heritability, the model size or the inclusion probabilities, it is still practically useful as long as it predicts well. Consider a future observation y′,

    y′ = ∑_{i=1}^N β_i x′_i + µ + e′ = µ + ∑_{γ_i=1} β_i x′_i + e′,

where e′ ∼ N(0, τ^{-1}). In GWAS, x′_i can be assumed to be drawn from a centered Binom(2, f_i) distribution, where f_i is the minor allele frequency of the i-th SNP. The covariates x′_1, . . . , x′_N need not be independent, as in the Height-chr6 and Height-5C datasets.


[Figure 4.4: six panels — PIP and RB-PIP calibration for Height-ind(a), Height-chr6(a), and Height-5C.]

Figure 4.4: Calibration of the posterior inclusion probabilities. For Height-ind and Height-chr6 we report the results of setting (a) (see Fig. 4.1 for details). PIP stands for posterior inclusion probability and RB-PIP for the Rao-Blackwellized estimate of the posterior inclusion probability. The SNPs are divided into 20 bins by PIP (RB-PIP), and the y-axis shows the proportion of true positives in each bin. The x-axis is the predicted proportion of true positives (see (4.9)). The grey bars represent ±2 SD.


Figure 4.5: Calibration of the posterior inclusion probabilities for Height-ind and Height-chr6 under setting (b) (see Fig. 4.1 for details). Panels: Height-ind(b) and Height-chr6(b); top row: PIP, bottom row: RB-PIP.


Let \hat{β}_1, . . . , \hat{β}_N be the estimates from some procedure. Then y′ would be estimated by

\hat{y}′ = \hat{μ} + \sum_{i=1}^{N} \hat{β}_i x′_i.

Therefore, the mean squared prediction error (MSPE) is calculated as

MSPE(\hat{β}, \hat{μ}) := E[(y′ − \hat{y}′)^2 | \hat{β}, \hat{μ}] = E[( \sum_{i=1}^{N} (β_i − \hat{β}_i) x′_i + (μ − \hat{μ}) + e′ )^2 | \hat{β}, \hat{μ}]

= τ^{−1} + (μ − \hat{μ})^2 + Var( \sum_{i=1}^{N} (β_i − \hat{β}_i) x′_i )

= τ^{−1} + (μ − \hat{μ})^2 + (β − \hat{β})^t Cov(X′) (β − \hat{β}).

The covariance matrix Cov(X′) can simply be estimated by the observed sample covariance matrix. Since μ is usually estimated by the sample mean, its mean squared error is τ^{−1}/n. Thus, we define the MSPE for \hat{β} by

MSPE(\hat{β}) := \frac{n+1}{n} τ^{−1} + \frac{1}{n} (β − \hat{β})^t X^t X (β − \hat{β}) = \frac{n+1}{n} τ^{−1} + \frac{1}{n} ||Xβ − X\hat{β}||_2^2.   (4.10)

Consider two estimators for β: the optimal estimator \hat{β} = β and the null estimator \hat{β} = 0. Their MSPEs are given by

MSPE(β) = \frac{n+1}{n} τ^{−1},

MSPE(0) = \frac{n+1}{n} τ^{−1} + \frac{1}{n} β^t X^t X β.


We define a metric, the relative prediction gain (RPG), to measure the performance of an estimator \hat{β}:

RPG(\hat{β}) := \frac{MSPE(0) − MSPE(\hat{β})}{MSPE(0) − MSPE(β)}.   (4.11)

Though τ is assumed to be known when computing the MSPE, it cancels out in the expression for RPG in (4.11).
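These definitions translate directly into code. The sketch below (synthetic X and β, not the thesis datasets) implements (4.10) and (4.11), and confirms that τ cancels in the RPG:

```python
import numpy as np

def mspe(X, beta_true, beta_hat, tau):
    """MSPE for an estimate of beta, as in (4.10):
    (n+1)/n * tau^{-1} + ||X beta_true - X beta_hat||_2^2 / n."""
    n = X.shape[0]
    return (n + 1) / n / tau + np.sum((X @ (beta_true - beta_hat)) ** 2) / n

def rpg(X, beta_true, beta_hat, tau):
    """Relative prediction gain (4.11); tau cancels in the ratio."""
    null = mspe(X, beta_true, np.zeros_like(beta_true), tau)
    oracle = mspe(X, beta_true, beta_true, tau)
    return (null - mspe(X, beta_true, beta_hat, tau)) / (null - oracle)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 20))
beta = 0.3 * rng.standard_normal(20)
print(rpg(X, beta, beta, tau=1.0))         # oracle estimate: RPG = 1
print(rpg(X, beta, 0.5 * beta, tau=1.0))   # shrunken estimate: RPG = 0.75
print(rpg(X, beta, 0.5 * beta, tau=10.0))  # tau does not change the RPG
```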

Figure 4.6 compares the RPGs of the BVSR regression estimates with the BLUPs (best linear unbiased predictors) used by GCTA. When the true heritability is very small, it makes little sense to talk about RPG, since the denominator, MSPE(0) − MSPE(β) (the variation that could be explained by the predictors), is very small and the RPG will have a very large variance. For the Height-ind dataset, fastBVSR shows a substantial advantage over GCTA. Surprisingly, even in setting (b), where the previous plots show that the MCMC clearly has not attained convergence, the Rao-Blackwellized regression estimators still perform very well. For the Height-chr6 dataset, fastBVSR is again much better than GCTA, especially when the heritability is in (0.2, 0.8), the region of most practical interest. However, Rao-Blackwellization does not seem to provide much improvement (it is even worse when the heritability is close to 1), which probably should be attributed to the collinearity of the dataset. For the Height-5C dataset, fastBVSR is no longer advantageous, for two main reasons: the slow convergence caused by the enormous sample space and the collinearity in the dataset.

4.4.4 Wall-time Usage

Lastly, Fig. 4.7 reports the wall time used by fastBVSR for 10K MCMC iterations under different simulation settings. Since the calculation of the ridge estimators


Figure 4.6: Relative prediction gain of fastBVSR and GCTA for the simulated datasets with heritability ≥ 0.05. Panels: Height-ind fastBVSR(a), Height-ind fastBVSR(b), Height-chr6 fastBVSR(a), Height-chr6 fastBVSR(b), and Height-5C. See Fig. 4.1 and Fig. 4.2 for the simulation settings. The RPG is computed by (4.10) and (4.11). BVSR stands for the crude regression estimates from fastBVSR; BVSR-RB represents the Rao-Blackwellized estimates from fastBVSR; GCTA stands for the BLUPs of the linear mixed model.


is the most time-consuming step, the wall-time usage is mostly determined by the

posterior distribution of the model size. The simulation setting is unimportant.

Figure 4.7: Wall time used by fastBVSR for 10K MCMC iterations (panels: Height-ind, Height-chr6, Height-5C). The x-axis is the posterior mean of the model size, which mainly determines the wall-time usage of fastBVSR. For the Height-ind and Height-chr6 datasets, we use the results from setting (a). Note that for the two small datasets, we do Rao-Blackwellization every 1K iterations; for Height-5C, we do Rao-Blackwellization every 10K iterations.


Chapter 5

Scaled Bayes Factors

5.1 Motivations for Scaled Bayes Factors

The null-based Bayes factor, in general, is defined by

BF_null(M) := \frac{p(y | M)}{p(y | M_0)},

where M_0 denotes the null model and M denotes the model of interest. Immediately we find that its expectation under the null model is 1, since

E[BF_null | M_0] = \int \frac{p(y | M)}{p(y | M_0)} p(y | M_0) dy = \int p(y | M) dy = 1.   (5.1)

The expectation of log BF_null, however, is not zero. In fact, by Jensen's inequality, E[log BF_null | M_0] < 0, and the exact value depends on the model (and, in regression, on the design matrix). Should we instead fix this expected value to some constant, replacing property (5.1)? There are two practical reasons to do so. First, log BF_null, rather than BF_null, is often used as the measure of evidence (see Jeffreys [1961, app. B] and Kass and Raftery [1995]). Hence aligning the null expectation of log BF_null would provide


more practical convenience. Second, the distribution of Bayes factors is usually heavy tailed, for example in linear regression. Thus an observation y under the null usually produces a Bayes factor much smaller than 1, and even if the Bayes factor is averaged over many observations, the sample mean is hardly close to 1. In contrast, the expectation of log BF_null is much easier to approximate by sampling, owing to its smaller variance.
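This point can be illustrated numerically with the asymptotic null form of the Bayes factor from Theorem 2.7 (single-predictor case; λ chosen close to 1 for illustration):

```python
import numpy as np

# Under the null, 2*log(BF_null) = lam*Q + log(1 - lam) with Q ~ chi2_1
# (single-predictor case). E[BF_null | M0] = 1 exactly, yet for lam near 1
# the distribution of BF_null is so heavy tailed that a sample mean of BF
# is typically not close to 1, while the sample mean of log(BF_null)
# concentrates quickly around its expectation (lam + log(1 - lam))/2 < 0.
rng = np.random.default_rng(2)
lam, m = 0.99, 2000
Q = rng.chisquare(1, size=m)
log_bf = 0.5 * (lam * Q + np.log(1.0 - lam))

print("sample mean of BF     :", np.exp(log_bf).mean())
print("sample mean of log BF :", log_bf.mean())
print("E[log BF | M0]        :", 0.5 * (lam + np.log(1.0 - lam)))
```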

For the time being, let’s simply define the scaled Bayes factor (sBF) by

log sBF = log BFnull − E[log BFnull | M0]. (5.2)

It immediately follows that E[log sBF | M_0] = 0. Recall our multiple linear regression model given in (1.1),

y | β, τ ∼ MVN(Xβ, τ−1I),

β | τ,V ∼ MVN(0, τ−1V ),

τ | κ1, κ2 ∼ Gamma(κ1/2, κ2/2), κ1, κ2 → 0.

By Theorem 2.7 and Corollary 2.8, omitting the o_p(1) error, the null expectation of log BF_null is given by

E[2 log BF_null | β = 0] = \sum_{i=1}^{p} (λ_i + log(1 − λ_i)),

where λ_1, . . . , λ_p are the eigenvalues of H = X(X^t X + V^{−1})^{−1} X^t. Furthermore, if we define Q_i := (u_i^t z)^2 as in Theorem 2.7, where z = τ^{1/2} y and u_i is the corresponding eigenvector of H (see Chap. 2 for more information), we can write the asymptotic


expression for sBF as

2 log sBF = \sum_{i=1}^{p} λ_i (Q_i − 1).

The new statistic is called the scaled Bayes factor since

sBF = BF_null \prod_{i=1}^{p} \frac{e^{−λ_i/2}}{\sqrt{1 − λ_i}}.   (5.3)

Each scaling component, e^{−λ_i/2}/\sqrt{1 − λ_i}, is monotone increasing in λ_i. As λ_i ↓ 0, the scaling factor goes to 1; as λ_i ↑ 1, it goes to infinity.
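Both properties, and the zero-mean property E[log sBF | M_0] = 0, are easy to verify numerically (illustrative λ values):

```python
import numpy as np

# scaling component exp(-lam/2)/sqrt(1-lam): equals 1 at lam = 0,
# is increasing on (0, 1), and diverges as lam -> 1
def scaling(lam):
    return np.exp(-lam / 2.0) / np.sqrt(1.0 - lam)

grid = np.linspace(0.0, 0.999, 1000)
vals = scaling(grid)
assert abs(vals[0] - 1.0) < 1e-12 and np.all(np.diff(vals) > 0)

# zero-mean property: with 2*log(sBF) = sum_i lam_i*(Q_i - 1) and
# Q_i ~ chi2_1 under the null, E[log sBF | M0] = 0 by construction
rng = np.random.default_rng(3)
lams = np.array([0.2, 0.5, 0.8])
Q = rng.chisquare(1, size=(100_000, 3))
log_sbf = 0.5 * (Q - 1.0) @ lams
print("mean of log sBF under the null:", log_sbf.mean())
```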

Figure 5.1: How BF_null and sBF change with σ in simple linear regression with P_BF = 10^{−6}, for n = 200, 500 and 1000. The y-axis is log_10 sBF (BF). BF_null is in gray and sBF in black. BF_null and sBF are computed assuming the covariate has unit variance.

Let's focus on the independent normal prior, V = σ^2 I, where we have λ_i = d_i^2/(d_i^2 + σ^{−2}) (d_i is the i-th singular value of X). Figure 5.1 shows how the two Bayes factors change with the value of σ in simple linear regression. An arguable benefit of sBF is that the term log(1 − λ_i) has been removed. Therefore, as long as Q_i > 1, log sBF is monotone increasing in λ_i with the other λ's fixed. Recall our


result on the distribution of Q_i under the alternative, given in Proposition 2.14. λ_i can in fact be viewed as a measure of power. To see this, consider the alternative β ∼ MVN(0, τ^{−1} σ^2 I), under which

(1 − λ_i) Q_i \overset{i.i.d.}{∼} χ^2_1.

If λ_i is larger, Q_i tends to be greater, which implies a smaller p-value. For the p-value of the likelihood ratio test, this is clear, since P_LR is computed by comparing \sum_i Q_i to the distribution function of χ^2_p. For the p-value of the Bayes factor, note that the ratios λ_1/λ_p, . . . , λ_{p−1}/λ_p also have a small effect, but overall we can say that a greater Q_i often implies a smaller P_BF. For a fixed alternative model, it can be argued in a similar fashion that, for some given σ, λ_i measures the power in the direction of v_i, the i-th right-singular vector of X, by using Proposition 2.13.
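As a side note, the equivalence between the two characterizations of λ_i is easy to check numerically; the sketch below uses a random design matrix for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 300, 5, 0.2
X = rng.standard_normal((n, p))

# via the singular values of X (independent normal prior V = sigma^2 I)
d = np.linalg.svd(X, compute_uv=False)
lam_svd = d**2 / (d**2 + sigma**-2)

# via the eigenvalues of H = X (X^t X + V^{-1})^{-1} X^t
H = X @ np.linalg.solve(X.T @ X + np.eye(p) / sigma**2, X.T)
lam_H = np.sort(np.linalg.eigvalsh(H))[-p:]   # the p nonzero eigenvalues

print(np.max(np.abs(np.sort(lam_svd) - lam_H)))  # close to machine precision
```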

Hence, when faced with two tests with identical p-values that suggest the null should be rejected, the scaled Bayes factor tends to favor the one with the larger power, which might itself be a desirable property, since the Bayes factor is often thought of as a link between power and significance [Sawcer, 2010, Stephens and Balding, 2009] (a property missing in the unscaled Bayes factor). The Bayes factor, together with its property E[BF_null | M_0] = 1, of course has many advantages. However, just as a coin has two sides, we have to trade some nice properties for others.

5.2 An Application to Intraocular Pressure

GWAS Datasets

The scaled Bayes factor is now applied to analyze the IOP (intraocular pressure)

dataset. For details of this dataset, see Chap. 7.7.1. Age, sex, and 6 leading

135

Page 136: THEORETICAL AND COMPUTATIONAL STUDIES OF BAYESIAN …quan/papers/Thesis.pdf · BIOLOGY & MOLECULAR BIOPHYSICS GRADUATE PROGRAM Signed Aleksandar Milosavljevic, Ph.D., Director of

principal components are regressed out from the raw phenotypes (the average IOP of the two eyes). After quantile normalization, the residuals are used as the phenotypes for single-SNP analysis. BF_null, sBF and P_BF are computed using the prior σ = 0.2, which represents a prior belief of a small but noticeable effect size [cf. Burton et al., 2007].

We first compared BF_null and sBF by minor allele frequency (MAF) bins. Different MAF bins correspond to different bins of the informativeness (λ_1) of the SNPs. Figure 5.2 shows that in each bin the scatter of log_10 sBF against log_10 BF_null is roughly parallel to the line y = x, and, more importantly, that the larger the MAF, the larger the difference log_10 sBF − log_10 BF_null, as explained by (5.3). Another noticeable feature is that the minimum value of sBF is larger than that of BF_null, because BF_null can go to 0 while sBF is bounded below by e^{−λ_1/2}.

Figure 5.2: The distributions of log10 BF and log10 sBF by different bins of minorallele frequency (MAF). The bins are marked by color. In the left panel thediagonal line is y = x.

Next we examined the ranking of the SNPs by the different test statistics. Table 5.1 contains the top 20 SNPs in the ranking by BF_null. Rows are sorted according to the SNP's chromosome and position. Incidentally, the top 2 hits (rs7518099 and


SNP         Chr  Pos     MAF    bf(y)      bf(ỹ)  sbf(y)     sbf(ỹ)  p(y)       p(ỹ)
rs12120962  1    10.53   0.384  3.88 (5)   -0.90  4.56 (4)   -0.21   5.63 (5)   0.01
rs12127400  1    10.54   0.384  3.61 (9)   -0.90  4.29 (8)   -0.21   5.34 (9)   0.01
rs4656461   1    163.95  0.140  5.71 (2)   -0.57  6.26 (2)   -0.03   7.51 (2)   0.46
rs7411708   1    163.99  0.428  3.69 (8)   -0.68  4.38 (7)    0.01   5.43 (8)   0.52
rs10918276  1    163.99  0.427  3.59 (10)  -0.66  4.28 (9)    0.03   5.33 (10)  0.54
rs7518099   1    164.00  0.140  6.04 (1)   -0.61  6.58 (1)   -0.07   7.85 (1)   0.38
rs972237    2    125.89  0.119  3.05 (15)  -0.62  3.56 (17)  -0.11   4.65 (18)  0.31
rs2728034   3    2.72    0.090  3.80 (6)   -0.62  4.27 (10)  -0.15   5.45 (7)   0.22
rs7645716   3    46.31   0.254  3.34 (11)  -0.88  3.98 (11)  -0.16   5.03 (11)  0.21
rs7696626   4    8.73    0.023  2.96 (18)  -0.33  3.20 (42)  -0.01   4.70 (16)  0.31
rs301088    4    53.53   0.473  2.95 (20)  -0.81  3.64 (16)  -0.11   4.65 (17)  0.31
rs2025751   6    51.73   0.466  3.78 (7)   -0.75  4.47 (6)   -0.06   5.53 (6)   0.41
rs10757601  9    26.18   0.443  3.09 (13)  -0.79  3.78 (12)  -0.10   4.80 (13)  0.33
rs10506464  12   62.50   0.164  2.97 (17)  -0.75  3.54 (18)  -0.18   4.59 (19)  0.15
rs10778292  12   102.78  0.140  4.00 (4)   -0.75  4.54 (5)   -0.21   5.68 (4)   0.02
rs2576969   12   102.80  0.271  3.07 (14)  -0.85  3.71 (14)  -0.20   4.75 (14)  0.08
rs17034938  12   102.85  0.127  3.23 (12)  -0.71  3.75 (13)  -0.19   4.85 (12)  0.13
rs1288861   15   43.50   0.120  2.95 (19)  -0.45  3.46 (20)   0.06   4.54 (21)  0.59
rs4984577   15   93.76   0.367  3.02 (16)  -0.64  3.69 (15)   0.04   4.71 (15)  0.56
rs12150284  17   9.97    0.353  4.95 (3)   -0.75  5.63 (3)   -0.07   6.75 (3)   0.38

Table 5.1: Top 20 single-SNP associations by BF_null (σ = 0.2). Pos: genomic position in megabase pairs (reference HG18); bf(y): log_10 BF(y); bf(ỹ): log_10 BF(ỹ); sbf(y): log_10 sBF(y); sbf(ỹ): log_10 sBF(ỹ); p(y): −log_10 P_BF(y); p(ỹ): −log_10 P_BF(ỹ). The rankings by the three statistics are given in the parentheses. ỹ is obtained by permuting y once. SNP IDs are in bold if they are mentioned specifically in the main text.

rs4656461) are the same for all three test statistics. The rankings by the three statistics are largely similar to one another, particularly so for the rankings by BF_null and P_BF. There is, however, a noticeable exception, SNP rs7696626, whose ranking by sBF is much worse than its rankings by BF_null and P_BF. Not surprisingly, this SNP has the smallest MAF (0.023) among the 20 SNPs included in Table 5.1. We permuted the phenotypes once and recomputed the three test statistics. Let ỹ denote the permuted phenotypes. log sBF(ỹ) is usually close to 0, whereas log BF_null(ỹ) is negative.

We also tried σ = 0.5 and found that, for most top signals in Table 5.1, BF_null


SNP         Chr  Pos     MAF    log_10 BF_null  log_10 sBF  −log_10 P_BF
rs12120962  1    10.53   0.384  3.549 (6)       4.624 (5)   5.628 (5)
rs12127400  1    10.54   0.384  3.271 (9)       4.346 (9)   5.339 (9)
rs4656461   1    163.95  0.140  5.494 (2)       6.424 (2)   7.507 (2)
rs7411708   1    163.99  0.428  3.360 (8)       4.438 (7)   5.434 (8)
rs10918276  1    163.99  0.427  3.258 (10)      4.336 (10)  5.328 (10)
rs7518099   1    164.00  0.140  5.829 (1)       6.758 (1)   7.852 (1)
rs972237    2    125.89  0.119  2.781 (14)      3.674 (17)  4.649 (18)
rs2728034   3    2.72    0.090  3.584 (5)       4.434 (8)   5.452 (7)
rs7645716   3    46.31   0.254  3.019 (12)      4.045 (11)  5.027 (11)
rs7696626   4    8.73    0.023  3.069 (11)      3.644 (18)  4.698 (16)
rs2025751   6    51.73   0.466  3.443 (7)       4.527 (6)   5.527 (6)
rs1081076   6    132.97  0.022  2.690 (19)      3.260 (44)  4.287 (36)
rs10757601  9    26.18   0.443  2.748 (15)      3.829 (13)  4.797 (13)
rs10778292  12   102.78  0.140  3.738 (4)       4.665 (4)   5.683 (4)
rs2576969   12   102.80  0.271  2.745 (16)      3.777 (14)  4.745 (14)
rs17034938  12   102.85  0.127  2.953 (13)      3.862 (12)  4.845 (12)
rs1955511   14   32.30   0.076  2.684 (20)      3.497 (25)  4.472 (25)
rs12150284  17   9.97    0.353  4.638 (3)       5.707 (3)   6.752 (3)
rs6017819   20   44.48   0.069  2.736 (17)      3.524 (24)  4.505 (23)
rs279728    20   44.51   0.087  2.704 (18)      3.541 (22)  4.515 (22)

Table 5.2: Top 20 single-SNP associations by BF_null (σ = 0.5). The "Pos" column gives the genomic position in megabase pairs (reference HG18). The rankings by the three statistics are given in the parentheses. SNP IDs are in bold if they are mentioned specifically in the main text.

becomes smaller and sBF grows larger, which is consistent with Fig. 5.1. Note that the p-value, P_BF, is not affected by the choice of σ. The rankings of the SNPs remained mostly unchanged. The result is provided in Table 5.2.

Lastly, although it was not our main objective, we examined the top hits in the association result. Our analysis reproduced three known genetic associations for IOP: namely, the TMCO1 gene on chromosome 1 (163.9M-164.0M), which was reported in van Koolwijk et al. [2012]; a single hit, rs2025751, in the PKHD1 gene on chromosome 6 [Hysi et al., 2014]; and a single hit, rs12150284, in the GAS7 gene on chromosome 17 [Ozel et al., 2014]. A potentially novel finding is the gene PEX14 on chromosome 1. Two SNPs, rs12120962 and rs12127400,


have modest association signals. PEX14 encodes an essential component of the peroxisomal import machinery. The protein interacts with the cytosolic receptor for proteins containing a PTS1 peroxisomal targeting signal. Incidentally, PTS1 is known to elevate the intraocular pressure [Shepard et al., 2007]. In addition, a mutation in PEX14 results in one form of Zellweger syndrome, and for children who suffer from Zellweger syndrome, congenital glaucoma is a typical neonatal-infantile presentation [Klouwer et al., 2015].

5.3 Scaled Bayes Factors in Variable Selection

5.3.1 Calibrating the Scaling Factors

Consider the Bayesian variable selection regression (BVSR) model described in Chap. 4.2. Clearly it is unwise to directly plug in the scaled Bayes factor defined in (5.2), since it is no longer consistent (see Friedman et al. [2001, Chap. 7.7] for the meaning of consistency). For example, suppose the effect size σ (or the heritability h) is given and y is generated from a linear regression model with causal SNPs X_γ with rank(X_γ) = |γ|. We now need to choose between two models, X_γ and X_γ′, the latter of which is given by X_γ′ = [X_γ, x_*] where X_γ^t x_* = 0. Then by our asymptotic result in Chap. 2,

2 log BF_null(X_γ′) − 2 log BF_null(X_γ) = λ_* Q_* + log(1 − λ_*),

where Q_* ∼ χ^2_1. If we had an infinitely large sample size, then λ_* would go to 1 and thus we would choose model X_γ with probability one. However, for the scaled


Bayes factor, we have

2 log sBF(X_γ′) − 2 log sBF(X_γ) = λ_* Q_* − λ_*.

Therefore, even if λ_* = 1, the wrong model would still have a positive posterior probability. Note that in BVSR we also need to compute the likelihoods for γ and γ′; however, the difference, log p(γ′) − log p(γ), is bounded and thus does not affect our conclusion for this simple example. The key message is that the scaled Bayes factor given in (5.2) does not impose a sufficiently large penalty on model complexity.

To solve this problem, recall that our motivation for sBF was simply to introduce a scaling factor for BF_null such that

log sBF = log BF_null − E[log BF_null | M_0] + log C,

where C does not depend on X given the model size. For the BVSR model, consider letting C = C(|γ|, σ^2). Since BF_null = p(y | M)/p(y | M_0), we have

sBF = C(|γ|, σ^2) exp(−E[log BF_null | M_0]) \frac{p(y | M)}{p(y | M_0)}.

To make sBF suitable for variable selection, it suffices to require

E[ C(p, σ^2) exp(−E[log BF_null(X_γ, σ^2) | M_0]) \,\big|\, |γ| = p ] = 1,   (5.4)

where the inner expectation is with respect to y and the outer is with respect to γ. This condition immediately leads to

E[ E[sBF | M_0] \,\big|\, |γ| = p ] = 1.


To see why this works for BVSR, recall that in the original BVSR model the following prior is used for γ:

p(γ | π) = π^{|γ|} (1 − π)^{N − |γ|},

which may be further decomposed as

p(γ | π) = p(γ, |γ| | π) = p(|γ| | π) p(γ | |γ|, π).

Implicitly we have assumed

p(γ | |γ|, π) = p(γ | |γ|) = \frac{|γ|! (N − |γ|)!}{N!},   (5.5)

where N is the total number of candidate SNPs. Now, by using sBF, we change the prior to

p(γ | |γ|, π) = \frac{|γ|! (N − |γ|)!}{N!} C(|γ|, σ^2) exp(−E[log BF_null(X_γ, σ^2) | M_0]),

which by condition (5.4) is a valid probability mass function. As the information in the data accumulates, the choice of prior no longer influences the posterior, and thus the scaled Bayes factor is consistent, just like the Bayes factor. This is in fact a consequence of the Bernstein-von Mises theorem [van der Vaart, 2000, Chap. 10.2]. Note that C = C(|γ|, σ^2) is chosen instead of C = C(|γ|) since otherwise the integration over σ^2 would be difficult.
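As a quick sanity check on the uniform conditional prior (5.5) (N and |γ| chosen arbitrarily), p(γ | |γ|) is the reciprocal of a binomial coefficient, so the size-|γ| models form a valid uniform distribution:

```python
from math import comb, factorial

N, k = 100, 5  # N candidate SNPs, model size |gamma| = k
p_gamma = factorial(k) * factorial(N - k) / factorial(N)   # (5.5)
assert abs(p_gamma - 1.0 / comb(N, k)) < 1e-15 * p_gamma   # equals 1/C(N, k)
print(comb(N, k) * p_gamma)  # total probability over all size-k models, ~ 1
```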

The expectation given in (5.4) can be computed as follows. Using our asymptotic result, we can write

\frac{1}{C(p, σ^2)} = E[ \prod_{i=1}^{p} \frac{e^{−λ_i/2}}{\sqrt{1 − λ_i}} ],


where the expectation is with respect to the distribution of λ = (λ_1, . . . , λ_p) induced from the uniform conditional distribution of γ given in (5.5). This integral can be approximated by

log E[ \prod_{i=1}^{p} \frac{e^{−λ_i/2}}{\sqrt{1 − λ_i}} ] ≈ p \, log E[ \frac{e^{−λ/2}}{\sqrt{1 − λ}} ],   (5.6)

where the expectation on the r.h.s. is with respect to the mixture distribution of λ_1, . . . , λ_p. In the BVSR model, according to Proposition 2.13, we have

λ_i = \frac{d_i^2}{d_i^2 + 1/σ^2},

where d_i is the i-th singular value of X_γ. Let η = d^2 denote an arbitrary eigenvalue of X_γ^t X_γ. Its distribution is easy to characterize by sampling and in fact changes very little with the choice of |γ|. Let us also define φ = σ^2 (to simplify the following expressions) and write

g(η) = e^{−ηφ/(2(ηφ+1))} \sqrt{1 + ηφ} = \frac{e^{−λ/2}}{\sqrt{1 − λ}}.

By Taylor expansion, we have

E[g(η)] = \sum_{k} \frac{1}{k!} E[(η − E[η])^k] \, g^{(k)}(E[η]).

In our implementation, a fourth-order approximation is used. The derivatives of

142

Page 143: THEORETICAL AND COMPUTATIONAL STUDIES OF BAYESIAN …quan/papers/Thesis.pdf · BIOLOGY & MOLECULAR BIOPHYSICS GRADUATE PROGRAM Signed Aleksandar Milosavljevic, Ph.D., Director of

g are given by

g''(η) = \frac{e^{−ηφ/(2(ηφ+1))}}{4(ηφ+1)^{7/2}} (2φ^2 − η^2 φ^4),

g^{(3)}(η) = \frac{e^{−ηφ/(2(ηφ+1))}}{8(ηφ+1)^{11/2}} (−16φ^3 − 18ηφ^4 + 3η^3 φ^6),

g^{(4)}(η) = \frac{e^{−ηφ/(2(ηφ+1))}}{16(ηφ+1)^{15/2}} (156φ^4 + 320ηφ^5 + 180η^2 φ^6 − 15η^4 φ^8).
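Since these closed forms are easy to get wrong, the sketch below checks g″ against a finite difference and compares the fourth-order Taylor approximation of E[g(η)] with a direct Monte Carlo average; the Gamma draws for η are merely a stand-in for the empirical eigenvalue distribution of X_γ^t X_γ:

```python
import numpy as np

phi = 0.04  # phi = sigma^2 with sigma = 0.2

def g(eta):
    # g(eta) = exp(-eta*phi/(2*(eta*phi+1))) * sqrt(1 + eta*phi)
    u = eta * phi
    return np.exp(-u / (2.0 * (u + 1.0))) * np.sqrt(1.0 + u)

def g_derivs(eta):
    """Closed-form 2nd-4th derivatives of g, as given above."""
    u = eta * phi
    e = np.exp(-u / (2.0 * (u + 1.0)))
    g2 = e / (4 * (u + 1) ** 3.5) * (2 * phi**2 - eta**2 * phi**4)
    g3 = e / (8 * (u + 1) ** 5.5) * (-16 * phi**3 - 18 * eta * phi**4
                                     + 3 * eta**3 * phi**6)
    g4 = e / (16 * (u + 1) ** 7.5) * (156 * phi**4 + 320 * eta * phi**5
                                      + 180 * eta**2 * phi**6 - 15 * eta**4 * phi**8)
    return g2, g3, g4

# fourth-order Taylor approximation of E[g(eta)] around E[eta],
# compared with a direct Monte Carlo average
rng = np.random.default_rng(5)
eta = rng.gamma(shape=5.0, scale=2.0, size=500_000)  # stand-in eigenvalue draws
m = eta.mean()
g2, g3, g4 = g_derivs(m)
taylor = (g(m) + np.mean((eta - m) ** 2) / 2 * g2
               + np.mean((eta - m) ** 3) / 6 * g3
               + np.mean((eta - m) ** 4) / 24 * g4)
print(np.mean(g(eta)), taylor)  # the two agree closely
```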

The performance of our method is checked by simulation. The Height-10K and Height-chr6 datasets described in Chap. 4.4 are used. For each given |γ|, γ is sampled until we have collected 100K singular values of X_γ. Then the ideal scaling factor (the l.h.s. of (5.6)) and the Taylor-series approximation to the r.h.s. of (5.6) with σ = 0.2 are computed. The results, given in Table 5.3, demonstrate that our method is sufficiently accurate.

       Height-10K              Height-chr6
|γ|    A_|γ|     \hat{A}_|γ|   A_|γ|     \hat{A}_|γ|
1      1.505     1.505         1.504     1.505
10     15.06     15.06         15.02     15.03
50     75.08     75.20         74.72     75.00
100    150.3     150.4         148.6     149.6
200    297.5     299.8         293.3     297.2

Table 5.3: Taylor-series approximations for the ideal scaling factors with σ = 0.2. A_|γ| = log E[\prod_i e^{−λ_i/2}/\sqrt{1 − λ_i}] denotes the ideal scaling factor; \hat{A}_|γ| = p \, log E[e^{−λ/2}/\sqrt{1 − λ}], where the expectation is evaluated using the Taylor expansion.


5.3.2 Prediction Properties

A natural question is: what are the differences between the Bayes factor and the scaled Bayes factor in prediction? The answer is complicated and depends on the concrete problem. To gain some insight, let us consider a toy example. Suppose we have two (centered) SNPs, x_1 and x_2. We need to select between two models (σ^2 is given):

(1) M_1 : y = β_1 x_1 + ε, β_1 ∼ N(0, σ^2), ε ∼ MVN(0, τ^{−1} I);

(2) M_2 : y = β_2 x_2 + ε, β_2 ∼ N(0, σ^2), ε ∼ MVN(0, τ^{−1} I).

Let s_i = x_i^t x_i (i = 1, 2) and s_{12} = x_1^t x_2. For the i-th model, the posterior for β_i and the Bayes factor are given by

β_i | M_i, τ ∼ N( \frac{x_i^t y}{s_i + σ^{−2}}, \frac{τ^{−1}}{s_i + σ^{−2}} );

BF_null(x_i) = (σ^2 s_i + 1)^{−1/2} ( 1 − \frac{(x_i^t y)^2}{y^t y (s_i + σ^{−2})} )^{−n/2}.

To avoid confusion, let β = (β_1, β_2) be the true (realized) value of β. Then, by the derivation in Chap. 4.4.3, the mean squared prediction error (MSPE) of an estimate \hat{β} is

MSPE(\hat{β}) = \frac{1}{n} [ (n+1)τ^{−1} + s_1(β_1 − \hat{β}_1)^2 + s_2(β_2 − \hat{β}_2)^2 + 2s_{12}(β_1 − \hat{β}_1)(β_2 − \hat{β}_2) ].   (5.7)

Using y ∼ MVN(Xβ, τ^{−1} I), we can define

Z_i := τ^{1/2} s_i^{−1/2} x_i^t y ∼ N( τ^{1/2} (s_i^{1/2} β_i + s_i^{−1/2} s_{12} β_{3−i}), 1 ),   (5.8)


and we can show

\frac{x_i^t y}{s_i + σ^{−2}} = \frac{τ^{−1/2} s_i^{1/2} Z_i}{s_i + σ^{−2}} ∼ N( \frac{s_i β_i + s_{12} β_{3−i}}{s_i + σ^{−2}}, \frac{τ^{−1} s_i}{(s_i + σ^{−2})^2} ).

Assume n is sufficiently large. Then we have

\frac{x_i^t y}{s_i + σ^{−2}} ≈ τ^{−1/2} s_i^{−1/2} Z_i ∼ N( β_i + s_{12} s_i^{−1} β_{3−i}, τ^{−1} s_i^{−1} ),

and by Theorem 2.7,

2 log BF_i = \frac{s_i σ^2}{s_i σ^2 + 1} Z_i^2 − log(s_i σ^2 + 1),

2 log sBF_i = \frac{s_i σ^2}{s_i σ^2 + 1} (Z_i^2 − 1),

where we have used the abbreviation BF_i = BF_null(x_i). Since the two models have the same dimension, we do not need to compute the ideal scaling factor for sBF, and the posterior probability of each model (denoted by w_i) is given by

w_i^B = \frac{BF_i}{BF_1 + BF_2}, \qquad w_i^S = \frac{sBF_i}{sBF_1 + sBF_2}.

By the model-averaging principle, the Bayesian variable selection estimate, denoted by \hat{β}, is

\hat{β} = ( w_1 \frac{x_1^t y}{s_1 + σ^{−2}}, \; w_2 \frac{x_2^t y}{s_2 + σ^{−2}} ).

Plugging the expressions for \hat{β} and w_i into (5.7) and using (5.8), we can calculate, though only numerically, the expected MSPE for a Bayesian procedure. To obtain some


analytic conclusions, let us set Z_1, Z_2 to their expected values and let

ρ := \frac{s_{12}}{\sqrt{s_1 s_2}}

be the correlation between x_1 and x_2. Then for the three terms on the r.h.s. of (5.7) we have

s_1(β_1 − \hat{β}_1)^2 = s_1 [β_1 − w_1(β_1 + ρ\sqrt{s_2/s_1} β_2)]^2,

s_2(β_2 − \hat{β}_2)^2 = s_2 [β_2 − w_2(β_2 + ρ\sqrt{s_1/s_2} β_1)]^2,

2s_{12}(β_1 − \hat{β}_1)(β_2 − \hat{β}_2) = 2ρ\sqrt{s_1 s_2} [β_1 − w_1(β_1 + ρ\sqrt{s_2/s_1} β_2)][β_2 − w_2(β_2 + ρ\sqrt{s_1/s_2} β_1)].

Using w_1 + w_2 = 1, we obtain

MSPE(\hat{β}) = \frac{1}{n} [ (n+1)τ^{−1} + (1 − ρ^2)( s_1 β_1^2 (1 − w_1)^2 + s_2 β_2^2 w_1^2 − 2ρ\sqrt{s_1 s_2} β_1 β_2 w_1 (1 − w_1) ) ].

By differentiating, we find that the optimal value of w_1 is

w_1^* = \frac{s_1 β_1^2 + ρ\sqrt{s_1 s_2} β_1 β_2}{s_1 β_1^2 + s_2 β_2^2 + 2ρ\sqrt{s_1 s_2} β_1 β_2}.   (5.9)
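A quick numerical check of (5.9), with arbitrary illustrative values (the additive constant (n+1)τ^{−1} and the factor 1/n are dropped since they do not affect the minimizer):

```python
import numpy as np

def mspe_w(w1, s1, s2, rho, b1, b2):
    """MSPE of the model-averaged estimate as a function of w1 (with
    Z_1, Z_2 set to their expected values), up to the additive constant
    (n+1)*tau^{-1} and the factor 1/n."""
    return (1 - rho**2) * (s1 * b1**2 * (1 - w1) ** 2 + s2 * b2**2 * w1**2
                           - 2 * rho * np.sqrt(s1 * s2) * b1 * b2 * w1 * (1 - w1))

def w1_star(s1, s2, rho, b1, b2):
    """Optimal weight from (5.9)."""
    c = rho * np.sqrt(s1 * s2) * b1 * b2
    return (s1 * b1**2 + c) / (s1 * b1**2 + s2 * b2**2 + 2 * c)

s1, s2, rho, b1, b2 = 2.0, 1.0, 0.3, 0.5, -0.4
w_opt = w1_star(s1, s2, rho, b1, b2)
grid = np.linspace(0.0, 1.0, 10_001)
w_grid = grid[np.argmin(mspe_w(grid, s1, s2, rho, b1, b2))]
print(w_opt, w_grid)                   # the grid minimizer matches (5.9)
print(w1_star(s1, s2, rho, b1, 0.0))   # beta_2 = 0 gives w1* = 1
```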

For some special circumstances, enlightening conclusions can be drawn.

(1) β_2 = 0. Then w_1^* = 1, which means that we should simply choose M_1, consistent with our intuition. As the sample size n grows to infinity, both 2 log(BF_1/BF_2) and 2 log(sBF_1/sBF_2) grow at rate O((1 − ρ^2)n). Hence both statistics yield w_1 = 1 when n is sufficiently large.

(2) s_1 β_1^2 = s_2 β_2^2 and s_1 > s_2. Then w_1^* = 1/2. In this case Z_1^2 = Z_2^2, and thus the p-values of the two models are equal. According to the previous discussion, we have sBF_1 > sBF_2. But since 2 log sBF takes the form λ(Z^2 − 1), where λ → 1 as n → ∞, for large sample sizes we have sBF_1 ≈ sBF_2 and thus w_1 = 1/2, which is optimal. However, BF_1/BF_2 → \sqrt{s_2/s_1}. Hence in this case the scaled Bayes factor is more advantageous.

Another useful observation is that if s_1 β_1^2 > s_2 β_2^2, we have w_1^* > 1/2. Therefore, if the true effect sizes β_1 and β_2 are equal in absolute value, we should favor the SNP with the larger variance; if the two SNPs have the same variance, we should favor the SNP with the larger effect size (in absolute value).

5.4 Simulation Studies for Variable Selection

The experiments done in Chap. 4.4 can also be performed using the scaled Bayes factor. The goal of this section is to show that our method is valid and that sBF works at least as well as BF. Permutation is used to compute E[log BF_null | M_0]. The scaling factor is then calibrated by computing C(|γ|, σ^2), using the method described in Chap. 5.3.1. Only the Height-ind and Height-chr6 datasets are used, and we run MCMC with 20K burn-in iterations, 100K sampling iterations and Rao-Blackwellization every 1000 iterations (this was setting (a) in Chap. 4.4).

Figure 5.3 shows the results for heritability estimation with different numbers of permutations. As a larger number of permutations reduces the estimation variance of E[log BF_null | M_0], the heritability is most accurately estimated when we permute 20 times. Moreover, in terms of mean absolute error, when we permute 20 times the scaled Bayes factor produces more accurate results than the Bayes factor (Table 5.4). It may be a little surprising that the result for sBF with only 1 permutation also looks reasonably good. This can be explained using our asymptotic results. Recall from Theorem 2.7 that asymptotically 2 log BF_null =


                           sBF: number of permutations
Mean absolute error   BF       1        2        5        10       20
Height-ind            0.0217   0.0434   0.0343   0.0260   0.0221   0.0203
Height-chr6           0.0271   0.0628   0.0504   0.0338   0.0266   0.0242

Table 5.4: Mean absolute error (MAE) of the heritability estimation using the scaled Bayes factor. The MAE is calculated as the mean of the absolute difference between the true and the estimated heritabilities. See Fig. 5.3 for more related results.

\sum_{i=1}^{p} (λ_i Q_i + log(1 − λ_i)), where under the null the Q_i are i.i.d. χ^2_1. Hence, by direct calculation,

E[ \frac{1}{BF_null} \,\Big|\, M_0 ] = \prod_{i=1}^{p} \frac{1}{\sqrt{1 − λ_i^2}}.
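This identity can be confirmed by Monte Carlo under the asymptotic null form, with arbitrary illustrative λ values:

```python
import numpy as np

# Monte Carlo check of E[1/BF_null | M0] = prod_i 1/sqrt(1 - lam_i^2),
# using 2*log(BF_null) = sum_i (lam_i*Q_i + log(1 - lam_i)), Q_i ~ chi2_1
rng = np.random.default_rng(6)
lams = np.array([0.1, 0.3, 0.5])
Q = rng.chisquare(1, size=(1_000_000, 3))
log_bf = 0.5 * (Q @ lams + np.sum(np.log(1.0 - lams)))
mc = np.exp(-log_bf).mean()
exact = float(np.prod(1.0 / np.sqrt(1.0 - lams**2)))
print(mc, exact)  # the two agree to about three decimal places
```

Unlike BF_null itself, 1/BF_null is bounded above under the null, so its sample mean is well behaved and the Monte Carlo average converges quickly.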

By the reasoning of pseudo-marginal MCMC methods [Andrieu and Roberts, 2009], the MCMC with 1 permutation converges to a stationary distribution induced by a (scaled) Bayes factor (denoted by sBF^(1)) of the form

log sBF^(1) = \sum_{i=1}^{p} \frac{1}{2} (λ_i Q_i − log(1 + λ_i)) + log C(p, σ^2),

where Q_i was defined in Theorem 2.7. Since log(1 + λ_i) ≈ λ_i, the penalty term −log(1 + λ_i) is close to the true penalty term in sBF, which is −λ_i, unlike the original penalty log(1 − λ_i).

The posterior inclusion probabilities can still be improved by Rao-Blackwellization, via the method proposed in (4.9). Again, the results for the scaled Bayes factor are very similar to those for the Bayes factor (Fig. 5.4). For prediction, the scaled Bayes factor is also as good as the Bayes factor (Fig. 5.5). In fact, for the Height-chr6 dataset, the scaled Bayes factor is better when the true heritability is large.


1 permutation 5 permutations 20 permutations BFnull

Figure 5.3: Heritability estimation with sBF for the Height-ind (first row) and Height-chr6 (second row) datasets. The first three columns correspond to different numbers of permutations used in computing the null expectation of BF. The BVSR results for BF are given in the last column for comparison. The grey bars represent 95% posterior intervals.


1 permutation 5 permutations 20 permutations BFnull

Figure 5.4: Calibration of the Rao-Blackwellized posterior inclusion probabilities for sBF. The first row corresponds to the Height-ind dataset and the second to the Height-chr6 dataset. The first three columns correspond to BVSR with sBF and different numbers of permutations for computing the null expectation of BF, and the last column is the BVSR result obtained using BF. The SNPs are divided into 20 bins by RB-PIP and the y-axis represents the proportion of true positives in each bin. The x-axis is the predicted proportion of true positives. The grey bars represent ±2 SD.


Figure 5.5 (columns labelled 1 permutation, 5 permutations, 20 permutations, and BFnull): Relative prediction gain of the scaled Bayes factor for the simulated datasets with heritability ≥ 0.05. The first row corresponds to the Height-ind dataset and the second to the Height-chr6 dataset. The first three columns correspond to BVSR with sBF and different numbers of permutations for computing the null expectation of BF, and the last column is the BVSR result obtained using BF. The RPG is computed by (4.10) and (4.11). In the legends, BVSR stands for the crude regression estimates from fastBVSR; BVSR-RB represents the Rao-Blackwellized estimates from fastBVSR; GCTA stands for the BLUPs of the linear mixed model output by GCTA.


Chapter 6

Summary and Future Directions

6.1 Summary of This Work

Bayesian linear regression has a wide application in genetics. Two important ex-

amples are association testing and variable (causal SNP) selection. This work

studied both the theoretical and computational aspects of Bayesian linear regres-

sion, and provided examples of applications to genome-wide studies. We started

from the characterization of the null distribution of the Bayes factor given in (1.3).

Under the null,

2 log BFnull = ∑_{i=1}^p (λ_i Q_i + log(1 − λ_i)) + o_p(1),        (6.1)

where Q_1, . . . , Q_p are i.i.d. χ²_1 random variables, λ_1, . . . , λ_p are weights between 0
and 1, and o_p(1) is an error term that vanishes in probability. Under the alternative,

assuming some conditions that guarantee the error vanishes, we still have the same

asymptotic form for 2 log BFnull, but Q_1, . . . , Q_p become noncentral chi-squared

random variables. The proof was given in Chap. 2.1. An immediate impact is on

the calculation of p-values for Bayesian methods in GWAS. Due to the burden of


multiple testing, the significance threshold in GWAS is very small, typically
5 × 10⁻⁸. Consequently, the permutation approach to computing p-values associated

with Bayes factors is not feasible. Using the asymptotic result (6.1) such p-values

can be analytically computed. In Chap. 2.2 the behaviour of the Bayes factor

and its associated p-value was discussed. The computation of p-values requires

the evaluation of the distribution function of a weighted sum of independent χ²_1
random variables. To this end, we implemented in C++ a recent polynomial

method of Bausch [2013], which appears to be the most efficient solution so far.

A striking feature of our implementation is that even extremely small p-values

can be accurately computed. Moreover, arbitrary precision is attainable and strict

error bounds are provided. More details were given in Chap. 2.3.2 and 2.3.3.

Simulation studies (see Chap. 2.3.4) showed that the p-values computed using

our asymptotic result have very good calibration, even at the tail, i.e., when the

associated Bayes factor is very large.
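As a crude cross-check of such tail probabilities (this is plain Monte Carlo, not Bausch's polynomial method, and both the weights λ_i and the observed statistic below are hypothetical), one can simulate the asymptotic null distribution directly:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.8, 0.5, 0.3])    # hypothetical weights lambda_i in (0, 1)

# Draw Q_i ~ iid chi-squared(1) and form the asymptotic null statistic
# 2 log BFnull = sum_i (lambda_i * Q_i + log(1 - lambda_i)), as in (6.1).
Q = rng.chisquare(df=1, size=(200_000, lam.size))
draws = (lam * Q + np.log1p(-lam)).sum(axis=1)

observed = 4.0                     # a hypothetical observed value of 2 log BF
p_value = (draws >= observed).mean()
```

Of course, Monte Carlo cannot reach GWAS-scale thresholds such as 5 × 10⁻⁸, which is exactly why the analytic computation matters.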

The expression of the Bayes factor (Eq. (1.3)) contains the term (XᵗX + V⁻¹)⁻¹Xᵗy,
the posterior mean of the regression coefficients (also the maximum a posteriori
estimator). It is often computed via the Cholesky decomposition of (XᵗX + V⁻¹),
which has cubic complexity in p (the number of columns of X) and is thus extremely

slow for large p. A novel iterative method, called ICF (iterative solutions using

complex factorization), another major contribution of this work, was proposed in

Chap. 3. Simulation (Chap. 3.5) shows that, when ICF is applicable, it is much

better than the Cholesky decomposition and other iterative methods such as the Gauss-

Seidel algorithm. The only limitation of ICF is that it relies on the availability

of the Cholesky decomposition of X tX. Fortunately, in the MCMC sampling of

a Bayesian variable selection procedure, such decompositions are often easy to

obtain by efficient updating algorithms. We studied the BVSR (Bayesian variable

selection regression) model proposed by Guan and Stephens [2011] (see Chap. 4.2),


which turned out to fit well with the ICF algorithm. Our new MCMC algorithm

for the inference of the BVSR model was described in great detail in Chap. 4.3. Apart

from the ICF algorithm, the exchange algorithm proposed by Murray et al. [2012]

was employed to bypass the calculation of matrix determinants. Simulation studies

(Chap. 4.4) showed that the new algorithm can efficiently estimate the heritabil-

ity of a quantitative trait and report well-calibrated posterior inclusion probabil-

ities. Furthermore, compared with another popular software package GCTA (see

Chap. 7.5), it has much better performance in prediction (Chap. 4.4.3).
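For context, the baseline computation that ICF accelerates can be sketched as follows: the posterior mean is obtained by a Cholesky solve of (XᵗX + V⁻¹). The dimensions, prior variance, and data below are hypothetical, and an independent normal prior is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 8
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
V_inv = np.eye(p) / 0.25           # prior precision V^{-1}; sigma^2 = 0.25 is hypothetical

# Posterior mean (X^t X + V^{-1})^{-1} X^t y via Cholesky, avoiding an explicit inverse.
A = X.T @ X + V_inv
L = np.linalg.cholesky(A)          # A = L L^t; this step is cubic in p
beta_post = np.linalg.solve(L.T, np.linalg.solve(L, X.T @ y))

# Sanity check against the direct (and slower) formula.
beta_direct = np.linalg.inv(A) @ X.T @ y
```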

The last, but by no means least, novelty of this work is a new statistic called

scaled Bayes factor (Chap. 5). It was motivated by the null distribution of the

Bayes factor given in (6.1). See Chap. 5.1 for its practical and theoretical benefits.

Chap. 5.2 gave an application of sBF to whole-genome single-SNP analysis of a real

GWAS dataset on intraocular pressure. Some known associations were replicated

and a potentially novel finding, PEX14, was described. The scaled Bayes factor

could also be used for variable selection. The method is a little more complicated

since the scaling factor of sBF requires further calibration with the data (see

Chap. 5.3.1). Simulation studies in Chap. 5.4 demonstrated that sBF performs at

least as well as the unscaled Bayes factor.

6.2 Specific Aims for Future Studies

6.2.1 Bayesian Association Tests Based on Haplotype or

Local Ancestry

Background The asymptotic distribution of the Bayes factor in Bayesian lin-

ear regression was studied in Chap. 2 and a software package was provided for


computing the p-values associated with the Bayes factors. A GWAS application

of this result was given in Chap. 5.2. Nevertheless, in that GWAS data analysis

we only considered the single-SNP analysis. By Proposition 2.12, in simple linear

regression, PBF (the p-value associated with the Bayes factor) is asymptotically

equal to PLR (the p-value of the likelihood ratio test). Hence the use and the

importance of PBF were not emphasized. For multi-locus association testing, PBF

may behave very differently from the p-values of the traditional tests (PLR and

PF). The calculation for PBF would also be much harder but could be efficiently

done by our program BACH.

Typical examples of multi-locus testing include the sequence kernel associ-

ation test of Ionita-Laza et al. [2013], the semiparametric regression test with

least-squares kernel machine of Kwee et al. [2008] and some other tests based on

pooling rare variants. For these methods, the test statistic is distributed as a

weighted sum of independent chi-squared random variables and thus our program

BACH could be applied.

Here we consider two new ideas. First, we may perform multi-locus association

testing using haplotypes. Based on linkage disequilibrium, chromosomes

can be divided into smaller haplotype blocks such that in each block, the dis-

tributions of the SNPs are highly dependent and thus the combinations of these

SNPs form specific patterns. The inference of haplotypes from genotypes is called

phasing (see Chap. 1.1.1 for more information). To test the association between

the phenotype and a haplotype block, we build a multi-linear regression model

using both the SNPs in that block and the haplotypes (represented by dummy

variables). From a genetic perspective, such tests are appealing. In single-SNP

analysis, scientists are much more interested in the region surrounding an asso-

ciation signal rather than the signal itself. A haplotype block usually lies in a


specific region, e.g. a gene, and thus represents direct biological interest. From

a statistical perspective, haplotype blocks contain more information than single

SNPs and the total number of tests becomes much smaller, resulting in a much

milder multiple testing correction. Therefore such tests tend to be more powerful

than single-SNP analysis. Another example is the association test based on local

ancestry. For an admixed population, it is often beneficial to infer the ancestral

proportions at each locus. For example, Mexicans form an admixed population with
three ancestral populations: European, Native American, and African. The local

ancestry analysis of Mexican samples revealed a strong selection signal at the MHC

region [Zhou et al., 2016]. Programs for local ancestry inference include ELAI [Xu

and Guan, 2014], RFMix [Maples et al., 2013], LAMP-LD [Baran et al., 2012],

etc. ELAI uses a two-layer model where the lower layer (the more recent layer)

typically contains 10 to 20 clusters of ancestral haplotypes. Hence, just like the

previous example, for each SNP, we may use the genotype together with the local

ancestry (represented by dummy variables) as the regressors and fit a Bayesian

multi-linear regression model. The Bayes factor and its associated p-value are

then used to detect significant associations.

Methods and Materials For the haplotype association testing, we may simply

use the IOP dataset which was analyzed in Chap. 5.2. There are two ways to

define the haplotype blocks. First, we can simply use a fixed block size. This is

also the strategy used by most phasing and local ancestry inference programs, for

example IMPUTE2 and LAMP-LD. Second, we may define the blocks according

to the biological functions or the linkage disequilibrium degrees. For example,

the SNPs located within the same gene (including the upstream and downstream

regions that may include promoters) should be grouped into blocks. We can also

compute the linkage disequilibrium and group the SNPs such that in each block


LD is greater than some threshold. Then for each block, we perform a Bayesian

multi-linear regression and compute the Bayes factor and its associated p-value.
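As a toy illustration of the LD-threshold strategy, consecutive SNPs can be grouped greedily whenever the squared correlation with the previous SNP exceeds a cutoff. This rule and its threshold are hypothetical, and far simpler than the block definitions used by real phasing software:

```python
import numpy as np

def ld_blocks(genotypes, r2_threshold=0.5):
    # Greedy split of consecutive SNP columns into blocks: start a new block
    # whenever r^2 with the previous SNP drops below the threshold.
    corr = np.corrcoef(genotypes, rowvar=False)
    blocks, current = [], [0]
    for j in range(1, genotypes.shape[1]):
        if corr[j - 1, j] ** 2 >= r2_threshold:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks

# Toy data: SNPs 0 and 1 are perfectly correlated, SNP 2 is nearly independent.
x = np.arange(100.0)
g = np.column_stack([x, x, (-1.0) ** np.arange(100)])
blocks = ld_blocks(g)              # -> [[0, 1], [2]]
```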

We write the model as

y = Xβ + Hu + ε,
ε_i | τ  i.i.d.∼  N(0, 1/τ),
β | τ, V_β ∼ MVN(0, τ⁻¹ V_β),
u | τ, V_u ∼ MVN(0, τ⁻¹ V_u),
p(τ) ∝ 1/τ,        (6.2)

where X represents the genotype of the SNPs in the block and H represents

the haplotypes. If there are k different haplotypes in the block, then H has

k − 1 columns (dummy variables). Hij = 1 means the i-th subject has the j-

th haplotype. Note that unlike the traditional dummy variables, the entries of

Hij do not have to be integers. For example, if haplotypes are inferred from

a Bayesian procedure, we may obtain the posterior probability for each possible

haplotype. (Of course, we can also compute the Bayes factor for every possi-

ble inference of H and compute a weighted average over these Bayes factors.)

For the association testing using local ancestry, we need a dataset of admixed

population. One candidate is a dbGaP dataset, Mexican hypertriglyceridemia

study (accession number: phs000618.v1.p1), which contains 4,350 case and con-

trol samples [Weissglas-Volkov et al., 2013]. The dataset contains HDL (high-

density lipoprotein) measurements which can be used as the phenotype for our

analysis. At each SNP locus, we first infer the local ancestry and then use
it to build a multi-linear regression model for association testing. The model

has the same form as (6.2). However, X has only one column and H includes the

dummy variables that represent the local ancestry. The local ancestry inference

program, ELAI [Xu and Guan, 2014], can output the probabilistic estimates for


every ancestral population.
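The construction of the dummy matrix H from such probabilistic calls might look as follows; the posterior probabilities and the number of categories k = 3 are hypothetical. Note the entries are probabilities, not 0/1 indicators:

```python
import numpy as np

# Hypothetical posterior probabilities over k = 3 haplotypes (or local
# ancestries) for three subjects, as reported by a phasing or ELAI-like program.
probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.0, 1.0],
])

# Drop one reference column to obtain k - 1 dummy variables for model (6.2).
H = probs[:, :-1]
```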

An important advantage of Bayesian analysis is the model flexibility. We may

combine the two design matrices in (6.2), X and H , and use V = diag(Vβ,Vu)

to represent the prior covariance matrix. Although in Chap. 2, only two special

choices for V were discussed (independent normal prior and g-prior), V actually

can be any positive definite matrix. For both association testing methods, how

to choose an appropriate V would be a critical problem. When Vu = 0, the test

reduces to the ordinary multi-locus test. Clearly, we would like to specify different

effect sizes for Vβ and Vu. To average over prior uncertainties, we may try different

priors and then average over the Bayes factors.

Simulation studies can be performed to compare the performance of our meth-

ods with other methods, including the non-Bayesian multi-linear regression tests

and other haplotype-based methods, for example the haplotype-sharing method

of Nolte et al. [2007]. Besides, such studies can also be used to compare differ-

ent phasing and local ancestry inference programs. For example, LAMP-LD and

RFMix use a one-layer model to infer the local ancestry and thus for Mexican sam-

ples, we only have two dummy variables. For ELAI, in contrast, we can have more

than ten regressors due to the two-layer modelling (the upper layer only contains

three ancestral populations but the lower layer can contain 10 to 20 haplotype

groups). It would be interesting to know whether the two modelling methods

produce different association testing results. An existing software package for

simulating admixed populations is cosi2 [Shlyakhter et al., 2014].


6.2.2 Application of ICF to Variational Methods for

Variable Selection

Background The BVSR method described in Chap. 4 is quite powerful when

the dataset only has a moderate number of SNPs. But for large datasets that

contain more than 100K SNPs, 1000 of which are causal, BVSR has a clear ten-

dency to underestimate the heritability (see the simulation results for Height-5C

dataset in Chap. 4.4). A real example is the inference for the heritability of height

using the Height dataset. The GCTA estimate for the heritability is 0.44 [Yang

et al., 2010]. In contrast, the BVSR estimate for PVE (proportion of variance

explained) is only 0.15, as reported in Zhou et al. [2013]. Our new implementa-

tion, fastBVSR, has also been tried but the heritability estimate is still around

0.2 (result not shown in this work). One Bayesian approach to this problem

is to use the BSLMM (Bayesian Sparse Linear Mixed Model) model of Zhou et al.

[2013]. BSLMM assumes the following prior for β:

β_i  i.i.d.∼  π N(0, (σ²_a + σ²_b)/τ) + (1 − π) N(0, σ²_b/τ).

For comparison, the prior for β in BVSR could be written as

β_i  i.i.d.∼  π N(0, σ²/τ) + (1 − π) δ_0.

Hence, BSLMM essentially assumes that every SNP has a contribution to the

phenotype but, for most of them, the effect is tiny. The variable selection of

BSLMM aims to identify the SNPs with relatively large effects. The rationale of

BSLMM can be seen as a mixture of BVSR and GCTA. Using this method, the

PVE estimate for the Height dataset is 0.41 [Zhou et al., 2013]. Our algorithm for

computing ridge estimators, ICF, which was described in Chap. 3 could be applied


to BSLMM and a substantial improvement on the running speed is expected. Due

to the similarity between BVSR and BSLMM, the implementation of BSLMM

using ICF is easy.
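The difference between the two priors is easy to see by sampling from them; all hyperparameter values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
N, pi, tau = 100_000, 0.05, 1.0
sigma2 = 1.0                       # BVSR slab variance (hypothetical)
sigma2_a, sigma2_b = 1.0, 1e-3     # BSLMM variances (hypothetical)

# BVSR: spike-and-slab with an exact point mass at zero.
in_slab = rng.random(N) < pi
beta_bvsr = np.where(in_slab, rng.normal(0.0, np.sqrt(sigma2 / tau), N), 0.0)

# BSLMM: every coefficient is nonzero; most come from the tiny-variance component.
large = rng.random(N) < pi
sd = np.where(large, np.sqrt((sigma2_a + sigma2_b) / tau), np.sqrt(sigma2_b / tau))
beta_bslmm = rng.normal(0.0, 1.0, N) * sd
```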

For the Height dataset, the failure of BVSR to produce a heritability estimate

comparable to that of GCTA is not necessarily caused by the model specification.

It might simply be due to the computational limitations. The posterior inference

for BVSR is made via MCMC but the Height dataset contains about 300K SNPs

which implies that the MCMC can hardly converge within a couple of
million iterations. To make things worse, the number of causal SNPs is very

large, probably much greater than 1000. Hence it is entirely likely that there

exist models with large posterior probabilities and model size greater than 1000

but BVSR cannot find them. In fact, for any problems with so many potential

predictors (and “true” predictors), MCMC becomes much less reliable. Note that

BSLMM effectively makes the model size much smaller and, for traits like height,

the heritability estimation largely depends on the estimation of the parameter σ_b.

In Zhou et al. [2013], it is reported that the proportion of variance explained by

the sparse effects is only 0.12 for height.

Another Bayesian strategy to solve this problem that does not use MCMC is

variational inference, which shall be the focus of our second aim. See Jordan

et al. [1999], Bishop [2006], Grimmer [2011] among others for an introduction.

The idea of variational inference was briefly explained in Chap. 4.1. To be more

specific, consider the following approximating form for the posterior distribution

of (β,γ),

q(β, γ) = ∏_{j=1}^N [φ_j f_j(β_j)]^{γ_j} [(1 − φ_j) δ_0(β_j)]^{1−γ_j},        (6.3)

where N is the total number of SNPs and γ is the variable indicating whether


the SNP is included in the model. By integrating out β_j, it can be seen that
φ_j = P(γ_j = 1) is actually the posterior inclusion probability (PIP). f_j is some

distribution to be estimated and we restrict it to be normal. Note that (6.3)

cannot be the true posterior because we have assumed the posterior independence

between the SNPs! The variational inference aims to find an approximation with

form (6.3) that minimizes the Kullback-Leibler divergence between q and the true

posterior,

KL(q(β, γ) ‖ p(β, γ | y)) = ∫ q(β, γ) log [ q(β, γ) / p(β, γ | y) ] dβ dγ.        (6.4)

The search for such an optimal approximating distribution is done by some de-

terministic algorithm and thus requires much less computation than the MCMC

approach. Such algorithms are usually iterative and conceptually resemble the

EM (expectation-maximization) algorithm.
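Note the direction of (6.4): the expectation is taken under the approximation q, so KL(q ‖ p) = Σ q log(q/p), which is asymmetric in its two arguments. A minimal discrete illustration (the two toy distributions are hypothetical):

```python
import numpy as np

def kl(q, p):
    # KL(q || p) = sum_x q(x) * log(q(x) / p(x)); expectation taken under q.
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log(q / p)))

q = [0.5, 0.5]
p = [0.9, 0.1]
```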

Methods A potential application of our ICF algorithm is a very recent varia-

tional method proposed by Huang et al. [2016]. It is based on an earlier variational

algorithm of Carbonetto and Stephens [2012], which was shown in the paper to be

able to produce accurate estimates for the hyperparameters under a wide range of

settings, although the individual PIP estimate was often off. The method of Huang

et al. [2016] could produce more accurate estimates for PIPs and has better con-

vergence properties. Consistency was proved for a number of covariates growing
exponentially with the sample size.


Recall the BVSR model specified by (4.2).

y | γ, β, X, τ ∼ MVN(X_γ β_γ, τ⁻¹ I),
γ_j ∼ Bernoulli(π),
β_j | γ_j = 1, τ ∼ N(0, σ²/τ),
β_j | γ_j = 0 ∼ δ_0.        (6.5)

For the time being, we treat the hyperparameters τ, π, σ² as fixed. Let µ_j and v_j
be the mean and the variance of the normal distribution f_j in (6.3). Carbonetto

and Stephens [2012] proposed to update (µ_j, v_j, φ_j) sequentially for each j in each
iteration. Huang et al. [2016] showed that a better approach is a batch-
wise update: in each iteration, first update v_j, j = 1, . . . , N, then µ_j, j = 1, . . . , N,
and lastly φ_j, j = 1, . . . , N. It turns out that the computational cost mainly
comes from the updating of µ_j, j = 1, . . . , N. Let
µ = (µ_1, . . . , µ_N) and Φ = diag(φ_1, . . . , φ_N). The updating equation for µ, at the

k-th iteration, can be written as

µ = [Φ^(k) XᵗX Φ^(k) + n Φ^(k)(I − Φ^(k)) + σ⁻² Φ^(k)]⁻¹ Φ^(k) Xᵗy.        (6.6)

The complexity of the matrix inversion is O((n ∧ N)³) (n is the sample size).

Huang et al. [2016] used the Woodbury identity to convert the problem into the

inversion of a much smaller matrix. However, when the number of causal SNPs

is very large such inversions could be still very time-consuming. Let Aγ denote

the submatrix (or subvector) of A that corresponds to the SNPs with φj > 0. We

may rewrite (6.6) as

µ_γ = (Φ_γ^(k))⁻¹ [X_γᵗ X_γ + n((Φ_γ^(k))⁻¹ − I) + σ⁻² (Φ_γ^(k))⁻¹]⁻¹ X_γᵗ y.        (6.7)


Since n((Φ_γ^(k))⁻¹ − I) + σ⁻²(Φ_γ^(k))⁻¹ is a diagonal matrix, ICF can be applied. The
Cholesky decomposition of X_γᵗ X_γ can still be obtained by updating. If the initial
values are appropriately chosen such that φ_j^(0), j = 1, . . . , N, are not too far away
from the truth, then by using ICF we also avoid the computation of the entire Gram
matrix XᵗX. In the BVSR model, we put a hyperprior on the hyperparameters

τ, π, σ². To average over the hyperprior distributions and obtain the posterior

inference for the hyperparameters, we can use the importance sampling approach

proposed by Carbonetto and Stephens [2012].
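The algebraic equivalence of (6.6) and (6.7) is easy to verify numerically on a toy problem; the data here are random, all φ_j are strictly positive, and ordinary dense solves stand in for ICF:

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 40, 6
X = rng.standard_normal((n, N))
y = rng.standard_normal(n)
phi = rng.uniform(0.05, 0.95, N)   # hypothetical PIPs, all strictly positive
Phi = np.diag(phi)
sigma2 = 0.5                       # hypothetical sigma^2

# Form (6.6): one N x N system built from Phi.
A = Phi @ X.T @ X @ Phi + n * Phi @ (np.eye(N) - Phi) + Phi / sigma2
mu_66 = np.linalg.solve(A, Phi @ X.T @ y)

# Form (6.7): the diagonal factors pulled out, which is the shape ICF exploits.
D = n * (1.0 / phi - 1.0) + (1.0 / phi) / sigma2
mu_67 = (1.0 / phi) * np.linalg.solve(X.T @ X + np.diag(D), X.T @ y)
```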

This novel algorithm could be extremely useful for problems like the heritability

estimation of the Height dataset, where we have a very large number of causal

SNPs with only tiny effects. For the original algorithm, the PIPs of many causal

SNPs could be set to zero since they are too small. However, if we sum up the

effects of these SNPs, the total effect is not negligible at all. By using ICF, we

can accurately estimate these PIPs within an acceptable computational time. As

shown in Chap. 3.5, when there are more than 1000 SNPs in the model, the speed

advantage of ICF over all the other methods is substantial.

6.2.3 Extension of This Work to Categorical Phenotypes

Background In genetic studies, very often the phenotype is the case-control sta-

tus and then the Bayesian linear regression model is not directly applicable. Hence

the extension of our results to categorical phenotypes would be of high practical

importance. The standard approach to analyzing categorical phenotypes by re-

gression is to introduce a logit or a probit link function. For binary phenotypes,


this means

logit P(y_i = 1 | β) = log [ P(y_i = 1 | β) / P(y_i = 0 | β) ] = x_(i)ᵗ β,
P(y_i = 1 | β) = Φ(x_(i)ᵗ β),        (6.8)

where x_(i) = (1, x_i1, . . . , x_ip), β = (β_0, β_1, . . . , β_p), and Φ denotes the cumulative
distribution function of the standard normal distribution. (Note that we cannot assume
β_0 = 0.) Unfortunately, the inference for either model is not easy due to the lack

of conjugate prior. In particular, β cannot be integrated out in the expression

for the marginal likelihood. Take the logistic regression model as an example. Its

marginal likelihood is given by

p(y) = ∫ ∏_{i=1}^n ( e^{x_(i)ᵗβ} / (1 + e^{x_(i)ᵗβ}) )^{y_i} ( 1 / (1 + e^{x_(i)ᵗβ}) )^{1−y_i} p(β) dβ.

The model (6.8) has another more convenient formulation. Introduce latent vari-

ables z_1, . . . , z_n such that y_i = 1{z_i > 0}. Then the logistic regression model is equiv-

alent to stating that

z_i − x_(i)ᵗβ ∼ Logistic

and the probit model is equivalent to

z_i − x_(i)ᵗβ ∼ N(0, 1).
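The logistic version of this latent-variable formulation can be checked by simulation; the value of the linear predictor below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
eta = 0.7                              # hypothetical linear predictor x'beta

# y = 1{z > 0} with z - x'beta ~ Logistic(0, 1) ...
z = eta + rng.logistic(size=1_000_000)
p_latent = (z > 0).mean()

# ... should match the logistic regression probability.
p_logit = 1.0 / (1.0 + np.exp(-eta))
```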

Methods To extend our results on the null distribution of Bayes factors to the

binary phenotypes, we first need to work out a closed-form expression for the Bayes

factor. One solution is to use the Laplace approximation [Kass and Raftery, 1995],


which uses Taylor expansion to approximate the marginal likelihood by

∫ p(y | β) p(β) dβ ≈ p(y | β̂) p(β̂) (2π)^{(p+1)/2} |Σ|^{−1/2},

where β̂ is the MAP (maximum a posteriori) estimator, Σ = −D²l(β̂), and
l(β) = log p(y | β) p(β). However, the distribution of the corresponding Bayes fac-

tor is difficult to characterize. Another asymptotic approach taken by Wakefield

[2009] makes use of the asymptotic normality of the maximum likelihood esti-

mator and computes the Bayes factor as a function of the prior and the Wald

test statistic. Nevertheless, this approach defeats the purpose of computing the

p-value associated with the Bayes factor since it is always equal to the p-value of

the Wald test. For the probit model, the marginal probability P(y_i = 1) can be
computed exactly in closed form. Let the prior for β be MVN(µ_β, V_β). Then,
since z_i | β ∼ N(x_(i)ᵗβ, 1), we have

z_i ∼ N(x_(i)ᵗµ_β, 1 + x_(i)ᵗ V_β x_(i)),

P(y_i = 1) = P(z_i > 0) = Φ( x_(i)ᵗµ_β / √(1 + x_(i)ᵗ V_β x_(i)) ).

The computation of the Bayes factor requires a numerical integral of the Bayes

factor for the linear regression model (integrating out z). There is no closed-form

expression unless all the observations are exactly independent, i.e., x_(i)ᵗx_(j) = 0 for
all i ≠ j. Overall, the null distribution of the Bayes factor for binary phenotypes

is a very challenging problem and some novel method would be necessary.
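The closed-form probit marginal above can at least be verified by Monte Carlo, averaging Φ(x_(i)ᵗβ) over the prior; the covariate vector and prior parameters here are hypothetical:

```python
import math
import numpy as np

rng = np.random.default_rng(5)
x = np.array([1.0, 0.3, -1.2])          # hypothetical (1, x_i1, x_i2)
mu_beta = np.array([0.2, 0.5, -0.1])    # hypothetical prior mean
V_beta = np.diag([0.5, 0.4, 0.3])       # hypothetical prior covariance

def Phi(t):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Closed form: P(y_i = 1) = Phi(x'mu / sqrt(1 + x'Vx)).
p_exact = Phi(x @ mu_beta / math.sqrt(1.0 + x @ V_beta @ x))

# Monte Carlo: average Phi(x'beta) over beta ~ MVN(mu, V).
betas = rng.multivariate_normal(mu_beta, V_beta, size=200_000)
p_mc = float(np.mean([Phi(t) for t in betas @ x]))
```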

Extending BVSR to the binary phenotypes is an easier task. For the probit

model one method has already been given in Guan and Stephens [2011]. Using

the latent variable model, compared with the BVSR model for quantitative phe-

notypes, we only need an additional sampling of z in each MCMC iteration. This


method appears to be a variant of the Gibbs sampler of Albert and Chib [1993],

which may be viewed as the default choice for a Bayesian analysis with binary

outcomes. For the logistic regression model, similar algorithms could be devel-

oped, using the t distribution approximation to the logistic distribution proposed

by Albert and Chib [1993]. Nonetheless, the additional update of z in MCMC im-

plies that the mixing of the Markov chain is more difficult for binary phenotypes

than for quantitative ones. Hence to achieve convergence or accurate posterior

inferences, for binary phenotypes MCMC needs to be run for more iterations. It

remains a challenge to develop better MCMC algorithms for variable selection

with binary phenotypes.


Chapter 7

Appendices

7.1 Linear Algebra Results

The readers are assumed to have an elementary knowledge of linear algebra. No-

tations that may be confusing are explained when first used and could also be

found at the beginning of this work. The goal of this section is to introduce some

known linear algebra results that will be used in the development of our theory.

All vectors and matrices are assumed to be real unless otherwise stated.

7.1.1 Some Matrix Identities

Lemma 7.1. (Block matrix inversion formula) Let

A = [ A_11  A_12
      A_21  A_22 ]

be an invertible partitioned matrix such that both A_11 and A_22 are square. If both
A_11 and S = A_22 − A_21 A_11⁻¹ A_12 are non-singular, then

A⁻¹ = [ A_11⁻¹ + A_11⁻¹ A_12 S⁻¹ A_21 A_11⁻¹    −A_11⁻¹ A_12 S⁻¹
        −S⁻¹ A_21 A_11⁻¹                        S⁻¹            ].


S is called the Schur complement of A_11 [Hogben, 2006, Part I, Chap. 10].
The formula can be proved by directly checking that AA⁻¹ = I. By symmetry,
A_11⁻¹ + A_11⁻¹ A_12 S⁻¹ A_21 A_11⁻¹ must be equal to the inverse of the Schur complement
of A_22, provided that it exists. This is known as the Woodbury matrix identity.

Lemma 7.2. (Woodbury matrix identity) If both A and S are square matrices,
then

(A + USV)⁻¹ = A⁻¹ − A⁻¹ U (S⁻¹ + V A⁻¹ U)⁻¹ V A⁻¹,

provided that U and V have conformable sizes and the inverses involved ex-
ist [Harville, 1997, Chap. 18].

Another way to prove this is to check that the product of the l.h.s and the r.h.s

is just the identity matrix. See also Press [2007, Chap. 2] for more information.
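A quick numerical check of the identity (a diagonal A is used so that A⁻¹ is trivial; all matrices are random):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, n))   # diagonal, so its inverse is elementwise
U = rng.standard_normal((n, m))
S = np.eye(m)
V = rng.standard_normal((m, n))

A_inv = np.diag(1.0 / np.diag(A))

# Left: direct inversion of the n x n matrix. Right: Woodbury, inverting m x m.
lhs = np.linalg.inv(A + U @ S @ V)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(S) + V @ A_inv @ U) @ V @ A_inv
```

The point, as in the variational update above, is that only an m × m matrix is inverted on the right-hand side.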

Lemma 7.3. Let A be the partitioned matrix given in Lemma 7.1. If both A_11
and A_22 are non-singular, then [Hogben, 2006, Part I, Chap. 10]

|A| = |A_11| · |A_22 − A_21 A_11⁻¹ A_12| = |A_22| · |A_11 − A_12 A_22⁻¹ A_21|.

Proof. The first equality can be proved by using the following decomposition:

[ I             0    [ A_11  A_12    [ I   −A_11⁻¹ A_12      [ A_11   0
  −A_21 A_11⁻¹  I ]    A_21  A_22 ]    0    I            ] =   0      A_22 − A_21 A_11⁻¹ A_12 ].

The determinant of the l.h.s. is simply |A| and the determinant of the r.h.s. is
|A_11| · |A_22 − A_21 A_11⁻¹ A_12|. The second equality can be checked similarly.


Lemma 7.4. (Sylvester’s determinant formula) If A is an n × m matrix and B
is an m × n matrix, then

|I_n + AB| = |I_m + BA|,

where | · | denotes the determinant and I_n denotes the n × n identity matrix [Sylvester,
1851].

Proof. Consider the partitioned matrix

M = [ I_n   A
      −B    I_m ].

By Lemma 7.3, |M| = |I_n + AB| = |I_m + BA|.
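A numerical illustration of Sylvester's formula with n = 4 and m = 2 (random matrices):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((2, 4))

det_n = np.linalg.det(np.eye(4) + A @ B)   # determinant of a 4 x 4 matrix
det_m = np.linalg.det(np.eye(2) + B @ A)   # determinant of a 2 x 2 matrix
```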

Lemma 7.5. Let A + iB be a complex square matrix where both A and B are
real. If A and (A + B A⁻¹ B) are invertible, then

(A + iB)⁻¹ = (A + B A⁻¹ B)⁻¹ − i A⁻¹ B (A + B A⁻¹ B)⁻¹.

This can be proved by calculating the real and the imaginary parts of
(A + iB)(A + iB)⁻¹.
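A numerical check of Lemma 7.5 (A is taken close to the identity so that the required inverses exist; B is random):

```python
import numpy as np

rng = np.random.default_rng(8)
p = 4
A = np.eye(p) + 0.1 * rng.standard_normal((p, p))
B = rng.standard_normal((p, p))

A_inv = np.linalg.inv(A)
C = np.linalg.inv(A + B @ A_inv @ B)       # (A + B A^{-1} B)^{-1}

# Complex inverse on the left, the lemma's real/imaginary split on the right.
lhs = np.linalg.inv(A + 1j * B)
rhs = C - 1j * A_inv @ B @ C
```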

7.1.2 Singular Value Decomposition and Pseudoinverse

Any matrix, real or complex, admits a factorization called singular value decompo-

sition (SVD). Before we state the form of SVD, we first review some terminologies

for complex matrices. We use M ∗ to denote the conjugate transpose of matrix

M .

Definition 7.1. (a) A complex square matrix M is said to be Hermitian if M =

M ∗.

(b) A complex square matrix M is said to be skew-Hermitian if M = −M ∗.


(c) A complex square matrix M is said to be unitary if MM ∗ = I.

Theorem 7.6. (Singular value decomposition) Let M be an arbitrary n× p

complex matrix. Then there exist two unitary matrices U , V and a “rectangular

diagonal” matrix D of size n× p such that

M = U D V*,    D = [ diag(d_1, . . . , d_r)   0
                     0                        0 ],

where diag(d_1, . . . , d_r) denotes a diagonal matrix with real diagonal elements
d_1, . . . , d_r > 0. The singular values d_1, . . . , d_r are determined uniquely up to
permutation and r is equal to the rank of M.

See Allaire et al. [2008, Chap. 2.7], Serre [2002, Chap. 7.7], Harville [1997,

Chap. 21.12], etc. for proofs and more information. SVD can be used to define

the pseudoinverse for any matrix.

Definition 7.2. (Moore-Penrose pseudoinverse) The Moore-Penrose pseudoinverse of any complex matrix M with SVD M = UDV* is denoted by M^+ and defined as [Allaire et al., 2008, Chap. 2.7]

M^+ := V D^+ U*,   D^+ = [ diag(d_1^{−1}, . . . , d_r^{−1})   0
                           0                                  0 ].

Proposition 7.7. The Moore-Penrose pseudoinverse M^+ has the following properties.

(a) MM+M = M ;

(b) M+MM+ = M+;

(c) MM+ = (MM+)∗ ;


(d) M+M = (M+M)∗ ;

(e) (M+)+ = M ;

(f) M+ = (M ∗M )+M ∗;

(g) M+ = M ∗(MM ∗)+;

(h) if M is invertible, M+ = M−1.
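Properties (a)-(d) are easy to confirm numerically; the sketch below (not from the thesis) uses NumPy's np.linalg.pinv, which computes the Moore-Penrose pseudoinverse via the SVD:

```python
import numpy as np

# A deliberately rank-deficient 6x4 matrix of rank 2.
rng = np.random.default_rng(2)
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))
Mp = np.linalg.pinv(M)   # Moore-Penrose pseudoinverse via SVD

prop_a = np.allclose(M @ Mp @ M, M)        # M M+ M = M
prop_b = np.allclose(Mp @ M @ Mp, Mp)      # M+ M M+ = M+
prop_c = np.allclose(M @ Mp, (M @ Mp).T)   # M M+ Hermitian (symmetric, real case)
prop_d = np.allclose(Mp @ M, (Mp @ M).T)   # M+ M Hermitian
```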

Proof. We only prove parts (a) and (h); the rest can be checked easily in similar ways.

(a) MM^+M = UDV*VD^+U*UDV* = UDD^+DV* = UDV* = M.

(h) If M is invertible, then D = diag(d_1, . . . , d_p) and thus MM^+ = UDD^+U* = UU* = I. Since the matrix inverse is unique, we must have M^+ = M^{−1}.

These properties explain why M^+ is called a pseudoinverse. In fact, it can be shown that the matrix M^+ satisfying properties (a) to (d) is unique. See Serre [2002, Chap. 8.4] and Harville [1997, Chap. 20] for proofs.

7.1.3 Eigenvalues, Eigenvectors and Eigendecomposition

We first review the definitions of eigenvalue and eigenvector.

Definition 7.3. Let M be a p × p complex matrix. λ is called an eigenvalue of

M if there exists a nonzero vector u such that Mu = λu. u is then called the

corresponding eigenvector.

Immediately we have the following lemma.

Lemma 7.8. Let A be an n × p matrix and B be a p × n matrix. If λ ≠ 0 is an eigenvalue of AB, then it is also an eigenvalue of BA.


Proof. By definition, there exists a nonzero vector u such that ABu = λu. Multiplying both sides by B, we get

BA(Bu) = λ(Bu).

We claim that Bu ≠ 0, i.e., Bu is an eigenvector of BA with corresponding eigenvalue λ. We prove this by contradiction. If Bu = 0, we have ABu = 0 = λu. However, since λ ≠ 0, this would imply u = 0, which gives the contradiction.

Clearly we can always assume the eigenvector is normalized so that ||u||2 = 1.

Using eigenvalues and eigenvectors, some matrices admit a factorization which

is usually referred to as spectral decomposition or eigendecomposition. For the

purposes of this thesis, we only focus on a special class of matrices called normal

matrices.

Definition 7.4. A complex square matrix M is said to be normal if MM ∗ =

M ∗M .

Clearly, unitary matrices, Hermitian matrices and skew-Hermitian matrices

are normal. For real matrices, they correspond to orthogonal matrices, symmetric

matrices and skew-symmetric matrices respectively.

Theorem 7.9. (Spectral decomposition for normal matrices) If a square matrix M is normal, it admits the factorization

M = UΛU*,

where U is unitary and Λ = diag(λ_1, . . . , λ_p). Each (λ_i, u_i) is an eigenvalue-eigenvector pair of the matrix M (u_i is the i-th column of U), but λ_1, . . . , λ_p are not necessarily distinct.

The set (in fact, multiset) {λ_1, . . . , λ_p} is called the spectrum of M. It is

unique up to permutation. The number of times that an eigenvalue appears in the

spectrum is called its multiplicity. To see that (λi,ui) is an eigenvalue-eigenvector

pair, notice that the decomposition is equivalent to MU = UΛ. For a formal

proof, see Trefethen and Bau III [1997, Chap. 24] or Serre [2002, Chap. 3]. When

all the eigenvalues are nonnegative, by convention we assume they are ordered so

that λ1 ≥ · · · ≥ λp.

Consider a normal matrix M = XX*. Let the SVD of X be UDV*. Then the SVD of M is

M = UDD*U*. (7.1)

Since U is unitary and DD* is diagonal, this is also the spectral decomposition of M. Hence the nonzero singular values and the nonzero eigenvalues of M coincide. Similarly one can show that if M = −XX*, the nonzero singular values of M are equal to the absolute values of the eigenvalues of M. However, for a general square matrix, its singular values are not equal, in absolute value, to its eigenvalues.

Proposition 7.10. For a p × p normal matrix M with eigenvalues λ_1, . . . , λ_p, counted with multiplicity [Trefethen and Bau III, 1997, Chap. 24],

|M| = ∏_{i=1}^p λ_i,   tr(M) = Σ_{i=1}^p λ_i,   rank(M) = Σ_{i=1}^p 1_{(0,∞)}(|λ_i|),

where tr(M) = Σ_{i=1}^p M_ii denotes the trace of M, and 1_{(0,∞)}(|λ_i|) is the indicator function that equals 1 if λ_i ≠ 0 and 0 otherwise.

Proof. Let UΛU* be the spectral decomposition of M. Then |M| = |UΛU*| = |Λ| = ∏_{i=1}^p λ_i. Similarly, tr(M) = tr(UΛU*) = tr(U*UΛ) = tr(Λ) (the trace is invariant under cyclic permutations), and rank(M) = rank(UΛU*) = rank(Λ).

The eigenvalues of a real matrix are not necessarily real. However, when the

matrix is Hermitian, the eigenvalues are always real. Furthermore, if the matrix

is positive definite, the eigenvalues are positive. The properties of the eigenvalues

of some special matrices are summarized in the following proposition.

Proposition 7.11. Let M be a p× p matrix and λ be an arbitrary eigenvalue of

it.

(a) If M is unitary, |λ| = 1.

(b) If M is idempotent, i.e. MM = M , λ is either 0 or 1.

(c) If M is Hermitian, λ is real.

(d) If M is skew-Hermitian, λ is either 0 or purely imaginary; if moreover M is real (skew-symmetric), the conjugate λ̄ is also an eigenvalue of M. Hence a real skew-symmetric M has at least one zero eigenvalue if p is odd.

(e) If M is positive definite, λ > 0.

(f) If M is positive semi-definite, λ ≥ 0.

Proof. Let u be the corresponding eigenvector for λ so that Mu = λu.

(a) ||λu||_2^2 = |λ|^2 u*u = u*M*Mu = u*u. Thus |λ| = 1.

(b) Mu = M^2u = λMu = λ^2u = λu. Since u is nonzero, λ = 0 or 1.

(c) On one hand, (Mu)*u = u*M*u = u*Mu = λu*u. On the other, (Mu)*u = (λu)*u = λ̄u*u. Therefore (λ − λ̄)u*u = 0. Because u is nonzero, we must have λ = λ̄, i.e., λ is real.


(d) Using the same argument, we obtain (λ + λ̄)u*u = 0, which implies the real part of λ is 0. When M is real, by writing u = ℜ(u) + iℑ(u) it is easy to show that Mū = λ̄ū.

(e) Since M is positive definite, u*Mu = λu*u > 0. Thus λ > 0.

(f) The same argument shows λ ≥ 0.

Finally, we point out that the spectral decomposition provides a simple approach to calculating the nth root of a square matrix.

Lemma 7.12. Let M be a positive semi-definite matrix with spectral decomposition UΛU*. Then its square root is given by M^{1/2} = UΛ^{1/2}U*.

It is easy to check that M = M^{1/2}M^{1/2}. The proof of uniqueness can be found in Harville [1997, Chap. 21.9] (for real matrices) and Serre [2002, Chap. 7.1] (for complex matrices).
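Lemma 7.12 translates directly into code; a minimal sketch with NumPy's eigh (the clip against tiny negative round-off eigenvalues is my own guard, not part of the lemma):

```python
import numpy as np

# Square root of a PSD matrix via its spectral decomposition (Lemma 7.12).
rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))
M = X @ X.T                        # positive semi-definite, rank 3
lam, U = np.linalg.eigh(M)         # spectral decomposition of a symmetric matrix
lam = np.clip(lam, 0.0, None)      # tiny negative eigenvalues are round-off noise
M_half = U @ np.diag(np.sqrt(lam)) @ U.T
```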

7.1.4 Orthogonal Projection Matrices

Let X be an n × p real matrix. Define

H_X := X(X^tX)^+ X^t. (7.2)

By the pseudoinverse introduced previously, we immediately have

Lemma 7.13. Let X = UDV^t be the singular value decomposition of X. Then

H_X = UDD^+U^t,

where DD^+ = diag(1, . . . , 1, 0, . . . , 0). The number of 1's is equal to the rank of X.

H_X is called an orthogonal projection matrix. In traditional linear regression it is also called the hat matrix, since it maps the response vector to its fitted values under the method of least squares.

Proposition 7.14. Assume X is an n × p matrix with n ≥ p and rank(X) = p. For any n-vector y, we have

β̂ := (X^tX)^{−1}X^ty = arg min_{β∈R^p} ||y − Xβ||_2,

where ||·||_2 denotes the ℓ_2-norm.

Proof. Since ||y − Xβ||_2^2 = (y − Xβ)^t(y − Xβ), we have

∂||y − Xβ||_2^2 / ∂β = 2X^tXβ − 2X^ty.

Setting the derivative equal to 0, we obtain the expression for β̂. Since the second-derivative matrix 2X^tX is positive definite, β̂ indeed minimizes ||y − Xβ||_2. (For matrix differentiation, see for example Mardia et al. [1980, Appx. A].)
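Proposition 7.14 and the hat matrix of (7.2) are easy to verify numerically; a sketch on simulated data (the seed and sizes are arbitrary):

```python
import numpy as np

# Normal-equations solution vs. library least squares, and fitted values via H_X.
rng = np.random.default_rng(4)
n, p = 20, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # (X^t X)^{-1} X^t y
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # reference solution

H = X @ np.linalg.pinv(X.T @ X) @ X.T              # hat matrix H_X of (7.2)
fitted = H @ y                                     # projection of y onto col(X)
```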

Proposition 7.14 implies that H_Xy is the projection of the vector y onto the column space of X. In fact, when n < p or X is rank deficient, the claim still holds, since β̂ = (X^tX)^+X^ty satisfies X^tXβ̂ = X^ty, though the solution is no longer unique. The next proposition gives some important properties of H_X. For more information, see for example Harville [1997, Chap. 12] and Hogben [2006, Part I, Chap. 5].

Proposition 7.15. Let HX be a matrix as defined in (7.2). Then,


(a) HX is symmetric;

(b) HX is idempotent, i.e., HX^2 = HX;

(c) HXX = X;

(d) rank(HX) = rank(X);

(e) tr(HX) = rank(HX);

(f) I −HX is symmetric and idempotent;

(g) rank(I −HX) = tr(I −HX) = n− rank(HX);

(h) I −HX is an orthogonal projection matrix.

Proof. (a) The symmetry follows from the definition of HX and Proposition 7.7

(d).

(b) By Proposition 7.7 (b), HX^2 = [X(X tX)+X t][X(X tX)+X t] = HX.

(c) By Proposition 7.7 (a) and (f), HXX = X(X tX)+X tX = XX+X = X.

(d) By the definition of rank, it is equivalent to prove that the column spaces

of X and HX are identical. First, let Xv (v ∈ Rp) be a vector in the

column space of X. By part (c), Xv = HX(Xv), which implies it is also

in the column space of HX . Second, let HXv be a vector in the column

space of HX . By definition, HXv = X[(X tX)+X tv]. Hence it is also in the

column space of X. Combining the two arguments, we arrive at the conclusion

rank(HX) = rank(X).

(e) By Proposition 7.10 and Proposition 7.11 (b), HX has rank(HX) eigenvalues

equal to 1 (counted with multiplicity) and n − rank(HX) zero eigenvalues.

Thus for HX , the trace is equal to the rank.


(f) Symmetry of I − HX is self-evident. By part (b), (I − HX)(I − HX) = I + HX^2 − 2HX = I − HX.

(g) Both HX and I −HX admit the spectral decomposition. Let the spectral

decomposition of HX be UΛU t. Since UU t = I, I −HX = U(I − Λ)U t.

The result then follows.

(h) By part (f) and Proposition 7.7 (a), I −HX can be written in the following

form that defines the orthogonal projection matrix,

I −HX = (I −HX)[(I −HX)t(I −HX)]+(I −HX)t.

In fact, any symmetric and idempotent matrix is an orthogonal projection

matrix.

Clearly, any n-vector y can be decomposed to y = HXy + (I −HX)y, i.e.,

the projection of y onto the column space of X and the projection of y onto

the orthogonal complement of that space. In linear regression, (I − HX)y is the vector of residuals from the least-squares fit.

Finally, we prove an equality concerning projection matrices that will be very useful in restricted maximum likelihood inference.

Lemma 7.16. Let X be a full-rank n × p matrix and L be a full-rank n × (n − p) matrix such that L^tX = 0. For any positive definite n × n matrix V, we have

L(L^tVL)^{−1}L^t = V^{−1} − V^{−1}X(X^tV^{−1}X)^{−1}X^tV^{−1}.

If X does not have full rank, the equality still holds with (X^tV^{−1}X)^{−1} replaced by (X^tV^{−1}X)^+.


Proof. To prove this, just notice that

H_{V^{1/2}L} = V^{1/2}L(L^tVL)^{−1}L^tV^{1/2};
H_{V^{−1/2}X} = V^{−1/2}X(X^tV^{−1}X)^{−1}X^tV^{−1/2}.

Thus the lemma can be rewritten as H_{V^{1/2}L} = I − H_{V^{−1/2}X}. Clearly, the two matrices V^{1/2}L and V^{−1/2}X are orthogonal in the sense that (V^{1/2}L)^tV^{−1/2}X = 0. Since they have rank n − p and p respectively, their column spaces must be orthogonal complements of each other in the vector space R^n.
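Lemma 7.16 can also be checked numerically; the sketch below (illustrative only) builds L from the full SVD of X so that L^tX = 0:

```python
import numpy as np

# Check L (L^t V L)^{-1} L^t = V^{-1} - V^{-1} X (X^t V^{-1} X)^{-1} X^t V^{-1}.
rng = np.random.default_rng(5)
n, p = 8, 3
X = rng.standard_normal((n, p))
U = np.linalg.svd(X, full_matrices=True)[0]
L = U[:, p:]                          # n x (n - p), columns orthogonal to col(X)

S = rng.standard_normal((n, n))
V = S @ S.T + n * np.eye(n)           # positive definite
Vi = np.linalg.inv(V)

lhs = L @ np.linalg.inv(L.T @ V @ L) @ L.T
rhs = Vi - Vi @ X @ np.linalg.inv(X.T @ Vi @ X) @ X.T @ Vi
```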

7.2 Bayesian Linear Regression

Consider the linear regression model

y = Xβ + ε (7.3)

where y = (y_1, . . . , y_n) is the response vector, X is an n × p design matrix, and β is a p-vector of regression coefficients. The errors ε_1, . . . , ε_n are assumed to be i.i.d. normal random variables with mean 0 and variance τ^{−1}, i.e.,

ε|τ ∼ MVN(0, τ−1I).

Due to the normal error assumption, (7.3) is also referred to as the normal linear

model, and could be equivalently written as

y|β, τ ∼ MVN(Xβ, τ−1I). (7.4)


For a full exposition in book form of the Bayesian treatment of the normal linear

model, see, for example, O’Hagan and Forster [2004, Chap. 9], Hoff [2009, Chap. 9],

Koch [2007, Chap. 4], and Gelman et al. [2014, Chap. 14]. For readers who are

not familiar with Bayesian methodology, more introductory material can also be

found in these books.

7.2.1 Posterior Distributions for the Conjugate Priors

Throughout this thesis, only the family of conjugate priors is considered. We have

two parameters in 7.4, β and τ , and β is usually of direct interest. The conjugate

prior for β is multivariate normal distribution. The error precision, τ , can be

treated as either known or unknown.

Known error variance If τ is known, the prior for model (7.4) is simply

β|τ, V ∼ MVN(0, τ^{−1}V), (7.5)

where V is a positive definite matrix. More generally, we could specify a nonzero prior mean for β; however, this is rarely done in practice. (See Jeffreys [1961, Chap. 5] for further discussion.) The posterior distribution of β is still normal, since

f(β|y, τ, V) ∝ f(y|β, τ) f(β|τ, V)

= τ^{(n+p)/2} / (2π)^{(n+p)/2} · |V|^{−1/2} exp{−(τ/2)[(y − Xβ)^t(y − Xβ) + β^tV^{−1}β]}

∝ exp{−(τ/2)[β^t(X^tX + V^{−1})β − 2β^tX^ty]}.

This is the normal density kernel corresponding to the posterior distribution

β|y, τ, V ∼ MVN((X^tX + V^{−1})^{−1}X^ty, τ^{−1}(X^tX + V^{−1})^{−1}).
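The posterior above is cheap to compute; a minimal sketch on simulated data (the prior V = I, the precision τ, and all data settings are illustrative choices, not from the thesis):

```python
import numpy as np

# Posterior of beta under prior (7.5) with known precision tau.
rng = np.random.default_rng(6)
n, p, tau = 50, 3, 2.0
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -0.5, 0.0])
y = X @ beta_true + rng.standard_normal(n) / np.sqrt(tau)

V = np.eye(p)                              # prior covariance is V / tau
A = X.T @ X + np.linalg.inv(V)
post_mean = np.linalg.solve(A, X.T @ y)    # (X^t X + V^{-1})^{-1} X^t y
post_cov = np.linalg.inv(A) / tau          # tau^{-1} (X^t X + V^{-1})^{-1}
```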


Unknown error variance If τ is unknown, we consider the following normal-inverse-gamma conjugate prior:

β|τ, V ∼ MVN(0, τ^{−1}V),
τ|κ_1, κ_2 ∼ Gamma(κ_1/2, κ_2/2). (7.6)

The gamma distribution is in the shape-rate parameterization. This is called the normal-inverse-gamma prior since the prior for the error variance τ^{−1} is an inverse-gamma distribution. Thus the joint prior density is given by

f(β, τ) = (κ_2/2)^{κ_1/2} / [(2π)^{p/2}Γ(κ_1/2)] · |V|^{−1/2} τ^{(p+κ_1−2)/2} exp{−(κ_2 + β^tV^{−1}β)τ/2}.

Under the prior (7.6), we have

y|τ, V ∼ MVN(0, τ^{−1}(XVX^t + I)),

which leads to the marginal likelihood (after integrating out β),

f(y|τ, V) = τ^{n/2} / (2π)^{n/2} · |I + XVX^t|^{−1/2} exp{−(τ/2) y^t(XVX^t + I)^{−1}y}. (7.7)

Hence,

f(τ|y, V) ∝ f(y|τ, V) f(τ|κ_1, κ_2) ∝ τ^{(n+κ_1−2)/2} exp{−(τ/2)[y^t(XVX^t + I)^{−1}y + κ_2]},

which shows that

τ|y, κ_1, κ_2 ∼ Gamma((n + κ_1)/2, [y^t(XVX^t + I)^{−1}y + κ_2]/2).


Non-informative prior In practice there is usually no information guiding the choice of κ_1 and κ_2, and thus a non-informative prior is preferred. The most widely used such prior is the Jeffreys prior,

widely used such prior is the Jeffreys prior,

f(τ) ∝ 1/τ. (7.8)

It is improper since the integral of the density function is not finite. However, it

can be viewed as the limit of a sequence of proper gamma priors for τ and thus

written as

τ |κ1, κ2 ∼ Gamma(κ1/2, κ2/2), κ1 ↓ 0, κ2 ↓ 0. (7.9)

The posterior for τ is still proper and given by

τ |y ∼ Gamma(n/2,yt(XVX t + I)−1y/2).

7.2.2 Bayes Factors for Bayesian Linear Regression

The Bayes factor is defined as the ratio of the marginal likelihoods of two models.

In practice, it suffices to calculate only the null-based Bayes factor, which is defined

as

BF_null(M) := f(y|M) / f(y|M_0), (7.10)

where M denotes the model of interest specified by (7.4), (7.5) and (7.6), and M_0 denotes the null model where we assume β = 0. Explicitly, if τ is known, the null


model is simply

y ∼ MVN(0, τ−1I);

if τ is unknown, the null model is

y ∼ MVN(0, τ−1I),

τ |κ1, κ2 ∼ Gamma(κ1/2, κ2/2).

Clearly, if we want to compare two non-null models M_1 and M_2, the corresponding Bayes factor is the ratio of the two null-based Bayes factors:

BF(M_1 : M_2) := f(y|M_1) / f(y|M_2) = [f(y|M_1)/f(y|M_0)] / [f(y|M_2)/f(y|M_0)] = BF_null(M_1) / BF_null(M_2).

We will use the model parameters, including the design matrix X, to denote a

model. For example, if τ is unknown, we write M = (X,V , κ1, κ2).

Known error variance If τ is known, the null-based Bayes factor can be com-

puted from (7.7).

BF_null(X, τ, V) = f(y|τ, V) / f(y|τ) = |I + XVX^t|^{−1/2} exp{(τ/2)[y^ty − y^t(XVX^t + I)^{−1}y]}.

By Lemma 7.2,

y^t(XVX^t + I)^{−1}y = y^ty − y^tX(X^tX + V^{−1})^{−1}X^ty.

Hence,

BF_null(X, τ, V) = |I + X^tXV|^{−1/2} exp{(τ/2) y^tX(X^tX + V^{−1})^{−1}X^ty},


where we have also applied Lemma 7.4.

Unknown error variance If τ is unknown, all we need is to integrate out τ from (7.7).

f(y|V, κ_1, κ_2) = ∫ f(y|τ, V) f(τ|κ_1, κ_2) dτ

= (κ_2/2)^{κ_1/2} / [Γ(κ_1/2)(2π)^{n/2}] · |I + XVX^t|^{−1/2} ∫ τ^{(n+κ_1−2)/2} exp{−(τ/2)[y^t(XVX^t + I)^{−1}y + κ_2]} dτ

= (κ_2/2)^{κ_1/2} Γ((n + κ_1)/2) / [Γ(κ_1/2)(2π)^{n/2}] · |I + XVX^t|^{−1/2} {[y^t(XVX^t + I)^{−1}y + κ_2]/2}^{−(n+κ_1)/2}.

Similarly, for the null model,

f(y|κ_1, κ_2) = (κ_2/2)^{κ_1/2} Γ((n + κ_1)/2) / [Γ(κ_1/2)(2π)^{n/2}] · {(y^ty + κ_2)/2}^{−(n+κ_1)/2}.

Hence,

BF_null(X, V, κ_1, κ_2) = |I + X^tXV|^{−1/2} {[y^t(XVX^t + I)^{−1}y + κ_2] / (y^ty + κ_2)}^{−(n+κ_1)/2}. (7.11)
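A direct implementation of (7.11) on the log scale, using Lemma 7.2 and Sylvester's formula for the determinant, is sketched below (the data and prior settings are simulated and illustrative):

```python
import numpy as np

def log_bf_null(y, X, V, k1, k2):
    """log of the null-based Bayes factor (7.11), quadratic form via Lemma 7.2."""
    n, p = X.shape
    sign, logdet = np.linalg.slogdet(np.eye(p) + X.T @ X @ V)   # |I + X^t X V|
    quad = y @ y - y @ X @ np.linalg.solve(X.T @ X + np.linalg.inv(V), X.T @ y)
    return -0.5 * logdet - 0.5 * (n + k1) * (np.log(quad + k2) - np.log(y @ y + k2))

rng = np.random.default_rng(7)
n, p = 40, 2
X = rng.standard_normal((n, p))
y = X @ np.array([0.8, -0.3]) + rng.standard_normal(n)
lbf = log_bf_null(y, X, np.eye(p), 1.0, 1.0)  # strong simulated signal, so positive here
```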

Non-informative prior Under the non-informative prior (7.8), the Bayes factor

is still proper since the “improper” normalizing constants cancel out, and is given

by

BF_null(X, V) = |I + X^tXV|^{−1/2} {y^t(XVX^t + I)^{−1}y / y^ty}^{−n/2}.

By comparing with (7.11), we have

lim_{κ_1,κ_2↓0} BF_null(X, V, κ_1, κ_2) = BF_null(X, V).


Therefore the Bayes factor above can be viewed as the limit of a sequence of Bayes factors under proper priors. By Lemma 7.2, we can rewrite it as

BF_null(X, V) = |I + X^tXV|^{−1/2} {[y^ty − y^tX(X^tX + V^{−1})^{−1}X^ty] / y^ty}^{−n/2}. (7.12)

7.2.3 Controlling for Confounding Covariates

Consider the linear regression model

y = Wa+Lb+ ε, ε|τ ∼ MVN(0, τ−1I),

where L is an n × p matrix that represents the covariates of interest and W is

an n × q matrix representing the covariates to be controlled for, including the

intercept term. Equivalently this model can be written as

y|a, b, τ ∼ MVN(Wa+Lb, τ−1I). (7.13)

When calculating the null-based Bayes factor, the null model becomes

y|a, b, τ ∼ MVN(Wa, τ−1I). (7.14)

The Bayes factor with proper conjugate prior Suppose the error variance is unknown, and use the following conjugate prior for models (7.13) and (7.14):

a|τ,Va ∼ MVN(0, τ−1Va),

b|τ,Vb ∼ MVN(0, τ−1Vb),

τ |κ1, κ2 ∼ Gamma(κ1/2, κ2/2).

(7.15)


where both V_a and V_b are positive definite. To simplify the notation, define

Σ_0 := (WV_aW^t + I)^{−1},
Σ_1 := (WV_aW^t + LV_bL^t + I)^{−1}.

According to our previous calculations, we can obtain the marginal likelihoods,

f(y|V_a, κ_1, κ_2) = (κ_2/2)^{κ_1/2} Γ((n + κ_1)/2) / [Γ(κ_1/2)(2π)^{n/2}] · |Σ_0|^{1/2} [(y^tΣ_0y + κ_2)/2]^{−(n+κ_1)/2},

f(y|V_a, V_b, κ_1, κ_2) = (κ_2/2)^{κ_1/2} Γ((n + κ_1)/2) / [Γ(κ_1/2)(2π)^{n/2}] · |Σ_1|^{1/2} [(y^tΣ_1y + κ_2)/2]^{−(n+κ_1)/2}.

Hence, the Bayes factor is

BF_null(W, L, V_a, V_b, κ_1, κ_2) = (|Σ_1| / |Σ_0|)^{1/2} [(y^tΣ_1y + κ_2) / (y^tΣ_0y + κ_2)]^{−(n+κ_1)/2}.

The Bayes factor with non-informative prior By letting V_a^{−1} → 0 and κ_1, κ_2 → 0, we obtain the non-informative prior

b|τ, V_b ∼ MVN(0, τ^{−1}V_b),
f(a, τ) ∝ τ^{(q−2)/2}, (7.16)

which is the Jeffreys prior for (a, τ) [Ibrahim and Laud, 1991, O’Hagan and

Forster, 2004]. Some authors may favor a simpler form f(a, τ) ∝ 1/τ , which

is also conventionally referred to as the Jeffreys prior [Berger et al., 2001, Liang

et al., 2008]. The two forms produce essentially the same proper posterior infer-

ences when n is sufficiently large.


Under the null model b = 0 with prior (7.16),

f(y) = ∫ f(y|τ, a) f(τ, a) dτ da

= ∫ τ^{(n+q−2)/2} / (2π)^{n/2} · exp{−(τ/2)(y − Wa)^t(y − Wa)} dτ da

= (2π)^{−(n−q)/2} |W^tW|^{−1/2} ∫ τ^{(n−2)/2} exp{−(τ/2)[y^ty − y^tW(W^tW)^{−1}W^ty]} dτ

= Γ(n/2) / (2π)^{(n−q)/2} · |W^tW|^{−1/2} {[y^ty − y^tW(W^tW)^{−1}W^ty]/2}^{−n/2}.

Under the alternative model, since y|τ, a, V_b ∼ MVN(Wa, τ^{−1}(I + LV_bL^t)), we have

f(y|τ, a, V_b) = τ^{n/2} / (2π)^{n/2} · |I + LV_bL^t|^{−1/2} exp{−(τ/2)(y − Wa)^t(LV_bL^t + I)^{−1}(y − Wa)}.

Letting Σ_2 := (I + LV_bL^t)^{−1},

f(y|V_b) = ∫ f(y|τ, a, V_b) f(τ, a) dτ da

= Γ(n/2) / (2π)^{(n−q)/2} · |W^tΣ_2W|^{−1/2} |Σ_2|^{1/2} {[y^tΣ_2y − y^tΣ_2W(W^tΣ_2W)^{−1}W^tΣ_2y]/2}^{−n/2}.

To simplify the notation, first define

P := I − W(W^tW)^{−1}W^t.

Next we claim

Σ_2 − Σ_2W(W^tΣ_2W)^{−1}W^tΣ_2 = P − PL(L^tPL + V_b^{−1})^{−1}L^tP. (7.17)

To prove this, consider (I + φ^{−1}WW^t + LV_bL^t)^{−1} with φ > 0. By the Woodbury identity,

(I + φ^{−1}WW^t + LV_bL^t)^{−1} = Σ_2 − Σ_2W(W^tΣ_2W + φI)^{−1}W^tΣ_2,
(I + φ^{−1}WW^t + LV_bL^t)^{−1} = P_φ − P_φL(L^tP_φL + V_b^{−1})^{−1}L^tP_φ,

where

P_φ := (I + φ^{−1}WW^t)^{−1} = I − W(W^tW + φI)^{−1}W^t.

Notice that both (I + φ^{−1}WW^t) and (I + LV_bL^t) are invertible by checking their eigenvalues. Now let φ ↓ 0. The limit clearly exists since (W^tΣ_2W) is invertible by the positive definiteness of V_b, and lim_{φ→0} P_φ = P. Thus by the uniqueness of the limit we have obtained (7.17).

Letting X be the residuals of L after regressing out W, i.e.,

X = PL,

by the idempotence of P we can rewrite (7.17) as

Σ_2 − Σ_2W(W^tΣ_2W)^{−1}W^tΣ_2 = P − X(X^tX + V_b^{−1})^{−1}X^t.


The ratio of the two determinant terms in the marginal likelihoods is

|W^tΣ_2W|^{−1/2}|Σ_2|^{1/2} / |W^tW|^{−1/2} = |W^tΣ_2W(W^tW)^{−1}|^{−1/2}|Σ_2|^{1/2}

= |I − W^tL(L^tL + V_b^{−1})^{−1}L^tW(W^tW)^{−1}|^{−1/2}|Σ_2|^{1/2}

= |I − (I − P)L(L^tL + V_b^{−1})^{−1}L^t|^{−1/2}|Σ_2|^{1/2}

= |I + PL(L^tL + V_b^{−1})^{−1}L^t(I + LV_bL^t)|^{−1/2}

= |I + PLV_bL^t|^{−1/2}

= |I + V_bX^tX|^{−1/2}.

In the third and last equalities we have used Sylvester’s determinant formula.

Finally we obtain the expression for the Bayes factor,

BF_null(W, L, V_b) = |I + X^tXV_b|^{−1/2} {[y^tPy − y^tX(X^tX + V_b^{−1})^{−1}X^ty] / (y^tPy)}^{−n/2}. (7.18)

One can also check that

BF_null(W, L, V_b) = lim_{V_a^{−1}→0, κ_1,κ_2↓0} BF_null(W, L, V_a, V_b, κ_1, κ_2).

If we define the residuals of y after regressing out W by y_w := Py, we can rewrite (7.18) as

BF_null(W, L, V_b) = |I + X^tXV_b|^{−1/2} {[y_w^t y_w − y_w^t X(X^tX + V_b^{−1})^{−1}X^t y_w] / (y_w^t y_w)}^{−n/2},

which has exactly the same form as (7.12). Hence it suffices to discuss the model (7.4) with no loss of generality. It also reveals that BF_null defined in (7.18) is invariant to the following transformation of y,

T(y) = c(y + Wα),  ∀c ≠ 0,  α ∈ R^q,

which is a very convenient feature in simulation studies.
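The invariance claim is easy to confirm numerically; the sketch below implements (7.18) on the log scale and applies an arbitrary transformation T(y) (all numerical settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n, q, p = 30, 2, 2
W = np.column_stack([np.ones(n), rng.standard_normal(n)])  # intercept + one covariate
Lmat = rng.standard_normal((n, p))                         # covariates of interest
y = rng.standard_normal(n)
Vb = np.eye(p)

P = np.eye(n) - W @ np.linalg.solve(W.T @ W, W.T)  # projection away from col(W)
Xr = P @ Lmat                                      # X = PL, residuals of L

def log_bf(y):
    """log BF_null of (7.18)."""
    yw = P @ y
    logdet = np.linalg.slogdet(np.eye(p) + Xr.T @ Xr @ Vb)[1]
    quad = yw @ yw - yw @ Xr @ np.linalg.solve(Xr.T @ Xr + np.linalg.inv(Vb), Xr.T @ yw)
    return -0.5 * logdet - 0.5 * n * (np.log(quad) - np.log(yw @ yw))

b1 = log_bf(y)
b2 = log_bf(3.7 * (y + W @ np.array([5.0, -2.0])))  # T(y) = c (y + W alpha)
```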

7.3 Big-O and Little-O Notations

The Big-O and Little-O notations are very useful for studying the limiting be-

haviour of a sequence. Depending on whether the sequence is deterministic or

stochastic, these notations have different meanings. We use O(·) and o(·) for the

deterministic sequences and Op(·) and op(·) for the stochastic sequences. The sub-

script “p” means “probabilistic”. All proofs are omitted in this section. See Cox

[2004, Chap. 3], Shao [2003, Chap. 1.5], and der Vaart [2000, Chap. 2.2] among

others for proofs and more details.

Definition 7.5. Given two sequences of real numbers {a_n} and {b_n} and a sequence of random variables {X_n}, we write

(a) a_n = O(b_n) if and only if there exist N < ∞ and C ∈ (0, ∞) such that

|a_n| ≤ C|b_n|,  ∀n > N;

(b) a_n = o(b_n) if and only if for any ε > 0, there exists an N(ε) < ∞ such that

|a_n| ≤ ε|b_n|,  ∀n > N(ε);

(c) X_n = O_p(b_n) if and only if for any δ > 0, there exist N(δ) < ∞ and C(δ) < ∞ such that

P(|X_n| > C(δ)|b_n|) < δ,  ∀n > N(δ);

(d) X_n = o_p(b_n) if and only if for any ε > 0 and δ > 0, there exists N(ε, δ) < ∞ such that

P(|X_n| > ε|b_n|) < δ,  ∀n > N(ε, δ).

In particular, if X_n = O_p(1), we say the sequence {X_n} is stochastically bounded; if X_n = o_p(1), we say X_n converges to 0 in probability. There are many rules for the operations with Big-O and Little-O symbols. The following proposition lists some important ones that will be needed in the derivation of the asymptotic distribution of log BF_null in Chap. 2.1.2.

Proposition 7.17.

(a) If P(X_n = a_n) = 1, then X_n = O_p(b_n) if and only if a_n = O(b_n).

(b) If P(X_n = a_n) = 1, then X_n = o_p(b_n) if and only if a_n = o(b_n).

(c) o_p(O(a_n)) = o_p(a_n).

(d) o_p(a_n) + o_p(a_n) = o_p(a_n).

(e) O_p(a_n) + o_p(a_n) = O_p(a_n).

(f) O_p(a_n) o_p(b_n) = o_p(a_n b_n).


7.4 Distribution of a Weighted Sum of χ^2_1 Random Variables

7.4.1 Davies' Method for Computing the Distribution Function

The characteristic function of a random variable X is defined as

φ_X(t) = E[e^{itX}],

where i is the imaginary unit. Unlike the moment-generating function, the characteristic function always exists (the defining integral is finite). Moreover, the characteristic function uniquely determines the distribution. See Durrett [2010, Chap. 2.3], Feller [1968, Chap. XV], and Resnick [2013, Chap. 9] for more information.

Consider a random variable X ∼ χ^2_1. Its characteristic function can be calculated as

φ_X(t) = E[e^{itX}] = ∫_0^∞ (1/√(2π)) x^{−1/2} exp{−(1/2 − it)x} dx = (1 − 2it)^{−1/2}. (7.19)

Note that the existence of this integral can be verified using Euler's formula,

e^{itx} = cos(tx) + i sin(tx). (7.20)

Next consider a linear combination of χ^2_1 random variables:

Q = Σ_{i=1}^p λ_i X_i,  X_i i.i.d. ∼ χ^2_1. (7.21)

We are interested in computing the distribution function of Q. A key observation is that its characteristic function is readily available.

Lemma 7.18. The characteristic function of Q = Σ_{i=1}^p λ_i X_i, where X_i i.i.d. ∼ χ^2_1, is given by

φ_Q(t) = ∏_{i=1}^p (1 − 2iλ_i t)^{−1/2}.

Proof. The result follows directly from the properties of the characteristic function:

φ_Q(t) = E[exp(it Σ_{i=1}^p λ_i X_i)]

= ∏_{i=1}^p E[exp(itλ_i X_i)]   (by the independence of X_1, . . . , X_p)

= ∏_{i=1}^p φ_{X_i}(λ_i t)

= ∏_{i=1}^p (1 − 2iλ_i t)^{−1/2}.

Given the characteristic function, we can calculate the distribution function by the so-called inversion formula. There are different versions of the formula, among which the most general is Lévy's inversion formula.

Theorem 7.19. (Lévy's inversion formula) Let φ_Y(t) be the characteristic function of a random variable Y. For a < b,

P(a < Y < b) + (1/2)[P(Y = a) + P(Y = b)] = (1/2π) lim_{T→∞} ∫_{−T}^{T} [(e^{−ita} − e^{−itb}) / (it)] φ_Y(t) dt.

See Durrett [2010, Chap. 2.3] for a proof. For the random variable Q defined in (7.21), if λ_1 ≥ · · · ≥ λ_p > 0, then we have

P(Q < c) = (1/2π) ∫_{−∞}^{∞} [(1 − e^{−itc}) / (it)] φ_Q(t) dt.

This provides a way to numerically compute the tail probability of a linear combination of χ^2_1 random variables. A more convenient method is to use Gil-Pelaez's inversion formula, which can be derived directly from Lévy's inversion formula. See the original paper, Gil-Pelaez [1951], for a proof.

Theorem 7.20. (Gil-Pelaez's inversion formula) Let φ_Y(t) be the characteristic function of a continuous random variable Y. We have

F_Y(y) = 1/2 − (1/2π) ∫_{−∞}^{∞} ℑ[e^{−ity} φ_Y(t)] / t dt,

where ℑ[·] denotes the imaginary part.

Davies [1973] showed that, using Gil-Pelaez's inversion formula,

1/2 − (1/π) Σ_{k=0}^∞ ℑ[φ_Y((k + 1/2)Δ) e^{−i(k+1/2)Δy}] / (k + 1/2)

= P(Y < y) + Σ_{n=1}^∞ (−1)^n {P(Y < y − 2πn/Δ) − P(Y > y + 2πn/Δ)}.

Hence, we may numerically compute the distribution function of Y by

P(Y < y) ≈ 1/2 − (1/π) Σ_{k=0}^K ℑ[φ_Y((k + 1/2)Δ) e^{−i(k+1/2)Δy}] / (k + 1/2). (7.22)

There are two sources of error. First, we have omitted the term

Σ_{n=1}^∞ (−1)^n {P(Y < y − 2πn/Δ) − P(Y > y + 2πn/Δ)}.


But this term can be made arbitrarily small by choosing an appropriate Δ. Second, in the summation in (7.22) there is a truncation error, since we sum up to k = K instead of k = ∞. In a later paper, Davies [1980] showed how to control this truncation error when Y is a linear combination of independent χ^2_1 random variables.
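A minimal sketch of the truncated sum (7.22) for Y = Q of (7.21), checked against a Monte Carlo estimate. The step Δ and truncation K below are ad hoc choices for illustration, not Davies' adaptive error-control rules:

```python
import numpy as np

def phi_Q(t, lam):
    """Characteristic function of Q = sum_i lam_i X_i, X_i iid chi^2_1 (Lemma 7.18)."""
    return np.prod((1 - 2j * np.multiply.outer(t, lam)) ** -0.5, axis=-1)

def cdf_davies(y, lam, delta=0.05, K=20000):
    """Truncated Gil-Pelaez sum (7.22) for P(Q < y)."""
    k = np.arange(K + 1)
    t = (k + 0.5) * delta
    terms = np.imag(phi_Q(t, lam) * np.exp(-1j * t * y)) / (k + 0.5)
    return 0.5 - terms.sum() / np.pi

lam = np.array([3.0, 1.0, 0.5])
approx = cdf_davies(5.0, lam)

# Monte Carlo reference for P(Q < 5)
rng = np.random.default_rng(9)
mc = np.mean((rng.standard_normal((200000, 3)) ** 2) @ lam < 5.0)
```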

7.4.2 Methods for Computing the Bounds for the P-values

Consider the random variable Q defined in (7.21). We now discuss methods for computing lower and upper bounds for P(Q > c). Assume λ_1 ≥ · · · ≥ λ_p > 0. The bounds are used to approximate the true p-value P(Q > c) and hence should be very easy to evaluate.

First, clearly, the upper bound can be computed by

P(Q > c) ≤ P(λ_1 χ^2_p > c). (7.23)

Similarly, we have P(Q > c) ≥ P(λ_k χ^2_k > c) for k = 1, . . . , p. Thus the lower bound can be computed by

P(Q > c) ≥ max_{1≤k≤p} P(λ_k χ^2_k > c). (7.24)

These bounds are extremely fast to evaluate; however, their accuracy may be very poor if p is large and the weights cover a very wide range. Assuming p is even, we now describe a better method for computing the bounds.


Since λ_1 ≥ λ_2,

P(λ_2 χ^2_2 > c) ≤ P(λ_1X_1 + λ_2X_2 > c) ≤ P(λ_1 χ^2_2 > c).

But χ^2_2 is just an exponential random variable with rate parameter 1/2, whose distribution function is very easy to compute and convolve. Let Y_k i.i.d. ∼ χ^2_2.

The upper and lower bounds can then be computed by

P(Q > c) ≤ P(Σ_{k=1}^{p/2} λ_{2k−1} Y_k > c);
P(Q > c) ≥ P(Σ_{k=1}^{p/2} λ_{2k} Y_k > c). (7.25)

To see why the convolution is fast, let's start from the simplest case, $p = 2$:

$$P(\lambda_1 Y_1 > c) = e^{-c/2\lambda_1}.$$

Next consider $p = 4$:

$$P(\lambda_1 Y_1 + \lambda_3 Y_2 > c) = 1 - \int_0^{c/\lambda_3} f_{\chi^2_2}(y)\, P(\lambda_1 Y_1 \leq c - \lambda_3 y)\, dy = \frac{\lambda_1}{\lambda_1 - \lambda_3} e^{-c/2\lambda_1} + \frac{\lambda_3}{\lambda_3 - \lambda_1} e^{-c/2\lambda_3}.$$

Proceeding to $p = 6$:

$$P(\lambda_1 Y_1 + \lambda_3 Y_2 + \lambda_5 Y_3 > c) = 1 - \int_0^{c/\lambda_5} f_{\chi^2_2}(y)\, P(\lambda_1 Y_1 + \lambda_3 Y_2 \leq c - \lambda_5 y)\, dy = \frac{\lambda_1^2\, e^{-c/2\lambda_1}}{(\lambda_1 - \lambda_3)(\lambda_1 - \lambda_5)} + \frac{\lambda_3^2\, e^{-c/2\lambda_3}}{(\lambda_3 - \lambda_1)(\lambda_3 - \lambda_5)} + \frac{\lambda_5^2\, e^{-c/2\lambda_5}}{(\lambda_5 - \lambda_1)(\lambda_5 - \lambda_3)}.$$


It is easy to generalize to any even $p > 0$. Letting $r = p/2$, we have

$$P\Big(\sum_{k=1}^{r} \lambda_{2k-1} Y_k > c\Big) = \sum_{k=1}^{r} \frac{\lambda_{2k-1}^{r-1}\, e^{-c/2\lambda_{2k-1}}}{\prod_{j \neq k} (\lambda_{2k-1} - \lambda_{2j-1})}.$$

Hence, the bounds given in (7.25) are also very fast to evaluate.
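The general formula above translates directly into code. A small sketch, assuming the weights passed in are distinct (ties make the partial-fraction denominators vanish):

```python
import math

def exp_mixture_tail(lams, c):
    # P(sum_k lam_k * Y_k > c) for independent Y_k ~ chi-square(2),
    # assuming all lam_k are distinct
    r = len(lams)
    total = 0.0
    for k, lam in enumerate(lams):
        den = 1.0
        for j, other in enumerate(lams):
            if j != k:
                den *= (lam - other)
        total += lam ** (r - 1) * math.exp(-c / (2 * lam)) / den
    return total
```

For $r = 1$ this reduces to $e^{-c/2\lambda_1}$, and for $r = 2$ it matches the $p = 4$ convolution worked out above.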

7.5 GCTA and Linear Mixed Model

The linear mixed model is used by GCTA [Yang et al., 2010, Lee et al., 2011, Yang et al., 2011] to infer the heritability of phenotypes. The model may be written as

$$y = X\beta + Wu + \varepsilon, \quad \varepsilon \sim \mathrm{MVN}(0, \sigma^2_\varepsilon I), \quad u \sim \mathrm{MVN}(0, \sigma^2_u I), \tag{7.26}$$

where, by a slight abuse of notation, $X$ is an $n \times q$ matrix and $W$ is an $n \times N$ matrix; $\beta$ is called the fixed effects and $u$ the random effects. Equivalently, we can write

$$y = X\beta + g + \varepsilon \sim \mathrm{MVN}(X\beta, V), \tag{7.27}$$

where

$$V = \sigma^2_\varepsilon H, \quad H = \kappa A + I, \quad A = \frac{1}{N} W W^t.$$

Hence, implicitly we have used $\kappa = N\sigma^2_u / \sigma^2_\varepsilon$. The model can be easily generalized to

$$y = X\beta + \sum_{i=1}^{r} g_i + \varepsilon, \quad g_i \sim \mathrm{MVN}(0, \kappa_i \sigma^2_\varepsilon A_i).$$
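The relatedness matrix $A = WW^t/N$ is simple to form. A toy sketch, assuming the columns of $W$ have already been standardized (this is illustrative, not GCTA's implementation):

```python
def grm(W):
    # A = (1/N) * W * W^t for an n x N standardized genotype matrix W,
    # given as a list of n rows of length N
    n, N = len(W), len(W[0])
    return [[sum(W[i][k] * W[j][k] for k in range(N)) / N
             for j in range(n)]
            for i in range(n)]

A = grm([[1.0, -1.0], [1.0, 1.0]])  # two individuals, two SNPs
```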


For ease of reading, this introduction is restricted to the one-random-effect model (7.27). GCTA estimates the variance components, $\sigma^2_\varepsilon$ and $\sigma^2_u$, by REML (restricted/residual maximum likelihood), which is the classical approach to statistical inference with linear mixed models. For a thorough treatment in book form, see Jiang [2007] and Searle et al. [2009].

7.5.1 Restricted Maximum Likelihood Estimation

To gain intuition for REML estimation, recall that for $n$ observations $y_1, \dots, y_n$, the sample variance is given by $s^2 = \mathrm{SST}/(n-1)$, where $\mathrm{SST} = \sum (y_i - \bar{y})^2$ denotes the total sum of squares. It can be shown that $s^2$ is unbiased. However, if we assume a normal distribution for the observations and compute the maximum likelihood estimator, we obtain $\hat{\sigma}^2_{\mathrm{ML}} = \mathrm{SST}/n$, which is biased. The reason is that we have one "fixed effect", the expectation of $y_i$ (denoted by $\mu_y$), which we estimate by $\bar{y}$. This estimation, intuitively speaking, costs one degree of freedom, and thus we prefer the estimator $s^2$. REML, in contrast, aims to construct a likelihood function that does not involve $\mu_y$; by maximizing this restricted likelihood we obtain an unbiased estimator for the variance.
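The degree-of-freedom argument is easy to check by simulation. A quick illustration (not part of the thesis), drawing standard normal samples so the true variance is 1:

```python
import random

random.seed(0)
n, reps = 5, 20000
ml_avg = reml_avg = 0.0
for _ in range(reps):
    ys = [random.gauss(0.0, 1.0) for _ in range(n)]
    ybar = sum(ys) / n
    sst = sum((y - ybar) ** 2 for y in ys)
    ml_avg += sst / n / reps          # sigma^2_ML = SST/n, biased downward
    reml_avg += sst / (n - 1) / reps  # s^2 = SST/(n-1), unbiased
```

With $\sigma^2 = 1$ and $n = 5$, the two averages come out near $(n-1)/n = 0.8$ and $1.0$, respectively.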

For the linear mixed model (7.26), the idea of REML is to find an $n \times q$ full-rank matrix $L_1$ and an $n \times (n-q)$ full-rank matrix $L_2$ such that

$$L_1^t X = I, \quad L_2^t X = 0,$$

and make inferences using $y_1 = L_1^t y$ and $y_2 = L_2^t y$. By (7.27),

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \sim \mathrm{MVN}\left( \begin{pmatrix} \beta \\ 0 \end{pmatrix}, \begin{pmatrix} L_1^t V L_1 & L_1^t V L_2 \\ L_2^t V L_1 & L_2^t V L_2 \end{pmatrix} \right).$$


The matrix $L_2$ is used to control for the loss of degrees of freedom caused by $X$. Now define the restricted likelihood as the likelihood for $\kappa$ and $\sigma^2_\varepsilon$ given the observation $y_2$. We write the log-restricted-likelihood as

$$l_r \overset{\text{def.}}{=} \log L(\sigma^2_\varepsilon, \kappa; y_2) = -\frac{n-q}{2}\log(2\pi) - \frac{1}{2}\log|L_2^t V L_2| - \frac{1}{2} y^t L_2 (L_2^t V L_2)^{-1} L_2^t y.$$

Assuming $X$ has full rank, we have by Lemma 7.16

$$P_H \overset{\text{def.}}{=} H^{-1} - H^{-1}X(X^t H^{-1} X)^{-1} X^t H^{-1} = L_2 (L_2^t H L_2)^{-1} L_2^t. \tag{7.28}$$

The determinant term can also be transformed so that $L_2$ does not enter the calculation of the derivatives. By Lemma 7.3,

$$|L^t H L| = |L_2^t H L_2|\, |L_1^t H L_1 - L_1^t H L_2 (L_2^t H L_2)^{-1} L_2^t H L_1|,$$

where $L = (L_1, L_2)$. By (7.28), the second factor on the r.h.s. is equal to $|(X^t H^{-1} X)^{-1}|$. Thus,

$$\log|L^t H L| = \log|L_2^t H L_2| - \log|X^t H^{-1} X|.$$

Since both $L$ and $H$ are square matrices of full rank,

$$\log|L^t H L| = \log|H| + \log|L^t L|,$$

which yields

$$\log|L_2^t H L_2| = \log|H| + \log|L^t L| + \log|X^t H^{-1} X|.$$


Now, omitting the constant terms, we can rewrite the log-restricted-likelihood as

$$l_r = -\frac{1}{2}\left( (n-q)\log\sigma^2_\varepsilon + \log|H| + \log|X^t H^{-1} X| + y^t P_H y / \sigma^2_\varepsilon \right). \tag{7.29}$$

To compute the REML estimates for $\kappa$ and $\sigma^2_\varepsilon$, we need to differentiate $l_r$. For $\sigma^2_\varepsilon$, we have

$$\frac{\partial l_r}{\partial \sigma^2_\varepsilon} = -\frac{1}{2}\left( \frac{n-q}{\sigma^2_\varepsilon} - \frac{y^t P_H y}{\sigma^4_\varepsilon} \right).$$

For $\kappa$, using the matrix differentiation rule $\frac{\partial \log|M|}{\partial x} = \mathrm{tr}\big(M^{-1}\frac{\partial M}{\partial x}\big)$, we obtain, after some lengthy calculation,

$$\frac{\partial l_r}{\partial \kappa} = -\frac{1}{2}\left[ \mathrm{tr}(P_H A) - \frac{1}{\sigma^2_\varepsilon} y^t P_H A P_H y \right].$$

The REML estimates for $\kappa$ and $\sigma^2_\varepsilon$ are then obtained by solving

$$\left.\frac{\partial l_r}{\partial \sigma^2_\varepsilon}\right|_{\hat{\sigma}^2_\varepsilon} = 0, \quad \left.\frac{\partial l_r}{\partial \kappa}\right|_{\hat{\kappa}} = 0. \tag{7.30}$$

However, these equations cannot be solved analytically (note that $P_H$ depends on the parameter $\kappa$).

7.5.2 Newton-Raphson's Method for Computing REML Estimates

To solve (7.30), the standard approach is the Newton-Raphson optimization method [Chong and Zak, 2013, Chap. 9]. Let $\theta = (\kappa, \sigma^2_\varepsilon)$. We start from an initial guess $\theta^{(0)}$ and


then update it by

$$\theta^{(k+1)} = \theta^{(k)} - \left[\frac{\partial^2 l_r}{\partial \theta\, \partial \theta^t}\right]^{-1} \frac{\partial l_r}{\partial \theta},$$

where $\frac{\partial^2 l_r}{\partial \theta\, \partial \theta^t}$ is the Hessian matrix. In most statistical applications, the Hessian matrix is replaced by the negative of either the observed Fisher information matrix, $J$, or the expected Fisher information matrix, $I$. The corresponding iteration formulae are given by

$$\theta^{(k+1)} = \theta^{(k)} + J(\theta^{(k)})^{-1}\frac{\partial l_r}{\partial \theta}; \qquad \theta^{(k+1)} = \theta^{(k)} + I(\theta^{(k)})^{-1}\frac{\partial l_r}{\partial \theta}.$$
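The update rule, in its one-dimensional form, looks like the sketch below. The toy target (the ML variance of a zero-mean normal sample, whose solution $s/n$ is known in closed form) is an illustration I introduce here, not GCTA's REML code.

```python
def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=100):
    # theta^{(k+1)} = theta^{(k)} - score(theta) / hessian(theta), 1-D case
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# toy model: n zero-mean normal observations with sum of squares s;
# l(theta) = -(n/2) log(theta) - s/(2 theta), whose MLE is theta = s/n
n, s = 10, 25.0
score = lambda t: -n / (2 * t) + s / (2 * t * t)      # dl/dtheta
hessian = lambda t: n / (2 * t * t) - s / (t ** 3)    # d^2 l/dtheta^2
theta_hat = newton_raphson(score, hessian, theta0=1.0)
```

Starting from 1.0, the iterates climb to the known root $s/n = 2.5$ in a handful of steps.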

For our problem, we can calculate

$$J_{\sigma^2_\varepsilon \sigma^2_\varepsilon} = -\frac{\partial^2 l_r}{\partial (\sigma^2_\varepsilon)^2} = \frac{1}{2}\left( -\frac{n-q}{\sigma^4_\varepsilon} + \frac{2}{\sigma^6_\varepsilon}\, y^t P_H y \right),$$

$$J_{\kappa\kappa} = -\frac{\partial^2 l_r}{\partial \kappa^2} = \frac{1}{2}\left[ -\mathrm{tr}(P_H A P_H A) + \frac{2}{\sigma^2_\varepsilon}\, y^t P_H A P_H A P_H y \right],$$

$$J_{\kappa\sigma^2_\varepsilon} = -\frac{\partial^2 l_r}{\partial \kappa\, \partial \sigma^2_\varepsilon} = \frac{1}{2}\left( \frac{1}{\sigma^4_\varepsilon}\, y^t P_H A P_H y \right),$$

$$I_{\sigma^2_\varepsilon \sigma^2_\varepsilon} = -\mathrm{E}\left[\frac{\partial^2 l_r}{\partial (\sigma^2_\varepsilon)^2}\right] = \frac{1}{2}\left( \frac{n-q}{\sigma^4_\varepsilon} \right),$$

$$I_{\kappa\kappa} = -\mathrm{E}\left[\frac{\partial^2 l_r}{\partial \kappa^2}\right] = \frac{1}{2}\,\mathrm{tr}(P_H A P_H A),$$

$$I_{\kappa\sigma^2_\varepsilon} = -\mathrm{E}\left[\frac{\partial^2 l_r}{\partial \kappa\, \partial \sigma^2_\varepsilon}\right] = \frac{1}{2}\left( \frac{1}{\sigma^2_\varepsilon}\,\mathrm{tr}(P_H A) \right).$$

Noticing that $J_{\sigma^2_\varepsilon\sigma^2_\varepsilon}$ and $I_{\sigma^2_\varepsilon\sigma^2_\varepsilon}$ (and likewise $J_{\kappa\kappa}$ and $I_{\kappa\kappa}$) contain the same term with opposite signs, in practice we use the average information matrix [Gilmour et al., 1995],


which is more convenient to compute:

$$AI(\sigma^2_\varepsilon, \kappa) = \frac{1}{2} \begin{pmatrix} \frac{1}{\sigma^6_\varepsilon}\, y^t P_H y & \frac{1}{\sigma^4_\varepsilon}\, y^t P_H A P_H y \\ \frac{1}{\sigma^4_\varepsilon}\, y^t P_H A P_H y & \frac{1}{\sigma^2_\varepsilon}\, y^t P_H A P_H A P_H y \end{pmatrix}.$$

The diagonal entries are the averages of those of $J$ and $I$, while the off-diagonal entries are simply set equal to those of $J$ so that $AI$ is positive definite.

7.5.3 Details of GCTA's Implementation of REML Estimation

Parametrization. GCTA uses the parametrization $(\sigma^2_\varepsilon, \sigma^2_g)$, where

$$\sigma^2_g = \kappa\sigma^2_\varepsilon = N\sigma^2_u.$$

By defining

$$P \overset{\text{def.}}{=} \sigma^{-2}_\varepsilon P_H = V^{-1} - V^{-1}X(X^t V^{-1} X)^{-1} X^t V^{-1},$$

we have

$$\frac{\partial l_r}{\partial \sigma^2_g} = -\frac{1}{2}\left[ \mathrm{tr}(PA) - y^t P A P y \right],$$

$$AI(\sigma^2_\varepsilon, \sigma^2_g) = \frac{1}{2} \begin{pmatrix} y^t P P P y & y^t P A P P y \\ y^t P P A P y & y^t P A P A P y \end{pmatrix}.$$

Then the REML estimates for $\sigma^2_\varepsilon$ and $\sigma^2_g$ are computed by the Newton-Raphson method.


Standard Errors of the Estimates. Since maximum likelihood estimates are asymptotically normal with covariance matrix $I^{-1}$, and $AI$ is a consistent estimator for $I$, we may compute the standard errors for the estimates $\hat{\sigma}^2_\varepsilon$ and $\hat{\sigma}^2_g$ from

$$AI^{-1} = \begin{pmatrix} AI^{11} & AI^{12} \\ AI^{21} & AI^{22} \end{pmatrix}.$$

The standard errors are then

$$\mathrm{SE}(\hat{\sigma}^2_\varepsilon) = \sqrt{AI^{11}}, \quad \mathrm{SE}(\hat{\sigma}^2_g) = \sqrt{AI^{22}}.$$

Heritability Estimation. In genome-wide association studies, the matrix $W$ is composed of the dosages of the SNPs. If each column of $W$ is normalized to unit variance, then the heritability of the phenotype $y$ can be estimated by

$$\hat{h}^2 = \frac{\hat{\sigma}^2_g}{\hat{\sigma}^2_\varepsilon + \hat{\sigma}^2_g} = \frac{\hat{\kappa}}{1 + \hat{\kappa}}.$$

GCTA calls this the "variance explained by the genome-wide SNPs". Define $\sigma^2_p = \sigma^2_\varepsilon + \sigma^2_g$. Clearly $\hat{\sigma}^2_p = \hat{\sigma}^2_\varepsilon + \hat{\sigma}^2_g$ is also unbiased. To compute the standard error for $\hat{h}^2$, we use the first-order Taylor expansion,

$$\mathrm{Var}\left(\frac{\hat{\sigma}^2_g}{\hat{\sigma}^2_p}\right) \approx \left(\frac{\hat{\sigma}^2_g}{\hat{\sigma}^2_p}\right)^2 \left[ \frac{\mathrm{Var}(\hat{\sigma}^2_g)}{\hat{\sigma}^4_g} - \frac{2\,\mathrm{Cov}(\hat{\sigma}^2_g, \hat{\sigma}^2_p)}{\hat{\sigma}^2_g \hat{\sigma}^2_p} + \frac{\mathrm{Var}(\hat{\sigma}^2_p)}{\hat{\sigma}^4_p} \right],$$


where

$$\mathrm{Var}(\hat{\sigma}^2_g) = AI^{22}, \quad \mathrm{Var}(\hat{\sigma}^2_p) = AI^{11} + AI^{22} + 2\,AI^{12}, \quad \mathrm{Cov}(\hat{\sigma}^2_g, \hat{\sigma}^2_p) = AI^{22} + AI^{12}.$$
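Given REML estimates and the inverse AI matrix, the delta-method standard error is simple arithmetic. A sketch with a made-up 2×2 matrix `ai_inv` standing in for $AI^{-1}$, laid out in the $(\sigma^2_\varepsilon, \sigma^2_g)$ ordering (an assumption for illustration):

```python
import math

def heritability_with_se(s2e, s2g, ai_inv):
    # ai_inv = [[Var(s2e), Cov(s2e, s2g)], [Cov(s2e, s2g), Var(s2g)]]
    s2p = s2e + s2g
    var_g = ai_inv[1][1]
    var_p = ai_inv[0][0] + ai_inv[1][1] + 2 * ai_inv[0][1]
    cov_gp = ai_inv[1][1] + ai_inv[0][1]
    h2 = s2g / s2p
    var_h2 = h2 ** 2 * (var_g / s2g ** 2
                        - 2 * cov_gp / (s2g * s2p)
                        + var_p / s2p ** 2)
    return h2, math.sqrt(var_h2)

h2, se = heritability_with_se(0.5, 0.5, [[0.01, 0.0], [0.0, 0.01]])
```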

Calculation of the P-value. The p-value in the GCTA output is calculated by the (restricted) likelihood ratio test. The null hypothesis is $\sigma^2_g = 0$ and the alternative is $\sigma^2_g > 0$. To calculate the maximum log-restricted-likelihood under the alternative hypothesis, denoted by $l_{r1}$, we simply plug the REML estimates $\hat{\sigma}^2_\varepsilon$ and $\hat{\sigma}^2_g$ into (7.29). Similarly, the maximum log-restricted-likelihood under the null hypothesis, denoted by $l_{r0}$, can be computed by plugging in

$$\hat{\sigma}^2_{g0} = 0, \quad \hat{\sigma}^2_{\varepsilon 0} = \sum (y_i - \bar{y})^2 / (n - q).$$

Note that this is not a standard setting for the likelihood ratio test, because the null hypothesis lies on the boundary of the parameter space and thus the standard asymptotic result for the likelihood ratio test does not apply. Stram and Lee [1994] showed that asymptotically $-2(l_{r0} - l_{r1})$ follows $0.5\delta_0 + 0.5\chi^2_1$ ($\delta_0$ denotes the degenerate distribution with unit probability at 0), which was in fact a result of Self and Liang [1987]. Hence GCTA computes the p-value as

$$P = \frac{1}{2}\Pr\left(\chi^2_1 > -2(l_{r0} - l_{r1})\right). \tag{7.31}$$
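Since $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$, equation (7.31) needs no special $\chi^2$ routine. A sketch with hypothetical log-likelihood values chosen so the test statistic sits near the usual 3.84 threshold:

```python
import math

def gcta_pvalue(lr0, lr1):
    # p = 0.5 * P(chi-square(1) > -2(lr0 - lr1)),
    # using P(chi2_1 > x) = erfc(sqrt(x/2))
    stat = max(-2.0 * (lr0 - lr1), 0.0)
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))

p = gcta_pvalue(lr0=-105.0, lr1=-103.07927)  # LRT statistic ~ 3.8415
```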

However, this asymptotic result is likely to perform poorly for finite sample sizes. The methods proposed by Crainiceanu and Ruppert [2004] and Greven et al. [2012] may be considered to produce more reliable p-values.


BLUP Estimation. In most applications, we do not estimate $u$ itself but only the variance of the random effect. If one does want to estimate $u$ or $g = Wu$, the standard choice is to use the BLUPs (Best Linear Unbiased Predictors). The BLUPs have statistical properties very similar to those of the BLUEs (Best Linear Unbiased Estimators) in linear regression; why they are called "predictors" is not very clear. The idea of BLUP estimation is to compute the conditional expectation given $y_2$. For example, to estimate a quantity $a$ that follows a normal distribution, we use

$$\hat{a} = \mathrm{E}[a \mid L_2^t y] = \mathrm{Cov}(a, L_2^t y)\, \mathrm{Var}(L_2^t y)^{-1} L_2^t y.$$

Thus for $u$, $g$ and $\varepsilon$, we have

$$\hat{g} = \hat{\sigma}^2_g A P y = \hat{\kappa} A P_H y, \quad \hat{\varepsilon} = \hat{\sigma}^2_\varepsilon P y = P_H y, \quad \hat{u} = \hat{\sigma}^2_g W^t P y / N. \tag{7.32}$$

But notice that in the GCTA output, $\hat{u}_i$ is rescaled by $\sqrt{2f_i(1-f_i)}$, where $f_i$ is the minor allele frequency of the $i$-th SNP, so that it can be directly applied to the unscaled genotype data.

7.6 Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm [Hastings, 1970] is probably the most important example of Markov chain Monte Carlo (MCMC) methods. As the name MCMC suggests, it is a sampling method based on Markov chains. The reader is referred to Ross [1996], Levin et al. [2009] and Meyn and Tweedie [2012] for an introduction to Markov chains. The following result from Markov chain theory


is key to the understanding of the Metropolis-Hastings algorithm.

Proposition 7.21 (Detailed balance condition). Let $P$ be the transition matrix of a Markov chain with a countable state space $\Omega$. If a distribution $\pi$ satisfies the detailed balance condition,

$$\pi(x) P(x, y) = \pi(y) P(y, x), \quad \forall x, y \in \Omega,$$

then $\pi$ is a stationary distribution for $P$.

Proof. By the detailed balance condition, for all $y \in \Omega$,

$$\sum_{x \in \Omega} \pi(x) P(x, y) = \sum_{x \in \Omega} \pi(y) P(y, x) = \pi(y).$$

Treating $\pi$ as a row vector, we may write $\pi P = \pi$, which is the definition of a stationary distribution.

For a general state space $S$, the detailed balance condition is given by [Green and Mira, 2001]

$$\int_{(x,y) \in A \times B} \pi(dx) P(x, dy) = \int_{(x,y) \in A \times B} \pi(dy) P(y, dx), \quad \forall \text{ Borel sets } A, B \subseteq S. \tag{7.33}$$

We are now ready to formalize the Metropolis-Hastings algorithm.

Proposition 7.22. (Metropolis-Hastings algorithm) Consider a countable state

space Ω and a transition matrix Q (but we will call it the proposal matrix) such

that

• Q is irreducible and aperiodic;


• if Q(x, y) > 0, then Q(y, x) > 0.

Let $\pi$ be a probability distribution on $\Omega$. The Metropolis-Hastings algorithm defines a Markov chain that starts from some $x^{(0)}$ with $\pi(x^{(0)}) > 0$ and moves according to the following rule:

• Given the current state $x = x^{(k)}$, propose a new state $y$ according to the distribution $Q(x, \cdot)$.

• Compute the acceptance ratio

$$\alpha(x, y) = \min\left\{1, \frac{\pi(y) Q(y, x)}{\pi(x) Q(x, y)}\right\}. \tag{7.34}$$

• Set $x^{(k+1)} = y$ with probability $\alpha(x, y)$, and set $x^{(k+1)} = x^{(k)}$ with probability $1 - \alpha(x, y)$.

Then $\pi$ is the unique stationary and limiting distribution of this Markov chain.

Proof. We start the proof by checking the detailed balance condition. Let $P$ be the actual transition matrix of the Metropolis-Hastings Markov chain. For any $x, y \in \Omega$, clearly at least one of $\alpha(x, y)$ and $\alpha(y, x)$ must equal 1. Assume $\alpha(x, y) = \pi(y)Q(y, x)/\pi(x)Q(x, y)$ and $\alpha(y, x) = 1$. Then,

$$\pi(x) P(x, y) = \pi(x) Q(x, y)\, \alpha(x, y) = \pi(x) Q(x, y)\, \frac{\pi(y) Q(y, x)}{\pi(x) Q(x, y)} = \pi(y) Q(y, x) = \pi(y) P(y, x).$$

By Proposition 7.21, $\pi$ must be a stationary distribution for $P$. Let $\Omega^+ = \{x \in \Omega : \pi(x) > 0\}$. Since $\alpha(x, y)$ is always greater than 0 and $Q$ is irreducible on $\Omega$, $P$ is irreducible on $\Omega^+$. Since $Q$ is aperiodic, $P$ is also aperiodic. By the


standard Markov chain theory and in particular, the ergodic theorem [Durrett,

2010, Chap. 7], π is both the stationary and the limiting distribution for P .

We make two comments. First, the aperiodicity of $Q$ is not actually necessary: as long as for some $x, y \in \Omega^+$ we have $Q(x, y) > 0$ and $\alpha(x, y) \in (0, 1)$, $P$ must be aperiodic, since the chain may stay at $x$ for any positive number of steps with positive probability. Second, there are many other choices of the acceptance ratio such that the detailed balance condition holds. However, Peskun [1973] proved that, for a discrete state space, the acceptance ratio given in (7.34) is optimal in terms of statistical efficiency. This ratio is also called the Metropolis-Hastings ratio, or simply the Hastings ratio.

For variations of the Metropolis-Hastings algorithm, see Liu [2008] and Brooks

et al. [2011] among others.
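The three steps above can be sketched on a toy four-state target. This is an illustration with a symmetric random-walk proposal, so $Q$ cancels in the Hastings ratio (7.34):

```python
import random

random.seed(1)
target = [0.1, 0.2, 0.3, 0.4]  # pi on states {0, 1, 2, 3}

def propose(x):
    # symmetric random walk on a cycle, so Q(x, y) = Q(y, x)
    return (x + random.choice([-1, 1])) % 4

x = 0
burn, iters = 1000, 200000
counts = [0, 0, 0, 0]
for i in range(burn + iters):
    y = propose(x)
    alpha = min(1.0, target[y] / target[x])  # Hastings ratio; Q cancels
    if random.random() < alpha:
        x = y                                # accept the proposal
    if i >= burn:
        counts[x] += 1

freqs = [c / iters for c in counts]
```

After burn-in, the empirical state frequencies converge to the target distribution $\pi$, as the ergodic theorem guarantees.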

7.7 Used Real Datasets

7.7.1 Merged Intraocular Pressure Dataset

We applied for access to and downloaded two GWAS datasets from the database of Genotypes and Phenotypes (dbGaP). Both studies were funded by the National Eye Institute. One is the Ocular Hypertension Treatment Study [Kass et al., 2002] (henceforth OHTS, dbGaP accession number: phs000240.v1.p1), and the other is the National Eye Institute Human Genetics Collaboration Consortium Glaucoma Genome-Wide Association Study [Ulmer et al., 2012] (henceforth NEIGHBOR, dbGaP accession number: phs000238.v1.p1). The phenotype of interest is the intraocular pressure (IOP). The OHTS dataset only contains individuals with high IOP (≥ 21). The NEIGHBOR dataset is a case-control design for glaucoma [Ulmer


et al., 2012, Weinreb et al., 2014], in which many samples have IOP measurements, because high IOP is considered a major risk factor for glaucoma. The NEIGHBOR dataset, however, contains case samples with small IOP and control samples with large IOP. To reduce the effect of potential confounding factors, we removed those samples. We also removed samples whose IOP measurements differ by more than 10 between the two eyes, since such a large difference is likely to be caused by physical accidents. We noticed that there is an additional column defined as I(max IOP > 21) in the original phenotype file. However, this column conflicts with the IOP measurements of the two eyes for some samples. Such samples were removed as well. The average IOP of the two eyes was used as the raw phenotype.

We then performed routine quality control on the genotypes using the procedures described in Xu and Guan [2014]. OHTS and NEIGHBOR were genotyped on different SNP arrays, and in the end 301,143 SNPs genotyped in both studies passed the quality control. We then performed principal component analysis to remove outliers and extracted 3,226 subjects (740 from OHTS and 2,486 from NEIGHBOR) that clustered around the European samples in HapMap3 [The International HapMap Consortium, 2010].

We refer to this dataset as the IOP dataset in this manuscript. In the simulation study of the p-value calibration (Chap. 2.3.4), we used only the genotypes; the phenotypes were simulated.

7.7.2 Height Dataset

The Height dataset refers to the dataset used in Yang et al. [2010], from which GCTA estimated a 44.6% heritability for height. It contains 3,925 subjects and 294,831 SNPs. All individuals are of European descent and unrelated to each other; hence there is no need to control for population stratification. Strict quality control procedures had already been performed (see Yang et al. [2010]). In our simulation studies, we simply removed SNPs with missing rate > 0.01 or MAF < 0.01, leaving 274,719 SNPs.


Bibliography

Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions:

with formulas, graphs, and mathematical tables, volume 55. Courier Corpora-

tion, 1964.

Alan Agresti and Maria Kateri. Categorical data analysis. Springer, 2011.

James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993.

Gregoire Allaire, Sidi Mahmoud Kaber, and Karim Trabelsi. Numerical linear

algebra, volume 55. Springer, 2008.

Hana Lango Allen, Karol Estrada, Guillaume Lettre, Sonja I Berndt, Michael N

Weedon, Fernando Rivadeneira, Cristen J Willer, Anne U Jackson, Sailaja

Vedantam, Soumya Raychaudhuri, et al. Hundreds of variants clustered in

genomic loci and biological pathways affect human height. Nature, 467(7317):

832–838, 2010.

Christophe Andrieu and Gareth O Roberts. The pseudo-marginal approach for

efficient monte carlo computations. The Annals of Statistics, pages 697–725,

2009.


Jennifer Asimit and Eleftheria Zeggini. Rare variant association analysis methods

for complex traits. Annual review of genetics, 44:293–308, 2010.

David J Balding. A tutorial on statistical methods for population association

studies. Nature Reviews Genetics, 7(10):781–791, 2006.

Roderick D Ball. Bayesian methods for quantitative trait loci mapping based on

model selection: approximate analysis using the bayesian information criterion.

Genetics, 159(3):1351–1364, 2001.

Yael Baran, Bogdan Pasaniuc, Sriram Sankararaman, Dara G Torgerson, Christo-

pher Gignoux, Celeste Eng, William Rodriguez-Cintron, Rocio Chapela, Jean G

Ford, Pedro C Avila, et al. Fast and accurate inference of local ancestry in latino

populations. Bioinformatics, 28(10):1359–1367, 2012.

Gregory S Barsh, Gregory P Copenhaver, Greg Gibson, and Scott M Williams.

Guidelines for genome-wide association studies. PLoS genetics, 8(7):e1002812,

2012.

Maurice S Bartlett. Properties of sufficiency and statistical tests. Proceedings

of the Royal Society of London. Series A, Mathematical and Physical Sciences,

pages 268–282, 1937.

Maurice S Bartlett. A comment on D. V. Lindley’s statistical paradox. Biometrika,

44(1-2):533–534, 1957.

Johannes Bausch. On the efficient calculation of a linear combination of chi-

square random variables with an application in counting string vacua. Journal

of Physics A: Mathematical and Theoretical, 46(50):505202, 2013.


Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a prac-

tical and powerful approach to multiple testing. Journal of the Royal Statistical

Society. Series B (Methodological), pages 289–300, 1995.

James O Berger and Luis R Pericchi. The intrinsic bayes factor for linear models.

Bayesian statistics, 5:25–44, 1996a.

James O Berger and Luis R Pericchi. The intrinsic bayes factor for model selection

and prediction. Journal of the American Statistical Association, 91(433):109–

122, 1996b.

James O Berger and Thomas Sellke. Testing a point null hypothesis: the irreconcil-

ability of p values and evidence. Journal of the American statistical Association,

82(397):112–122, 1987.

James O Berger, Luis R Pericchi, JK Ghosh, Tapas Samanta, Fulvio De Santis,

JO Berger, and LR Pericchi. Objective bayesian methods for model selection:

introduction and comparison. Lecture Notes-Monograph Series, pages 135–207,

2001.

Peter J Bickel and JK Ghosh. A decomposition for the likelihood ratio statistic

and the bartlett correction–a bayesian argument. The Annals of Statistics, 18

(3):1070–1090, 1990.

Joanna M Biernacka, Rui Tang, Jia Li, Shannon K McDonnell, Kari G Rabe,

Jason P Sinnwell, David N Rider, Mariza De Andrade, Ellen L Goode, and

Brooke L Fridley. Assessment of genotype imputation methods. In BMC pro-

ceedings, volume 3, page 1. BioMed Central, 2009.

Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.


David Blackwell. Conditional expectation and unbiased sequential estimation. The

Annals of Mathematical Statistics, pages 105–110, 1947.

George EP Box. A general distribution theory for a class of likelihood criteria.

Biometrika, 36(3/4):317–346, 1949.

Karl W Broman and Terence P Speed. A model selection approach for the identi-

fication of quantitative trait loci in experimental crosses. Journal of the Royal

Statistical Society: Series B (Statistical Methodology), 64(4):641–656, 2002.

Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of

Markov Chain Monte Carlo. CRC press, 2011.

Brian L Browning and Sharon R Browning. A unified approach to genotype impu-

tation and haplotype-phase inference for large data sets of trios and unrelated

individuals. The American Journal of Human Genetics, 84(2):210–223.

Paul R Burton, David G Clayton, Lon R Cardon, Nick Craddock, Panos Deloukas,

Audrey Duncanson, Dominic P Kwiatkowski, Mark I McCarthy, Willem H

Ouwehand, Nilesh J Samani, et al. Genome-wide association study of 14,000

cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):

661–678, 2007.

William S Bush and Jason H Moore. Genome-wide association studies. PLoS

Comput Biol, 8(12):e1002822, 2012.

Peter Carbonetto and Matthew Stephens. Scalable variational inference for

bayesian variable selection in regression, and its accuracy in genetic associa-

tion studies. Bayesian analysis, 7(1):73–108, 2012.

Bradley P Carlin and Siddhartha Chib. Bayesian model choice via markov chain


monte carlo methods. Journal of the Royal Statistical Society. Series B (Method-

ological), pages 473–484, 1995.

George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury

Pacific Grove, CA, 2002.

George Casella and Christian P Robert. Rao-blackwellisation of sampling schemes.

Biometrika, 83(1):81–94, 1996.

Zhen Chen and David B Dunson. Random effects selection in linear mixed models.

Biometrics, 59(4):762–769, 2003.

Edwin KP Chong and Stanislaw H Zak. An introduction to optimization, vol-

ume 76. John Wiley & Sons, 2013.

Francis S Collins, Mark S Guyer, and Aravinda Chakravarti. Variations on a

theme: cataloging human dna sequence variation. Science, 278(5343):1580–

1581, 1997.

Psychiatric GWAS Consortium Coordinating Committee. Genomewide associa-

tion studies: history, rationale, and prospects for psychiatric disorders. Ameri-

can Journal of Psychiatry, 2009.

Karen N Conneely and Michael Boehnke. So many correlated tests, so little time!

rapid adjustment of p values for multiple correlated tests. The American Journal

of Human Genetics, 81(6):1158–1168, 2007.

EH Corder, AM Saunders, WJ Strittmatter, DE Schmechel, PC Gaskell, GW Small, AD Roses, JL Haines, and Margaret A Pericak-Vance. Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science, 261(5123):921–923, 1993.


Dennis D. Cox. The Theory of Statistics and Its Applications. 2004. unpublished

book.

Ciprian M Crainiceanu and David Ruppert. Likelihood ratio tests in linear mixed

models with one variance component. Journal of the Royal Statistical Society:

Series B (Statistical Methodology), 66(1):165–185, 2004.

Paul Damien and Stephen G Walker. Sampling truncated normal, beta, and

gamma densities. Journal of Computational and Graphical Statistics, 10(2):

206–215, 2001.

Robert B Davies. Numerical inversion of a characteristic function. Biometrika, 60

(2):415–417, 1973.

Robert B Davies. Algorithm AS 155: The distribution of a linear combination of χ² random variables. Applied Statistics, pages 323–333, 1980.

Paul IW de Bakker, Roman Yelensky, Itsik Pe’er, Stacey B Gabriel, Mark J Daly,

and David Altshuler. Efficiency and power in genetic association studies. Nature

genetics, 37(11):1217–1223, 2005.

Olivier Delaneau, Cedric Coulonges, and Jean-Francois Zagury. Shape-it: new

rapid and accurate algorithm for haplotype inference. BMC bioinformatics, 9

(1):1, 2008.

Petros Dellaportas, Jonathan J Forster, and Ioannis Ntzoufras. On bayesian model

and variable selection using mcmc. Statistics and Computing, 12(1):27–36, 2002.

Aad W van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.

B Devlin and Neil Risch. A comparison of linkage disequilibrium measures for

fine-scale mapping. Genomics, 29(2):311–322, 1995.


B Devlin, Kathryn Roeder, and Larry Wasserman. Genomic control, a new ap-

proach to genetic-based association studies. Theoretical population biology, 60

(3):155–166, 2001.

Bernie Devlin and Kathryn Roeder. Genomic control for association studies. Bio-

metrics, 55(4):997–1004, 1999.

Randal Douc and Christian P Robert. A vanilla rao–blackwellization of

metropolis–hastings algorithms. The Annals of Statistics, 39(1):261–277, 2011.

Norman R Draper and R Craig Van Nostrand. Ridge regression and james-stein

estimation: review and comments. Technometrics, 21(4):451–466, 1979.

Frank Dudbridge and Arief Gusnanto. Estimation of significance thresholds for

genomewide association scans. Genetic epidemiology, 32(3):227–234, 2008.

Richard H Duerr, Kent D Taylor, Steven R Brant, John D Rioux, Mark S Sil-

verberg, Mark J Daly, A Hillary Steinhart, Clara Abraham, Miguel Regueiro,

Anne Griffiths, et al. A genome-wide association study identifies il23r as an

inflammatory bowel disease gene. science, 314(5804):1461–1463, 2006.

Rick Durrett. Probability: theory and examples. Cambridge university press, 2010.

Douglas F Easton, Karen A Pooley, Alison M Dunning, Paul DP Pharoah, Deb-

orah Thompson, Dennis G Ballinger, Jeffery P Struewing, Jonathan Morrison,

Helen Field, Robert Luben, et al. Genome-wide association study identifies

novel breast cancer susceptibility loci. Nature, 447(7148):1087–1093, 2007.

Albert O Edwards, Robert Ritter, Kenneth J Abel, Alisa Manning, Carolien Pan-

huysen, and Lindsay A Farrer. Complement factor h polymorphism and age-

related macular degeneration. Science, 308(5720):421–424, 2005.


Lars Elden. Algorithms for the regularization of ill-conditioned least squares prob-

lems. BIT Numerical Mathematics, 17(2):134–145, 1977.

William Feller. An introduction to probability theory and its applications: volume

II, volume 3. John Wiley & Sons London-New York-Sydney-Toronto, 1968.

Ronald A Fisher. XV. The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52(02):399–433, 1919.

Charles W Fox and Stephen J Roberts. A tutorial on variational bayesian infer-

ence. Artificial intelligence review, 38(2):85–95, 2012.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statis-

tical learning, volume 1. Springer series in statistics Springer, Berlin, 2001.

Francis Galton. Natural inheritance. Macmillan, 1894.

Eric R Gamazon, Heather E Wheeler, Kaanan P Shah, Sahar V Mozaffari, Keston

Aquino-Michaels, Robert J Carroll, Anne E Eyler, Joshua C Denny, Dan L

Nicolae, Nancy J Cox, et al. A gene-based association method for mapping

traits using reference transcriptome data. Nature genetics, 47(9):1091–1098,

2015.

Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian

data analysis, volume 2. Chapman & Hall/CRC Boca Raton, FL, USA, 2014.

Edward I George and Robert E McCulloch. Variable selection via gibbs sampling.

Journal of the American Statistical Association, 88(423):881–889, 1993.

Edward I George and Robert E McCulloch. Approaches for bayesian variable

selection. Statistica sinica, pages 339–373, 1997.


M Gielen, PJ Lindsey, Catherine Derom, HJM Smeets, NY Souren, ADC

Paulussen, R Derom, and JG Nijhuis. Modeling genetic and environmental

factors to increase heritability and ease the identification of candidate genes for

birth weight: a twin study. Behavior genetics, 38(1):44–54, 2008.

J Gil-Pelaez. Note on the inversion theorem. Biometrika, 38(3-4):481–482, 1951.

Arthur R Gilmour, Robin Thompson, and Brian R Cullis. Average information

reml: an efficient algorithm for variance parameter estimation in linear mixed

models. Biometrics, pages 1440–1450, 1995.

Simon J Godsill. On the relationship between MCMC model uncertainty methods.

Cambridge University, Engineering, Department, 1998.

Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU

Press, 2012.

IJ Good. The bayes/non-bayes compromise: A brief review. Journal of the Amer-

ican Statistical Association, 87(419):597–606, 1992.

Steven N Goodman. Toward evidence-based medical statistics. 2: The bayes

factor. Annals of internal medicine, 130(12):1005–1013, 1999.

Brian Gough. GNU scientific library reference manual. Network Theory Ltd.,

2009.

Peter J Green. Reversible jump markov chain monte carlo computation and

bayesian model determination. Biometrika, 82(4):711–732, 1995.

Peter J Green and Antonietta Mira. Delayed rejection in reversible jump

metropolis–hastings. Biometrika, 88(4):1035–1053, 2001.

219

Sonja Greven, Ciprian M Crainiceanu, Helmut Küchenhoff, and Annette Peters. Restricted likelihood ratio testing for zero variance components in linear mixed models. Journal of Computational and Graphical Statistics, 2012.

Justin Grimmer. An introduction to Bayesian inference via variational approximations. Political Analysis, 19(1):32–47, 2011.

Yongtao Guan and Stephen M Krone. Small-world MCMC and convergence to multi-modal distributions: from slow mixing to fast mixing. The Annals of Applied Probability, 17(1):284–304, 2007.

Yongtao Guan and Matthew Stephens. Practical issues in imputation-based association mapping. PLoS Genetics, 4(12):e1000279, 2008.

Yongtao Guan and Matthew Stephens. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, pages 1780–1815, 2011.

Jonathan L Haines, Michael A Hauser, Silke Schmidt, William K Scott, Lana M Olson, Paul Gallins, Kylee L Spencer, Shu Ying Kwan, Maher Noureddine, John R Gilbert, et al. Complement factor H variant increases the risk of age-related macular degeneration. Science, 308(5720):419–421, 2005.

Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya. Inequalities. Cambridge University Press, 1952.

DA Harville. Matrix algebra from a statistician's perspective. Springer-Verlag, New York, 1997.

W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

Joel N Hirschhorn and Mark J Daly. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 6(2):95–108, 2005.

James P Hobert and George Casella. The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91(436):1461–1473, 1996.

Arthur E Hoerl and Robert W Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970a.

Arthur E Hoerl and Robert W Kennard. Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82, 1970b.

Peter D Hoff. A first course in Bayesian statistical methods. Springer Science & Business Media, 2009.

Leslie Hogben. Handbook of linear algebra. CRC Press, 2006.

Clive J Hoggart, John C Whittaker, Maria De Iorio, and David J Balding. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics, 4(7):e1000130, 2008.

F Hoti and MJ Sillanpää. Bayesian mapping of genotype × expression interactions in quantitative and qualitative traits. Heredity, 97(1):4–18, 2006.

Bryan N Howie, Peter Donnelly, and Jonathan Marchini. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics, 5(6):e1000529, 2009.

Xichen Huang, Jin Wang, and Feng Liang. A variational algorithm for Bayesian variable selection. arXiv preprint arXiv:1602.07640, 2016.

Pirro G Hysi, Ching-Yu Cheng, Henriet Springelkamp, Stuart Macgregor, Jessica N Cooke Bailey, Robert Wojciechowski, Veronique Vitart, Abhishek Nag, Alex W Hewitt, Rene Hohn, et al. Genome-wide analysis of multi-ancestry cohorts identifies new loci influencing intraocular pressure and susceptibility to glaucoma. Nature Genetics, 46(10):1126–1130, 2014.

Joseph G Ibrahim and Purushottam W Laud. On Bayesian analysis of generalized linear models using Jeffreys's prior. Journal of the American Statistical Association, 86(416):981–986, 1991.

Iuliana Ionita-Laza, Seunggeun Lee, Vlad Makarov, Joseph D Buxbaum, and Xihong Lin. Sequence kernel association tests for the combined effect of rare and common variants. The American Journal of Human Genetics, 92(6):841–853, 2013.

H Ishwaran and JS Rao. Bayesian nonparametric MCMC for large variable selection problems. Unpublished manuscript, 2000.

Hemant Ishwaran and J Sunil Rao. Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, pages 730–773, 2005.

Hemant Ishwaran and J Sunil Rao. Detecting differentially expressed genes in microarrays using Bayesian model selection. Journal of the American Statistical Association, 2011.

Anne-Sophie Jannot, Georg Ehret, and Thomas Perneger. P < 5 × 10−8 has emerged as a standard of statistical significance for genome-wide association studies. Journal of Clinical Epidemiology, 68(4):460–465, 2015.

Harold Jeffreys. The theory of probability. Oxford University Press, 1961.

Jiming Jiang. Linear and generalized linear mixed models and their applications. Springer Science & Business Media, 2007.

Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

MA Kass, DK Heuer, EJ Higginbotham, et al. The Ocular Hypertension Treatment Study: a randomized trial determines that topical ocular hypotensive medication delays or prevents the onset of primary open-angle glaucoma. Archives of Ophthalmology, 120(6):701–713, 2002. doi: 10.1001/archopht.120.6.701. URL http://dx.doi.org/10.1001/archopht.120.6.701.

Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.

Sekar Kathiresan, Olle Melander, Candace Guiducci, Aarti Surti, Noel P Burtt, Mark J Rieder, Gregory M Cooper, Charlotta Roos, Benjamin F Voight, Aki S Havulinna, et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nature Genetics, 40(2):189–197, 2008.

Hormuzd A Katki. Invited commentary: evidence-based evaluation of p values and Bayes factors. American Journal of Epidemiology, 168(4):384–388, 2008.

Riika Kilpikari and Mikko J Sillanpää. Bayesian analysis of multilocus association in quantitative and qualitative traits. Genetic Epidemiology, 25(2):122–135, 2003.

Robert J Klein, Caroline Zeiss, Emily Y Chew, Jen-Yue Tsai, Richard S Sackler, Chad Haynes, Alice K Henning, John Paul SanGiovanni, Shrikant M Mane, Susan T Mayne, et al. Complement factor H polymorphism in age-related macular degeneration. Science, 308(5720):385–389, 2005.

Femke CC Klouwer, Kevin Berendse, Sacha Ferdinandusse, Ronald JA Wanders, Marc Engelen, et al. Zellweger spectrum disorders: clinical overview and management approach. Orphanet Journal of Rare Diseases, 10(1):1, 2015.

Karl-Rudolf Koch. Introduction to Bayesian statistics. Springer Science & Business Media, 2007.

Lynn Kuo and Bani Mallick. Variable selection for regression models. Sankhyā: The Indian Journal of Statistics, Series B, pages 65–81, 1998.

Lydia Coulter Kwee, Dawei Liu, Xihong Lin, Debashis Ghosh, and Michael P Epstein. A powerful and flexible multilocus association test for quantitative traits. The American Journal of Human Genetics, 82(2):386–397, 2008.

Eric S Lander. The new genomics: global views of biology. Science, 274(5287):536, 1996.

Michael Lavine and Mark J Schervish. Bayes factors: what they are and what they are not. The American Statistician, 53(2):119–122, 1999.

DN Lawley. A general method for approximating to the distribution of likelihood ratio criteria. Biometrika, 43(3/4):295–303, 1956.

Sang Hong Lee, Naomi R Wray, Michael E Goddard, and Peter M Visscher. Estimating missing heritability for disease from genome-wide association studies. The American Journal of Human Genetics, 88(3):294–305, 2011.

Seunggeung Lee, Goncalo R Abecasis, Michael Boehnke, and Xihong Lin. Rare-variant association analysis: study designs and statistical tests. The American Journal of Human Genetics, 95(1):5–23, 2014.

David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and mixing times. American Mathematical Society, 2009.

Bingshan Li and Suzanne M Leal. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics, 83(3):311–321, 2008.

Jiahan Li, Kiranmoy Das, Guifang Fu, Runze Li, and Rongling Wu. The Bayesian lasso for genome-wide association studies. Bioinformatics, 27(4):516–523, 2011.

Feng Liang, Rui Paulo, German Molina, Merlise A Clyde, and Jim O Berger. Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 2008. ISSN 0162-1459.

Dennis V Lindley. A statistical paradox. Biometrika, pages 187–192, 1957.

Jun S Liu. Monte Carlo strategies in scientific computing. Springer Science & Business Media, 2008.

Eugene Lukacs and Edgar P King. A property of the normal distribution. The Annals of Mathematical Statistics, 25(2):389–394, 1954.

David J Lunn, John C Whittaker, and Nicky Best. A Bayesian toolkit for genetic association studies. Genetic Epidemiology, 30(3):231–247, 2006.

Stuart Macgregor, Belinda K Cornes, Nicholas G Martin, and Peter M Visscher. Bias, precision and heritability of self-reported and clinically measured height in Australian twins. Human Genetics, 120(4):571–580, 2006.

Brian K Maples, Simon Gravel, Eimear E Kenny, and Carlos D Bustamante. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. The American Journal of Human Genetics, 93(2):278–288, 2013.

Jonathan Marchini, Bryan Howie, Simon Myers, Gil McVean, and Peter Donnelly. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics, 39(7):906–913, 2007.

Kantilal Varichand Mardia, John T Kent, and John M Bibby. Multivariate analysis. 1980.

Eden R Martin, Eric H Lai, John R Gilbert, Allison R Rogala, AJ Afshari, John Riley, KL Finch, JF Stevens, KJ Livak, Brandon D Slotterbeck, et al. SNPing away at complex diseases: analysis of single-nucleotide polymorphisms around APOE in Alzheimer disease. The American Journal of Human Genetics, 67(2):383–394, 2000.

Mark I McCarthy, Goncalo R Abecasis, Lon R Cardon, David B Goldstein, Julian Little, John PA Ioannidis, and Joel N Hirschhorn. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics, 9(5):356–369, 2008.

THE Meuwissen and ME Goddard. Mapping multiple QTL using linkage disequilibrium and linkage analysis information and multitrait data. Genet. Sel. Evol., 36:261–279, 2004.

THE Meuwissen, BJ Hayes, and ME Goddard. Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157(4):1819–1829, 2001.

Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012.

Alan Miller. Subset selection in regression. CRC Press, 2002.

Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.

Jesper Møller, Anthony N Pettitt, R Reeves, and Kasper K Berthelsen. An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451–458, 2006.

Richard W Morris and Norman L Kaplan. On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genetic Epidemiology, 23(3):221–233, 2002.

Alison A Motsinger-Reif, Eric Jorgenson, Mary V Relling, Deanna L Kroetz, Richard Weinshilboum, Nancy J Cox, and Dan M Roden. Genome-wide association studies in pharmacogenomics: successes and lessons. Pharmacogenetics and Genomics, 23(8):383, 2013.

Iain Murray, Zoubin Ghahramani, and David MacKay. MCMC for doubly-intractable distributions. arXiv preprint arXiv:1206.6848, 2012.

Michael Naaman. Almost sure hypothesis testing and a resolution of the Jeffreys–Lindley paradox. Electronic Journal of Statistics, 10(1):1526–1550, 2016.

Ilja M Nolte, Andre R de Vries, Geert T Spijker, Ritsert C Jansen, Dumitru Brinza, Alexander Zelikovsky, and Gerard J te Meerman. Association testing by haplotype-sharing methods applicable to whole-genome analysis. In BMC Proceedings, volume 1, page S129. BioMed Central, 2007.

Dale R Nyholt. A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. The American Journal of Human Genetics, 74(4):765–769, 2004.

Anthony O'Hagan. Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society, Series B (Methodological), pages 99–138, 1995.

Anthony O'Hagan and Jonathan J Forster. Kendall's advanced theory of statistics, volume 2B: Bayesian inference. Arnold, 2004.

Robert B O'Hara and Mikko J Sillanpää. A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85–117, 2009.

Jun Ohashi and Katsushi Tokunaga. The power of genome-wide association studies of complex disease genes: statistical limitations of indirect approaches using SNP markers. Journal of Human Genetics, 46(8):478–482, 2001.

A Bilge Ozel, Sayoko E Moroi, David M Reed, Melisa Nika, Caroline M Schmidt, Sara Akbari, Kathleen Scott, Frank Rozsa, Hemant Pawar, David C Musch, et al. Genome-wide association study and meta-analysis of intraocular pressure. Human Genetics, 133(1):41–57, 2014.

Roman Pahl and Helmut Schäfer. PERMORY: an LD-exploiting permutation test algorithm for powerful genome-wide association testing. Bioinformatics, 26(17):2093–2100, 2010.

Orestis A Panagiotou and John PA Ioannidis. What should the genome-wide significance threshold be? Empirical replication of borderline genetic associations. International Journal of Epidemiology, 41(1):273–286, 2012.

Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.

Peter H Peskun. Optimum Monte Carlo sampling using Markov chains. Biometrika, 60(3):607–612, 1973.

Ronald L Plackett. Some theorems in least squares. Biometrika, 37(1/2):149–157, 1950.

William H Press. Numerical recipes 3rd edition: the art of scientific computing. Cambridge University Press, 2007.

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

Jonathan K Pritchard and Nancy J Cox. The allelic architecture of human disease genes: common disease–common variant or not? Human Molecular Genetics, 11(20):2417–2423, 2002.

Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.

Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul IW De Bakker, Mark J Daly, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3):559–575, 2007.

Adrian E Raftery, David Madigan, and Jennifer A Hoeting. Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92(437):179–191, 1997.

David E Reich and Eric S Lander. On the allelic spectrum of human disease. Trends in Genetics, 17(9):502–510, 2001.

Sidney I Resnick. A probability path. Springer Science & Business Media, 2013.

Sheldon M Ross. Stochastic processes, volume 2. John Wiley & Sons, New York, 1996.

Stephen Sawcer. Bayes factors in complex genetics. European Journal of Human Genetics, 18(7):746–750, 2010.

Paul Scheet and Matthew Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics, 78(4):629–644, 2006.

Angelo Scuteri, Serena Sanna, Wei-Min Chen, Manuela Uda, Giuseppe Albai, James Strait, Samer Najjar, Ramaiah Nagaraja, Marco Orru, Gianluca Usala, et al. Genome-wide association scan shows genetic variants in the FTO gene are associated with obesity-related traits. PLoS Genetics, 3(7):e115, 2007.

Shaun R Seaman and Sylvia Richardson. Equivalence of prospective and retrospective models in the Bayesian analysis of case-control studies. Biometrika, 91(1):15–25, 2004.

Shayle R Searle, George Casella, and Charles E McCulloch. Variance components, volume 391. John Wiley & Sons, 2009.

Vincent Segura, Bjarni J Vilhjalmsson, Alexander Platt, Arthur Korte, Umit Seren, Quan Long, and Magnus Nordborg. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nature Genetics, 44(7):825–830, 2012.

Steven G Self and Kung-Yee Liang. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82(398):605–610, 1987.

Thomas Sellke, M. J Bayarri, and James O Berger. Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1):62–71, 2001. doi: 10.1198/000313001300339950. URL http://dx.doi.org/10.1198/000313001300339950.

Saunak Sen and Gary A Churchill. A statistical framework for quantitative trait mapping. Genetics, 159(1):371–387, 2001.

D Serre. Matrices: theory and applications. Springer, New York, 2002.

Bertrand Servin and Matthew Stephens. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics, 3(7):e114, 2007.

Jun Shao. Mathematical statistics. Springer Texts in Statistics. Springer, second edition, 2003. ISBN 9780387953823.

Allan R. Shepard, Nasreen Jacobson, J. Cameron Millar, Iok-Hou Pang, H. Thomas Steely, Charles C. Searby, Val C. Sheffield, Edwin M. Stone, and Abbot F. Clark. Glaucoma-causing myocilin mutants require the peroxisomal targeting signal-1 receptor (PTS1R) to elevate intraocular pressure. Human Molecular Genetics, 16(6):609–617, 2007. doi: 10.1093/hmg/ddm001. URL http://hmg.oxfordjournals.org/content/16/6/609.abstract.

Ilya Shlyakhter, Pardis C Sabeti, and Stephen F Schaffner. Cosi2: an efficient simulator of exact and approximate coalescent with selection. Bioinformatics, 30(23):3427–3429, 2014.

Zbyněk Šidák. On multivariate normal probabilities of rectangles: their dependence on correlations. The Annals of Mathematical Statistics, pages 1425–1434, 1968.

Zbyněk Šidák. On probabilities of rectangles in multivariate Student distributions: their dependence on correlations. The Annals of Mathematical Statistics, pages 169–175, 1971.

Mikko J Sillanpää and Elja Arjas. Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics, 148(3):1373–1388, 1998.

Václav Šmídl and Anthony Quinn. The variational Bayes method in signal processing. Springer Science & Business Media, 2006.

David J Spiegelhalter, Nicola G Best, Bradley P Carlin, and Angelika Van Der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–639, 2002.

Eli A Stahl, Daniel Wegmann, Gosia Trynka, Javier Gutierrez-Achury, Ron Do, Benjamin F Voight, Peter Kraft, Robert Chen, Henrik J Kallberg, Fina AS Kurreeman, et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature Genetics, 44(5):483–489, 2012.

Matthew Stephens and David J Balding. Bayesian statistical methods for genetic association studies. Nature Reviews Genetics, 10(10):681–690, 2009.

John D Storey and Robert Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.

Daniel O Stram and Jae Won Lee. Variance components testing in the longitudinal mixed effects model. Biometrics, pages 1171–1177, 1994.

Wenguang Sun and Tony T Cai. Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):393–424, 2009.

Mikael Sunnåker, Alberto Giovanni Busetto, Elina Numminen, Jukka Corander, Matthieu Foll, and Christophe Dessimoz. Approximate Bayesian computation. PLoS Computational Biology, 9(1):e1002803, 2013.

James Joseph Sylvester. XXXVII. On the relation between the minor determinants of linearly equivalent quadratic functions. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1(4):295–305, 1851.

Cajo JF Ter Braak, Martin P Boer, and Marco CAM Bink. Extending Xu's Bayesian model for estimating polygenic effects using markers of the entire genome. Genetics, 170(3):1435–1438, 2005.

The International HapMap Consortium. Integrating common and rare genetic variation in diverse human populations. Nature, 467(7311):52–58, 2010.

Gilles Thomas, Kevin B Jacobs, Meredith Yeager, Peter Kraft, Sholom Wacholder, Nick Orr, Kai Yu, Nilanjan Chatterjee, Robert Welch, Amy Hutchinson, et al. Multiple loci identified in a genome-wide association study of prostate cancer. Nature Genetics, 40(3):310–315, 2008.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

Andrej Nikolaevich Tikhonov and Vasiliy Yakovlevich Arsenin. Solutions of ill-posed problems. 1977.

Michael E Tipping. Bayesian inference: an introduction to principles and practice in machine learning. In Advanced Lectures on Machine Learning, pages 41–62. Springer, 2004.

John A Todd, Neil M Walker, Jason D Cooper, Deborah J Smyth, Kate Downes, Vincent Plagnol, Rebecca Bailey, Sergey Nejentsev, Sarah F Field, Felicity Payne, et al. Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nature Genetics, 39(7):857–864, 2007.

Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. SIAM, 1997.

Pekka Uimari and Ina Hoeschele. Mapping-linked quantitative trait loci using Bayesian analysis and Markov chain Monte Carlo algorithms. Genetics, 146(2):735–743, 1997.

Megan Ulmer, Jun Li, Brian L. Yaspan, Ayse Bilge Ozel, Julia E. Richards, Sayoko E. Moroi, Felicia Hawthorne, Donald L. Budenz, David S. Friedman, Douglas Gaasterland, Jonathan Haines, Jae H. Kang, Richard Lee, Paul Lichter, Yutao Liu, Louis R. Pasquale, Margaret Pericak-Vance, Anthony Realini, Joel S. Schuman, Kuldev Singh, Douglas Vollrath, Robert Weinreb, Gadi Wollstein, Donald J. Zack, Kang Zhang, Terri Young, R. Rand Allingham, Janey L. Wiggs, Allison Ashley-Koch, and Michael A. Hauser. Genome-wide analysis of central corneal thickness in primary open-angle glaucoma cases in the NEIGHBOR and GLAUGEN consortia: the effects of CCT-associated variants on POAG risk. Investigative Ophthalmology & Visual Science, 53(8):4468, 2012. doi: 10.1167/iovs.12-9784. URL http://dx.doi.org/10.1167/iovs.12-9784.

Leonieke ME van Koolwijk, Wishal D Ramdas, M Kamran Ikram, Nomdo M Jansonius, Francesca Pasutto, Pirro G Hysi, Stuart Macgregor, Sarah F Janssen, Alex W Hewitt, Ananth C Viswanathan, et al. Common genetic determinants of intraocular pressure and primary open-angle glaucoma. PLoS Genetics, 8(5):e1002611, 2012.

Jon Wakefield. Bayes factors for genome-wide association studies: comparison with p-values. Genetic Epidemiology, 33(1):79–86, 2009.

Gertraud Malsiner Walli. Bayesian variable selection in normal regression models. PhD thesis, Institut für Angewandte Statistik, 2010.

Hui Wang, Yuan-Ming Zhang, Xinmin Li, Godfred L Masinde, Subburaman Mohan, David J Baylink, and Shizhong Xu. Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics, 170(1):465–480, 2005.

Michael N Weedon, Hana Lango, Cecilia M Lindgren, Chris Wallace, David M Evans, Massimo Mangino, Rachel M Freathy, John RB Perry, Suzanne Stevens, Alistair S Hall, et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nature Genetics, 40(5):575–583, 2008.

RN Weinreb, T Aung, and FA Medeiros. The pathophysiology and treatment of glaucoma: a review. JAMA, 311(18):1901–1911, 2014. doi: 10.1001/jama.2014.3192. URL http://dx.doi.org/10.1001/jama.2014.3192.

Daphna Weissglas-Volkov, Carlos A Aguilar-Salinas, Elina Nikkola, Kerry A Deere, Ivette Cruz-Bautista, Olimpia Arellano-Campos, Linda Liliana Munoz-Hernandez, Lizeth Gomez-Munguia, Maria Luisa Ordonez-Sanchez, Prasad MV Linga Reddy, et al. Genomic study in Mexicans identifies a new locus for triglycerides and refines European lipid loci. Journal of Medical Genetics, 50(5):298–308, 2013.

Danielle Welter, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy Hall, Heather Junkins, Alan Klemm, Paul Flicek, Teri Manolio, Lucia Hindorff, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research, 42(D1):D1001–D1006, 2014.

Samuel S Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1):60–62, 1938.

Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82–93, 2011.

Hanli Xu and Yongtao Guan. Detecting local haplotype sharing and haplotype association. Genetics, 197(3):823–838, 2014.

Shizhong Xu. Estimating polygenic effects using markers of the entire genome. Genetics, 163(2):789–801, 2003.

Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Henders, Dale R Nyholt, Pamela A Madden, Andrew C Heath, Nicholas G Martin, Grant W Montgomery, et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42(7):565–569, 2010.

Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82, 2011.

Shiming Yang and Matthias K Gobbert. The optimal relaxation parameter for the SOR method applied to a classical model problem. Technical Report TR2007-6, Department of Mathematics and Statistics, University of Maryland, Baltimore County, 2007.

Nengjun Yi. A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics, 167(2):967–975, 2004.

Nengjun Yi and Shizhong Xu. Bayesian lasso for quantitative trait loci mapping. Genetics, 179(2):1045–1055, 2008.

Nengjun Yi, Varghese George, and David B Allison. Stochastic search variable selection for identifying multiple quantitative trait loci. Genetics, 164(3):1129–1138, 2003.

Nengjun Yi, Brian S Yandell, Gary A Churchill, David B Allison, Eugene J Eisen, and Daniel Pomp. Bayesian model selection for genome-wide epistatic quantitative trait loci analysis. Genetics, 170(3):1333–1344, 2005.

David Young. Iterative methods for solving partial difference equations of elliptic type. Transactions of the American Mathematical Society, 76(1):92–111, 1954.

Eleftheria Zeggini, Laura J Scott, Richa Saxena, Benjamin F Voight, Jonathan L Marchini, Tianle Hu, Paul IW de Bakker, Goncalo R Abecasis, Peter Almgren, Gitte Andersen, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nature Genetics, 40(5):638–645, 2008.

Arnold Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno De Finetti, 6:233–243, 1986.

Arnold Zellner and Aloysius Siow. Posterior odds ratios for selected regression hypotheses. Trabajos de Estadística y de Investigación Operativa, 31(1):585–603, 1980.

Quan Zhou, Liang Zhao, and Yongtao Guan. Strong selection at MHC in Mexicans since admixture. PLoS Genetics, 12(2):e1005847, 2016.

Xiang Zhou, Peter Carbonetto, and Matthew Stephens. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics, 9(2):e1003264, 2013.