
Introductory Bayesian Analysis

Jaya M. Satagopan
Memorial Sloan-Kettering Cancer Center
Weill Cornell Medical College (Affiliate)

[email protected]

March 14, 2013

Bayesian Analysis

•  Fit probability models to observed data

•  Unknown parameters
   –  Summarize using probability distribution
   –  For example, P(mutation increases risk by 10% | data)
   –  Posterior distribution

•  Prior information
   –  External data
   –  Elicit from available data

This lecture

•  Bayes theorem
   –  Prior from external source

•  Loss function, Expected loss

•  Bayesian analysis with data-adaptive prior
   –  Minimize squared error loss

•  Bayesian penalized estimation
   –  Prior to minimize other loss functions

•  Software packages
   –  WinBUGS, SAS

Part 1. Bayes Theorem

Bayes Theorem

•  Random variables: Y and θ

•  Prior distributions: P(Y), P(θ)

•  Conditional distributions: P(Y | θ) and P(θ | Y)

•  Know P(θ | Y), P(Y), and P(θ)

•  Need P(Y | θ) [posterior distribution]

$$
P(Y \mid \theta) = \frac{P(\theta \mid Y)\,P(Y)}{P(\theta)} = \frac{P(\theta \mid Y)\,P(Y)}{\int P(\theta \mid Y)\,P(Y)\,dY}
$$

Example

Say 5% of the population has a certain disease. When a person is sick, a particular test is used to determine whether (s)he has this disease. The test gives a positive result 2% of the time when a person does not actually have the disease, and a positive result 95% of the time when the person does have the disease. Now, one person gets a positive test. What is the probability that this person has the disease?

Example continued

•  Y = 1 (disease) or 0 (no disease)
•  θ = 1 (positive test) or 0 (negative test)

KNOWN:
•  P(Y = 1) = 0.05, P(Y = 0) = 1 – P(Y = 1) = 0.95
•  P(θ = 1 | Y = 0) = 0.02, P(θ = 1 | Y = 1) = 0.95

NEED:
•  P(Y = 1 | θ = 1)

$$
P(Y = 1 \mid \theta = 1) = \frac{P(\theta = 1 \mid Y = 1)\,P(Y = 1)}{P(\theta = 1)}
= \frac{P(\theta = 1 \mid Y = 1)\,P(Y = 1)}{P(\theta = 1 \mid Y = 1)\,P(Y = 1) + P(\theta = 1 \mid Y = 0)\,P(Y = 0)}
$$

$$
= \frac{0.95 \times 0.05}{0.95 \times 0.05 + 0.02 \times 0.95} = 0.714
$$
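The arithmetic of this example can be reproduced in a few lines. The following is a minimal Python sketch; the function name `posterior_positive` and its argument names are illustrative choices and are not part of the lecture.

```python
def posterior_positive(prior, sensitivity, false_positive_rate):
    """Return P(Y = 1 | theta = 1) for a binary disease status Y and test result theta."""
    numerator = sensitivity * prior                            # P(theta=1 | Y=1) P(Y=1)
    evidence = numerator + false_positive_rate * (1.0 - prior)  # P(theta=1)
    return numerator / evidence


# Values taken from the example: 5% prevalence, 95% sensitivity, 2% false positives.
p = posterior_positive(prior=0.05, sensitivity=0.95, false_positive_rate=0.02)
print(round(p, 3))
```

Even with a fairly accurate test, the low prior probability of disease keeps the posterior probability well below the 95% sensitivity.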

Example – Breast Cancer Risk

•  Case-control sampling
   –  Cases (Y = 1) have breast cancer
   –  Controls (Y = 0) do not have breast cancer

•  Record BRCA1/2 mutation
   –  Mutation present (θ = 1) or absent (θ = 0)

•  Observe P(θ = 1 | Y = 1) and P(θ = 1 | Y = 0)
   –  Mutation frequency in cases and controls

•  Need: P(Y = 1 | θ = 1)
   –  Disease risk among mutation carriers

Satagopan et al (2001) CEBP, 10:467-473

Breast cancer risk (continued)

•  Use Bayes theorem

$$
P(Y = 1 \mid \theta = 1) = \frac{P(\theta = 1 \mid Y = 1)\,P(Y = 1)}{P(\theta = 1 \mid Y = 1)\,P(Y = 1) + P(\theta = 1 \mid Y = 0)\,P(Y = 0)}
$$

•  P(θ = 1 | Y = 1) = mutation frequency in cases
•  P(θ = 1 | Y = 0) = mutation frequency in controls

•  P(Y = 1) = 1 – P(Y = 0) = prior information

•  Get prior from external source (SEER Registry)

Breast cancer risk (continued)

Data for age group 40-49

BRCA mutation    Case    Control
Present            25         23
Absent            179       1090

•  P(θ = 1 | Y = 1) = 25/204

•  P(θ = 1 | Y = 0) = 23/1113

•  P(Y = 1) = 0.0138
   –  Disease risk in the 40-49 age group (SEER registry, http://seer.cancer.gov)

•  P(Y = 1 | θ = 1) = 7.6%
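The same calculation can be run directly on the table counts. This is a hedged Python sketch: the helper `carrier_risk` is a hypothetical name, and the inputs are the counts and the SEER-based prior quoted on the slide. With these rounded inputs the result comes out near 0.077, in line with the roughly 7.6% reported above.

```python
def carrier_risk(cases_carrier, cases_total, controls_carrier, controls_total, prior):
    """P(Y = 1 | theta = 1) from case-control mutation frequencies and an external prior."""
    freq_cases = cases_carrier / cases_total            # P(theta = 1 | Y = 1)
    freq_controls = controls_carrier / controls_total   # P(theta = 1 | Y = 0)
    numerator = freq_cases * prior
    return numerator / (numerator + freq_controls * (1.0 - prior))


# Counts for the 40-49 age group; prior P(Y = 1) = 0.0138 from the SEER registry.
risk = carrier_risk(25, 25 + 179, 23, 23 + 1090, prior=0.0138)
print(f"{risk:.4f}")
```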

Part 2. Loss function, Bayes estimate

Loss Function and Expected Loss

•  Parameter θ
•  Decision (estimate) d(Y) based on data Y
•  Loss incurred = L(d(Y), θ) ≥ 0

•  Squared error loss: L(d(Y), θ) = [d(Y) − θ]²

•  Absolute deviation: L(d(Y), θ) = |d(Y) − θ|

•  Expected loss = Risk = R(d, θ) = E{L(d(Y), θ)}

$$
R(d, \theta) = \int L\big(d(Y), \theta\big)\, f(Y \mid \theta)\, dY
$$

Bayes Estimation

•  There is no single d that has small R(d, θ) for all θ.
   –  No uniformly best d

•  Bayes approach

•  Get d that minimizes the average risk W(d), where R(d, θ) is averaged over a prior distribution G(θ).
   –  W(d) is also known as the Bayes risk

$$
W(d) = \int \int L\big(d(Y), \theta\big)\, f(Y \mid \theta)\, dY\, dG(\theta)
$$

•  Bayes estimate dB of θ: W(dB) ≤ W(d) for every d

•  For squared error loss, dB is the posterior mean of θ
   –  dB(Y) = E(θ | Y)
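A small numerical check can make the link between loss function and Bayes estimate concrete. The sketch below assumes a made-up discrete posterior for θ (the support and probabilities are purely illustrative) and searches a grid of candidate decisions. Under these assumptions the posterior expected squared error is smallest at the posterior mean, and the expected absolute deviation is smallest at the posterior median.

```python
def expected_loss(d, support, probs, loss):
    """Posterior expected loss of decision d under a discrete posterior."""
    return sum(p * loss(d, theta) for theta, p in zip(support, probs))


# Hypothetical posterior P(theta | Y) on four points (illustration only).
support = [0.0, 1.0, 2.0, 5.0]
probs = [0.4, 0.3, 0.2, 0.1]

posterior_mean = sum(theta * p for theta, p in zip(support, probs))

# Candidate decisions on a grid from 0.00 to 5.00.
candidates = [i / 100 for i in range(501)]
best_squared = min(
    candidates,
    key=lambda d: expected_loss(d, support, probs, lambda a, t: (a - t) ** 2),
)
best_absolute = min(
    candidates,
    key=lambda d: expected_loss(d, support, probs, lambda a, t: abs(a - t)),
)

print(posterior_mean, best_squared, best_absolute)
```

Minimizing the posterior expected loss for each observed Y is what minimizes the Bayes risk W(d) overall, so the grid search here stands in for the general result.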

Part 3. Bayesian analysis with data-adaptive prior parameters

GxE example

Bayesian analysis of GxE interactions

•  Case-control study: Y = 1 (case), Y = 0 (control)
•  Binary risk factors (say)
•  Genetic factor: G = 0, 1
•  Environmental exposure: E = 0, 1

•  Is there a significant interaction between G and E?

•  Estimate interaction odds ratio and standard error

•  Test: is this odds ratio = 1? Equivalently, is log(odds ratio) = 0?

Mukherjee and Chatterjee (2008). Biometrics, 64: 685-694

Interaction odds ratio (ORGE)

Y = 0 (Control data)

          E = 1    E = 0
G = 1     N011     N010
G = 0     N001     N000

Y = 1 (Case data)

          E = 1    E = 0
G = 1     N111     N110
G = 0     N101     N100

OR0 = odds ratio of E associated with G among controls
OR1 = odds ratio of E associated with G among cases

$$
OR_0 = \frac{N_{011}\,N_{000}}{N_{001}\,N_{010}}, \qquad
OR_1 = \frac{N_{111}\,N_{100}}{N_{101}\,N_{110}}, \qquad
OR_{GE} = \frac{OR_1}{OR_0}
$$

$$
\hat\beta_{GE} = \log(OR_{GE}) = \log(OR_1) - \log(OR_0) = \hat\beta_{case} - \hat\beta_{control}
$$

$$
\mathrm{Var}(\hat\beta_{GE}) = \mathrm{Var}(\hat\beta_{case}) + \mathrm{Var}(\hat\beta_{control})
$$
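The interaction log odds ratio and its variance can be computed from the two 2×2 tables as follows. This is a sketch under stated assumptions: the cell counts are invented for illustration, and the variance of each log odds ratio uses the usual large-sample (Woolf) formula, the sum of reciprocal cell counts, which the slides do not spell out.

```python
import math


def log_odds_ratio(n11, n10, n01, n00):
    """Log odds ratio and its large-sample variance for a 2x2 table.

    Arguments follow the slide layout: n11 is (G=1, E=1), n10 is (G=1, E=0),
    n01 is (G=0, E=1), n00 is (G=0, E=0).
    """
    beta = math.log((n11 * n00) / (n01 * n10))
    variance = 1 / n11 + 1 / n10 + 1 / n01 + 1 / n00  # Woolf approximation (assumption)
    return beta, variance


# Hypothetical counts, for illustration only.
beta_control, var_control = log_odds_ratio(30, 70, 120, 380)   # Y = 0
beta_case, var_case = log_odds_ratio(60, 50, 130, 360)         # Y = 1

beta_ge = beta_case - beta_control   # log(OR_GE)
var_ge = var_case + var_control      # cases and controls are independent samples

print(round(beta_ge, 3), round(var_ge, 4))
```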

Gene-Environment independence in controls

Y = 0 (Control data)

          E = 1    E = 0
G = 1     N011     N010
G = 0     N001     N000

$$
OR_0 = \frac{N_{011}\,N_{000}}{N_{001}\,N_{010}} = 1 \quad\Longrightarrow\quad OR_{GE} = OR_1
$$

$$
\mathrm{Var}(\hat\beta_{GE}) = \mathrm{Var}(\hat\beta_{case}) < \mathrm{Var}(\hat\beta_{case}) + \mathrm{Var}(\hat\beta_{control})
$$

Independence of G and E in controls is unknown. So …

•  Test: βcontrol = 0
•  If the hypothesis is rejected, estimate the interaction log odds ratio as βGE = βcase − βcontrol. Otherwise, estimate it as βGE = βcase
•  Then test whether βGE = 0 for interaction

Not a good idea !!

Weighted estimate

•  Estimate based on preliminary test T for βcontrol = 0

$$
\hat\beta_{GE,PT} = I(T < c)\,\hat\beta_{case} + I(T > c)\,\hat\beta_{GE}
$$

•  Weighted average of case-only and case-control estimates. Weights are indicator functions

•  Can do better without requiring preliminary test !!

$$
\hat\beta_{GE,w} = w\,\hat\beta_{case} + (1 - w)\,\hat\beta_{GE}
$$

•  Choose w to minimize squared error loss

•  Bayes risk:

$$
E_{data}\, E_{\beta_{GE} \mid data}\big\{ (\hat\beta_{GE,w} - \beta_{GE})^2 \big\}
$$

Bayes estimate

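•  w is a function of $\mathrm{Var}(\hat\beta_{GE})$ and $t^2 = \mathrm{Var}(\hat\beta_{GE} - \hat\beta_{case})$

Alternative explanation:

$$
\hat\beta_{GE,B} = \hat\beta_{case} + e
$$

   –  e is error due to assuming G and E independence in controls

•  An estimate of e is: $\hat e = \hat\beta_{GE} - \hat\beta_{case}$, with $\hat e \mid e \sim N(e, t^2)$
•  Prior for e: N(0, σ²).
•  Bayes estimate of e is

$$
E(e \mid \hat e) = \frac{\sigma^2}{\sigma^2 + t^2}\,\hat e
$$

•  M & C (2008) suggest estimating σ² as $\mathrm{Var}(\hat\beta_{GE})$

•  Empirical Bayes estimate:

$$
\hat\beta_{GE,B} = \hat\beta_{case} + \hat E(e \mid \hat e)
$$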

Shrinkage estimation
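The shrinkage form of the Bayes estimate, with ê = β̂GE − β̂case, e ~ N(0, σ²) and ê | e ~ N(e, t²), can be sketched as a short function. The function name and the numerical inputs below are hypothetical; σ² and t² are treated as given, and how they are estimated from data is left to the method of Mukherjee and Chatterjee (2008).

```python
def shrinkage_estimate(beta_case, beta_ge, sigma2, t2):
    """Case-only estimate plus the shrunken error term E(e | e_hat).

    sigma2 is the prior variance of e, t2 the sampling variance of e_hat.
    """
    e_hat = beta_ge - beta_case
    weight = sigma2 / (sigma2 + t2)   # shrinkage factor applied to e_hat
    return beta_case + weight * e_hat


# Hypothetical inputs, for illustration only.
beta_case, beta_ge = 0.40, 0.70
print(shrinkage_estimate(beta_case, beta_ge, sigma2=0.0, t2=0.05))   # case-only estimate
print(shrinkage_estimate(beta_case, beta_ge, sigma2=0.05, t2=0.05))  # halfway between
print(shrinkage_estimate(beta_case, beta_ge, sigma2=1e6, t2=0.05))   # close to case-control
```

When σ² is small relative to t², the estimate stays near the case-only value. When σ² is large, it moves toward the case-control value.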

Advanced Colorectal Adenoma Example

•  610 cases and 605 controls
•  G = NAT2 acetylation (yes, no)
•  E = Smoking (never, past, current)
•  Note: lack of G and E independence in controls
   –  Need case-control estimate
•  EB estimate, credible interval. Is 0 in interval?

Summary

•  Uncertainty about underlying assumption

•  Two possible estimates

•  Bayes estimate: weighted average of the two

•  Shrinkage estimation

•  Data-adaptive estimation of prior parameters –  Minimize squared error loss

Part 4. Bayesian penalized estimation

Prior to minimize various loss functions

Part 4a. Bayesian Ridge Regression

Minimize Squared Error Loss Normal Prior

GWAS data (Chen and Witte 2007, AJHG, 81: 397-404)

•  57 unrelated individuals of European ancestry (CEU)
   –  HapMap project

•  Outcome = Expression of the CHI3L2 gene
   –  Cheung et al 2005, Nature, 437: 1365-1369

•  Risk factors = 39,186 SNPs from Chromosome 1
   –  Illumina 550K array from HapMap

•  SNP rs755467 deemed causal for CHI3L2 expression

•  Goal: How well are the neighboring SNPs ranked?

Application to GWAS

•  Y = continuous (or binary) outcome, length N (subjects)
•  Xm = m-th SNP, m = 1, 2, …, M (= 500K, say)

•  For each SNP, model: Y = µm + Xmβm + error

•  βm is the effect of SNP m: obtain its MLE, standard error, and p-value

•  Find the significant SNPs

•  Find the SNPs having the 500 smallest p-values

Chen and Witte 2007. AJHG, 81: 397-404

Hierarchical modeling

•  Incorporate external information about SNPs
•  Bioinformatics data (Z matrix, user-specified)
   –  conservation, various functional categories

•  β = Zπ + U
   –  β has length G, Z is G×K, π is K×1
   –  U is N(0, t²T), T is specified

•  Improved estimation via second stage model

•  Prior for β is N(Zπ, t²T)
   –  Need {(β − Zπ)′T⁻¹(β − Zπ)}/t² to be small: Penalization

Posterior inference via MCMC

•  Markov chain Monte Carlo approach to get βs
•  Specify prior for β, π, σ²
•  π ~ N(0, *), 1/σ² ~ Gamma(**, $$)
•  Specify prior for t² or fix t²

•  Generate samples from full conditional distributions
   –  β | Y, π, σ², t², …
   –  π | Y, β, σ², t², …
   –  σ² | Y, β, π, t², …
   –  etc.

Iteration    β parameters
1            β1    β2    …    βG
2            β1    β2    …    βG
…
G            β1    β2    …    βG

Posterior Summaries

Avg(β1), Stdev(β1)
Avg(β2), Stdev(β2)
…
Avg(βG), Stdev(βG)
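The sample-then-summarize pattern above can be illustrated on a much simpler model than the hierarchical SNP model. The sketch below is a toy Gibbs sampler, not the lecture's model: it assumes y_i ~ N(µ, σ²) with priors µ ~ N(0, 100) and 1/σ² ~ Gamma(2, rate 2), and the data values, prior settings and function name are all made up for illustration. It alternates draws from the two full conditionals and then reports the posterior average and standard deviation of µ.

```python
import random
import statistics


def gibbs_normal(y, n_iter=3000, burn_in=500, prior_var=100.0, a=2.0, b=2.0, seed=1):
    """Toy Gibbs sampler for y_i ~ N(mu, sigma2).

    Priors (assumed for this sketch): mu ~ N(0, prior_var),
    1/sigma2 ~ Gamma(shape=a, rate=b).
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = sum(y) / n
    mu, sigma2 = ybar, 1.0
    draws = []
    for iteration in range(n_iter):
        # Full conditional of mu given sigma2: normal.
        precision = n / sigma2 + 1.0 / prior_var
        mean = (n * ybar / sigma2) / precision
        mu = rng.gauss(mean, precision ** -0.5)
        # Full conditional of 1/sigma2 given mu: Gamma(a + n/2, rate = b + SS/2).
        ss = sum((yi - mu) ** 2 for yi in y)
        rate = b + 0.5 * ss
        sigma2 = 1.0 / rng.gammavariate(a + 0.5 * n, 1.0 / rate)  # second argument is a scale
        if iteration >= burn_in:
            draws.append((mu, sigma2))
    return draws


y = [4.1, 5.3, 4.8, 5.9, 5.1, 4.4, 5.6, 5.0]   # hypothetical data
draws = gibbs_normal(y)
mu_draws = [d[0] for d in draws]
print("Avg(mu) =", statistics.mean(mu_draws), "Stdev(mu) =", statistics.stdev(mu_draws))
```

In practice, convergence diagnostics and a sensitivity check on the prior settings would accompany such a run.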

Chen and Witte GWAS Example

•  Plot “p-values” of top 500 SNPs

So, what is going on ?

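•  Y = µm + Xmβm + error
•  MLE of β’s: $\hat\beta = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_G)$
•  Variance: $\hat V$

•  β = Zπ + U, U ~ N(0, t²T)
•  MLE of π’s: $\hat\pi = (Z^T S Z)^{-1} Z^T S \hat\beta$, with $S = (\hat V + t^2 T)^{-1}$

•  Bayes estimate of β’s: $\tilde\beta \approx (I - W)\hat\beta + W Z \hat\pi$, with $W = \hat V S$

•  Large t²: S ≈ 0 ≈ W and $\tilde\beta \approx \hat\beta$
•  Small t²: W ≈ I and $\tilde\beta = Z\hat\pi$

Shrinkage estimation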
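The two limiting cases can be checked numerically. To keep the sketch free of matrix libraries it makes simplifying assumptions that are not in the lecture: V̂ is diagonal, T is the identity, and Z has a single column (K = 1), so every matrix expression reduces to element-wise arithmetic. All input numbers and names are hypothetical.

```python
def hierarchical_shrinkage(beta_hat, var_hat, z, t2):
    """Second-stage estimate with diagonal V-hat, T = I and one covariate column z.

    Returns (pi_hat, beta_tilde) with
      S = (V + t2 * I)^-1, pi_hat = (z' S z)^-1 z' S beta_hat,
      W = V S,             beta_tilde = (I - W) beta_hat + W z pi_hat.
    """
    s = [1.0 / (v + t2) for v in var_hat]                      # diagonal of S
    pi_hat = (sum(zi * si * bi for zi, si, bi in zip(z, s, beta_hat))
              / sum(zi * zi * si for zi, si in zip(z, s)))
    w = [v * si for v, si in zip(var_hat, s)]                  # diagonal of W
    beta_tilde = [(1.0 - wi) * bi + wi * zi * pi_hat
                  for wi, bi, zi in zip(w, beta_hat, z)]
    return pi_hat, beta_tilde


# Hypothetical first-stage estimates, variances and one prior covariate.
beta_hat = [0.9, -0.2, 0.4, 1.5]
var_hat = [0.04, 0.09, 0.02, 0.25]
z = [1.0, 0.0, 1.0, 1.0]

for t2 in (1e9, 0.05, 1e-12):
    pi_hat, beta_tilde = hierarchical_shrinkage(beta_hat, var_hat, z, t2)
    print(t2, round(pi_hat, 3), [round(b, 3) for b in beta_tilde])
```

With a very large t² the output stays at the first-stage estimates. With a very small t² every estimate collapses onto z·π̂. Intermediate values shrink the noisier estimates more strongly.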

Some Remarks

•  Sensitivity to choice of prior parameters

•  Instead of “p-value”, P(βm > 0), m = 1, …, G

•  The Bayes estimate $\tilde\beta$ must ideally not be too sensitive to the choice of Z

•  The estimated value of π will depend upon Z, but ideally the Bayes estimate should not.

Part 4b. Bayesian LASSO

Minimize Absolute deviation Laplace prior

Diabetes data (Efron et al 2004, The Annals of Statistics, 32: 407-499)

Application to the diabetes study

•  Y = continuous (or other type of) outcome (N×1)
•  X = N×p matrix of risk factors
•  β = p×1 vector of effects (parameters of interest)
•  Find the significant risk factors

•  Y = Xβ + error

•  Large p, potentially correlated risk factors, etc.

•  Estimate β by least squares with a penalty on |β − β0| for some β0 (LASSO)

•  β0 = 0 or β0 = Zπ, Z given and π must be estimated

Park and Casella (2008). J Am Stat Assoc, 103: 681-686

Bayesian LASSO

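•  |β − β0| ≈ 1 – exp{ − |β − β0| } when |β − β0| is small
•  The exponential term on the RHS takes the form of a Laplace distribution

•  Y = Xβ + error, error ~ N(0, σ²I)

•  Laplace prior for β with mean β0

•  Mixture of normal prior for β and an exponential prior for its variance

$$
f(\beta_j) = \frac{\lambda}{2\sigma} \exp\left\{ -\frac{\lambda}{\sigma}\,\lvert \beta_j - \beta_{0j} \rvert \right\}
= \int_0^\infty \frac{1}{\sqrt{2\pi\sigma^2 t^2}} \exp\left\{ -\frac{(\beta_j - \beta_{0j})^2}{2\sigma^2 t^2} \right\} \frac{\lambda^2}{2} \exp\left\{ -\frac{\lambda^2 t^2}{2} \right\} dt^2
$$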
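The normal-exponential mixture representation of the Laplace prior can be checked by simulation. The sketch below assumes β0 = 0 and arbitrary values λ = 2 and σ = 1.5 (both hypothetical). It draws t² from an exponential distribution with rate λ²/2, then β from N(0, σ²t²), and compares two sample quantities with their Laplace values: for a Laplace distribution with scale σ/λ, E|β| = σ/λ and P(|β| > σ/λ) = exp(−1).

```python
import math
import random


def draw_beta(rng, lam, sigma):
    """One draw of beta from the scale mixture: t2 ~ Exp(rate lam^2/2), beta ~ N(0, sigma^2 t2)."""
    t2 = rng.expovariate(lam ** 2 / 2.0)
    return rng.gauss(0.0, sigma * math.sqrt(t2))


rng = random.Random(2013)
lam, sigma = 2.0, 1.5          # hypothetical values
draws = [draw_beta(rng, lam, sigma) for _ in range(100_000)]

scale = sigma / lam            # Laplace scale implied by the mixture
mean_abs = sum(abs(b) for b in draws) / len(draws)
tail = sum(abs(b) > scale for b in draws) / len(draws)

print("E|beta|:", round(mean_abs, 3), "Laplace value:", scale)
print("P(|beta| > scale):", round(tail, 3), "Laplace value:", round(math.exp(-1), 3))
```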

Bayesian LASSO setup

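$$
\begin{aligned}
Y \mid \beta, \sigma^2 &\sim N(X\beta,\ \sigma^2 I) \\
\beta_j \mid \sigma^2, t_j^2 &\sim N(0,\ \sigma^2 t_j^2), \quad j = 1, \ldots, p, \text{ independent} \\
t_j^2 \mid \lambda^2 &\sim \text{exponential}, \quad j = 1, \ldots, p, \text{ independent} \\
\sigma^2 &\sim \text{Inverse Gamma}(a_1, a_2)
\end{aligned}
$$

•  tj² are latent variables to facilitate MCMC steps

•  a1 and a2 are specified (check for sensitivity)

•  λ²: empirical estimation from data, or specify a prior
   –  Generally a Gamma(c1, c2) prior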

Parameter Estimation

•  Get full conditionals, apply MCMC

•  Bayes estimate of β –  Posterior median

•  Original LASSO: quadratic programming methods

Part 4c. Other Bayesian Penalization Methods

Brief survey

Bridge Regression

•  Estimate β by minimizing (penalty term)

$$
\sum_{j=1}^{p} \lvert \beta_j - Z_j \pi \rvert^{\gamma}
$$

•  γ is pre-specified

•  γ = 1 is (Bayesian) LASSO

•  γ = 2 is (Bayesian) Ridge

Fu 1998, JCGS, 7: 397-416
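The role of γ can be seen by evaluating the bridge penalty directly. In this small sketch the function name and the numbers are illustrative, and `prior_mean` stands in for the vector Zπ.

```python
def bridge_penalty(beta, prior_mean, gamma):
    """Sum over j of |beta_j - prior_mean_j| ** gamma."""
    return sum(abs(b - m) ** gamma for b, m in zip(beta, prior_mean))


beta = [1.0, -2.0, 0.5]          # hypothetical coefficients
prior_mean = [0.0, 0.0, 0.0]     # Z * pi, taken as zero here

print(bridge_penalty(beta, prior_mean, gamma=1))   # LASSO-type penalty: sum of absolute values
print(bridge_penalty(beta, prior_mean, gamma=2))   # Ridge-type penalty: sum of squares
```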

Bayesian Elasticnet

•  Estimate β by minimizing (penalty term)

$$
\lambda \sum_{j=1}^{p} \lvert \beta_j - Z_j \pi \rvert + (1 - \lambda) \sum_{j=1}^{p} (\beta_j - Z_j \pi)^2
$$

•  Compromise between LASSO and Ridge penalties

•  Normal prior constrained within certain bounds

•  Hans (2011). J Am Stat Assoc, 106: 1383-1393
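The compromise between the two penalties can be sketched as a weighted sum. As before, the names and values are hypothetical, `prior_mean` stands in for Zπ, and `lam` plays the role of λ in the display above.

```python
def elastic_net_penalty(beta, prior_mean, lam):
    """lam * (sum of absolute deviations) + (1 - lam) * (sum of squared deviations)."""
    l1 = sum(abs(b - m) for b, m in zip(beta, prior_mean))
    l2 = sum((b - m) ** 2 for b, m in zip(beta, prior_mean))
    return lam * l1 + (1.0 - lam) * l2


coefs = [1.0, -2.0, 0.5]        # hypothetical coefficients
center = [0.0, 0.0, 0.0]        # Z * pi, taken as zero here

for lam in (1.0, 0.5, 0.0):     # pure LASSO penalty, equal mix, pure Ridge penalty
    print(lam, elastic_net_penalty(coefs, center, lam))
```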

Software Packages

•  WinBUGS
   –  Specify model for outcome
   –  Specify priors
   –  Output estimated values of β and other parameters
   –  Uses MCMC methods
   –  Diagnostic plots
   –  http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml

•  SAS Proc MCMC
   –  http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#mcmc_toc.htm

References: Textbooks

•  JS Maritz and T Lwin (1989). Empirical Bayes Methods. Chapman and Hall.

•  JM Bernardo and AFM Smith (1993). Bayesian Theory. Wiley.

•  BP Carlin and TA Louis (1996). Bayes and empirical Bayes methods for data analysis. Chapman and Hall.

•  A Gelman, JB Carlin, HS Stern, DB Rubin (1996). Bayesian data analysis. Chapman and Hall.

•  WR Gilks, S Richardson, DJ Spiegelhalter (1996). Markov chain Monte Carlo in practice. Chapman and Hall.

•  T Hastie, R Tibshirani, J Friedman (2001). The Elements of Statistical Learning. Springer.

References: Some papers

•  R Tibshirani (1996). Regression shrinkage and selection via the Lasso. JRSS – Series B, 58: 267-288.

•  WJ Fu (1998). Penalized regressions: The Bridge versus the Lasso. JCGS, 7: 397-416.

•  MA Newton and Y Lee (2000). Inferring the location and effect of tumor suppressor genes by instability-selection modeling of allelic-loss data. Biometrics 56: 1088-1097.

•  JM Satagopan, K Offit, W Foulkes, ME Robson, S Wacholder, CM Eng, SE Karp, CB Begg (2001). The lifetime risks of breast cancer in Ashkenazi Jewish carriers of BRCA1 and BRCA2 mutations. Cancer Epidemiology, Biomarkers and Prevention, 10: 467-473.

•  CM Kendziorski, MA Newton, H Lan, MN Gould (2003). On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine 22:3899-3914.

•  D Conti, V Cortessis, J Molitor, DC Thomas (2003). Bayesian modeling of complex metabolic pathways. Human Heredity, 56: 83-93.

•  B Efron, T Hastie, I Johnstone, R Tibshirani (2004). Least angle regression. The Annals of Statistics, 32: 407-451.

•  B Mukherjee, N Chatterjee (2008). Exploiting gene-environment independence for analysis of case-control studies: An empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics, 64: 685-694.


•  GK Chen, JS Witte (2007). Enriching the analysis of genome-wide association studies with hierarchical modeling. AJHG, 81: 397-404.

•  T Park, G Casella (2008). The Bayesian Lasso. JASA, 103: 681-686.

•  M Park, T Hastie (2008). Penalized logistic regression for detecting gene interactions. Biostatistics, 9: 30-50.

•  C Hans (2011). Elastic net regression modeling with the orthant normal prior. JASA, 106: 1383-1393.

Many more: Bioinformatics, Genetic Epidemiology, JASA, JRSS – Series B and C, PLoS One, …
