TRANSCRIPT
Statistical Methods for Analysis with Missing Data
Lecture 8: Introduction to Bayesian Inference
Mauricio Sadinle
Department of Biostatistics
1 / 31
Previous Lecture
EM algorithm for maximum likelihood estimation with missing data:
I Derivation and implementation of EM algorithm for categorical data
I For two categorical variables, EM under MAR is intuitive and simple
π_{kl}^{(t+1)} = (1/n) ( n_{11kl} + (π_{kl}^{(t)} / π_{k+}^{(t)}) n_{10k+} + (π_{kl}^{(t)} / π_{+l}^{(t)}) n_{01+l} + π_{kl}^{(t)} n_{00++} ),
where
n_{11kl} = ∑_i W_{ikl} I(r_i = 11),   n_{10k+} = ∑_i W_{ik+} I(r_i = 10),
n_{01+l} = ∑_i W_{i+l} I(r_i = 01),   n_{00++} = ∑_i I(r_i = 00)
I Bootstrap confidence intervals
2 / 31
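As a reminder of how this update looks in practice, here is a minimal Python sketch of the recapped EM step for the 2 × 2 case. The array layout of the counts (n11 as a 2 × 2 matrix, n10 and n01 as length-2 vectors, n00 as a scalar) is an assumption made purely for illustration.

```python
import numpy as np

# Hedged sketch of the EM update above for two binary variables.
# Assumed inputs: n11 is a 2x2 array of complete-case counts n_{11kl};
# n10 is a length-2 array of counts n_{10k+} (only Z1 observed);
# n01 is a length-2 array of counts n_{01+l} (only Z2 observed);
# n00 is the number of fully missing cases.
def em_update(pi, n11, n10, n01, n00):
    n = n11.sum() + n10.sum() + n01.sum() + n00
    row = pi.sum(axis=1, keepdims=True)      # pi_{k+}
    col = pi.sum(axis=0, keepdims=True)      # pi_{+l}
    expected = (n11
                + pi / row * n10[:, None]    # distribute n_{10k+} across l
                + pi / col * n01[None, :]    # distribute n_{01+l} across k
                + pi * n00)                  # distribute n_{00++} across k, l
    return expected / n
```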
Today’s Lecture
I Introduction to Bayesian inference. Why?
I Technique called multiple imputation is derived from a Bayesian point of view
I Variation of multiple imputation called multiple imputation by chained equations imitates Gibbs sampling – a procedure commonly used for Bayesian inference
I Bayesian inference has its own ways of handling missing data
3 / 31
Outline
Bayes’ Theorem
Bayesian Inference
4 / 31
Bayes’ Theorem
For events A and B:
P(A | B) = P(B | A) P(A) / P(B)
I P(A | B): conditional probability of event A given that B is true
I P(B | A): conditional probability of event B given that A is true
I P(A) and P(B): unconditional/marginal probabilities of events A and B
5 / 31
Example: Using Bayes’ Theorem
I A: person has a given condition
I B: person tests positive for condition using cheap test
I P(B | A), P(B | Ac): known from experimental testing
I P(A): known from study based on gold standard
I P(A | B) = P(B | A)P(A)/P(B): probability of having the condition given positive result in cheap test
6 / 31
Example: Using Bayes’ Theorem
I A: person has a given condition
I B: person tests positive for condition using cheap test
I P(B | A) = 0.99: sensitivity of the test
I P(B^c | A^c) = 0.99: specificity of the test
I P(A) = 0.01: rare condition
I Given that a generic person tests positive, what is the probability that they have the condition?
P(A | B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | A^c)P(A^c)]
         = (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99) = 0.5
7 / 31
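As a quick sanity check, here is a minimal Python sketch of the calculation above. The function name and the way the inputs are passed are illustrative choices; the sensitivity, specificity, and prevalence values come from the slide.

```python
# Minimal sketch of the diagnostic-test calculation above
# (sensitivity = specificity = 0.99, prevalence = 0.01, as on the slide).
def posterior_prob_condition(sens, spec, prev):
    """P(condition | positive test) via Bayes' theorem."""
    p_pos_given_cond = sens                # P(B | A)
    p_pos_given_no_cond = 1.0 - spec       # P(B | A^c)
    p_pos = p_pos_given_cond * prev + p_pos_given_no_cond * (1.0 - prev)  # P(B)
    return p_pos_given_cond * prev / p_pos

print(posterior_prob_condition(0.99, 0.99, 0.01))   # 0.5
```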
Bayes’ Theorem
In the above example, we are implicitly working with two binary variables
I Y = I (having medical condition)
I X = I (testing positive)
with a joint density p(X = x ,Y = y)
I Bayes’ theorem simply relates the conditional probabilities p(X = x | Y = y) and p(Y = y | X = x)
8 / 31
Bayes’ Theorem
In general
I Random vectors X and Y
I Joint density p(x, y)
I Relationship between conditionals is given by Bayes’ theorem
p(x | y) = p(y | x) p(x) / p(y)
I This is a mathematical fact about conditional probabilities/densities
I Applying Bayes’ theorem doesn’t make you Bayesian!
I The Bayesian approach arises from applying this theorem to obtain statistical inferences on model parameters!
9 / 31
Outline
Bayes’ Theorem
Bayesian Inference
10 / 31
Philosophical Motivation
I Bayesian paradigm: model parameters are treated as random variables to represent uncertainty about their true values
I Compare with the frequentist paradigm: model parameters are unknown constants
I Bayesian analysis requires:
I Likelihood function, coming from a model for the distribution of the data
I Prior distribution on model parameters, coming from previous knowledge about the phenomenon under study
I Prior distribution is updated using the likelihood via Bayes’ theorem, resulting in a posterior distribution
I Bayesian inferences are drawn from the posterior distribution (updated knowledge)
11 / 31
Practical Motivation
Bayesian machinery might make this approach appealing
I Quantification of uncertainty in large discrete parameter spaces
I Partitions in clustering problems
I Graphs in graphical models
I Binary vectors of variable inclusion in regression model selection
I Bipartite matchings in record linkage
I Frequentist approach calls for constructing confidence sets, whereas Bayesian approach works with a posterior distribution from which we can sample to approximate summaries of interest
I Under some conditions, posteriors behave like the sampling distribution of MLEs, but Bayesian machinery might be easier to implement than deriving an estimate of the asymptotic covariance matrix of the MLE
I Implementing a Monte Carlo EM algorithm to obtain MLEs is similar to implementing Data Augmentation (coming soon), but the latter readily provides you with measures of uncertainty
I Convenient in hierarchical/multilevel models – priors just add another level to the hierarchy
12 / 31
The Likelihood Function
Same as before:
I Z = (Z_1, . . . , Z_K): generic vector of study variables
I We work under a parametric model for the distribution of Z
{p(z | θ)}_θ,   θ = (θ_1, θ_2, . . . , θ_d)
I Data from random i.i.d. vectors {Z_i}_{i=1}^n ≡ Z
I Under our parametric model, the joint distribution of {Z_i}_{i=1}^n has a density function
p(z | θ) = ∏_{i=1}^n p(z_i | θ)
I This, seen as a function of θ, is the likelihood function
L(θ | z) = ∏_{i=1}^n p(z_i | θ)
13 / 31
The Prior Distribution
I Prior to observing the realizations of Z = {Z_i}_{i=1}^n, do we have any information on the parameters θ?
I Represent this prior information in terms of a distribution
p(θ)
14 / 31
The Posterior Distribution
Now, “simply” use Bayes’ theorem
p(θ | z) = L(θ | z) p(θ) / p(z)
         = L(θ | z) p(θ) / ∫ L(θ | z) p(θ) dθ
         ∝ L(θ | z) p(θ)
“That’s it!”
15 / 31
The Posterior Distribution
For simple problems, we typically have two ways of computing the posterior p(θ | z)
I Compute the integral ∫ L(θ | z) p(θ) dθ, and then compute
p(θ | z) = L(θ | z) p(θ) / ∫ L(θ | z) p(θ) dθ
I Stare at / manipulate the expression L(θ | z) p(θ) seen as a function of θ alone
I If L(θ | z) p(θ) = a(θ, z) b(z), then p(θ | z) ∝ a(θ, z)
I If a(θ, z) looks like a known distribution except for a constant, then we have identified the posterior
16 / 31
Example: Binomial Data, Beta Prior
Let Z | θ ∼ Binom(n, θ), and θ ∼ Beta(a, b)
I L(θ | z) = (n choose z) θ^z (1 − θ)^{n−z},   p(θ) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}
I The proportionality constant is
∫ L(θ | z) p(θ) dθ = [(n choose z) / B(a, b)] ∫ θ^{z+a−1} (1 − θ)^{n−z+b−1} dθ
                   = (n choose z) B(z + a, n − z + b) / B(a, b)
I And the posterior is
p(θ | z) = L(θ | z) p(θ) / ∫ L(θ | z) p(θ) dθ
         = (1 / B(z + a, n − z + b)) θ^{z+a−1} (1 − θ)^{n−z+b−1}
I Therefore, θ | z ∼ Beta(z + a, n − z + b)
17 / 31
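As a hedged sketch of the “compute the integral” route, the following Python snippet normalizes L(θ | z) p(θ) numerically and checks that the result matches the closed-form Beta(z + a, n − z + b) density. The numbers (n = 10, z = 8, a = b = 1) are illustrative only.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

n, z, a, b = 10, 8, 1, 1   # illustrative values

def unnormalized_posterior(theta):
    # L(theta | z) * p(theta): binomial likelihood times Beta(a, b) prior density
    return stats.binom.pmf(z, n, theta) * stats.beta.pdf(theta, a, b)

norm_const, _ = quad(unnormalized_posterior, 0.0, 1.0)   # integral of L * prior

theta = 0.7
numerical = unnormalized_posterior(theta) / norm_const
closed_form = stats.beta.pdf(theta, z + a, n - z + b)    # Beta(z + a, n - z + b)
print(numerical, closed_form)                            # the two values agree
```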
Example: Binomial Data, Beta Prior
Let Z | θ ∼ Binom(n, θ) and θ ∼ Beta(a, b)
I L(θ | z) = (n choose z) θ^z (1 − θ)^{n−z},   p(θ) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}
I We could also have noticed
p(θ | z) ∝ L(θ | z) p(θ)
         ∝ θ^{z+a−1} (1 − θ)^{n−z+b−1}
I This is the non-constant part (the kernel) of the density function of a beta random variable with parameters z + a and n − z + b, therefore θ | z ∼ Beta(z + a, n − z + b)
18 / 31
Example: Binomial Data, Beta Prior
To illustrate the idea, say:
I Someone is flipping a coin n = 10 times in an independent and identical fashion
I Number of heads Z ∼ Binomial(n, θ)
I What is the value of θ?
I We use a Beta(a, b) to express our prior belief on θ
I We observe Z = 8
19 / 31
Example: Binomial Data, Beta Prior
Possible scenarios of prior information on θ; posteriors with Z = 8:
I No idea of what θ could be: Beta(1,1)
I The person flipping the coin looks like a trickster: Beta(9,1)
I Coin flipping usually has 50/50 chance of heads/tails: Beta(100,100)
[Figure: prior and posterior densities of θ on (0, 1) for each of the three scenarios]
20 / 31
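The plots themselves are not reproduced in this transcript, but a short sketch like the following recovers the posterior in each scenario and could be used to redraw them; the grid, figure layout, and printed summaries are illustrative choices rather than the original code.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

n, z = 10, 8
priors = {"Beta(1,1)": (1, 1), "Beta(9,1)": (9, 1), "Beta(100,100)": (100, 100)}
theta = np.linspace(0.001, 0.999, 500)

for label, (a, b) in priors.items():
    prior = stats.beta(a, b)
    posterior = stats.beta(z + a, n - z + b)   # conjugate update
    plt.figure()
    plt.plot(theta, prior.pdf(theta), label="Prior")
    plt.plot(theta, posterior.pdf(theta), label="Posterior")
    plt.title(f"{label} prior, Z = {z} out of n = {n}")
    plt.xlabel("θ")
    plt.legend()
    print(label, "posterior mean:", posterior.mean())   # 0.75, 0.85, ~0.51
plt.show()
```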
Comments So Far
I Bayesian approach allows you to incorporate side information based on context
I Do you know what “flipping a coin” means?
I Is the person flipping the coin someone you trust?
I It seems expressing prior information works nicely with some parametric families
I Quantification of prior information can be tricky, especially for complicated models
21 / 31
Example: Multinomial Data, Dirichlet Prior
Continuing our example from the previous class:
I Let Z_i = (Z_{i1}, Z_{i2}), Z_{i1}, Z_{i2} ∈ {1, 2}, Z_i’s are i.i.d.,
p(Z_{i1} = k, Z_{i2} = l | θ) = π_{kl}
I θ = (. . . , π_{kl}, . . . ),   W_{ikl} = I(Z_{i1} = k, Z_{i2} = l)
I The likelihood of the study variables is
L(θ | z) = ∏_i ∏_{k,l} π_{kl}^{W_{ikl}} = ∏_{k,l} π_{kl}^{n_{kl}}
where
n_{kl} = ∑_i W_{ikl},   k, l ∈ {1, 2}
22 / 31
Example: Multinomial Data, Dirichlet Prior
I Inference on multinomial parameters is convenient using the Dirichlet prior
I θ = (. . . , π_{kl}, . . . ) ∼ Dirichlet(α),   α = (. . . , α_{kl}, . . . ),
p(θ) = [Γ(∑ α_{kl}) / ∏_{k,l} Γ(α_{kl})] ∏_{k,l} π_{kl}^{α_{kl}−1}
I The posterior is given by
p(θ | z) ∝ L(θ | z) p(θ) ∝ ∏_{k,l} π_{kl}^{n_{kl}+α_{kl}−1}
I Therefore, θ | z ∼ Dirichlet(α′),   α′ = (. . . , α_{kl} + n_{kl}, . . . )
23 / 31
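A minimal sketch of posterior sampling for this example, assuming illustrative complete-data cell counts and a symmetric Dirichlet(1, . . . , 1) prior (both are made up for the sake of the example):

```python
import numpy as np

# Hedged sketch: Dirichlet posterior for the 2x2 multinomial example.
n_kl = np.array([[30., 10.],
                 [20., 40.]])        # illustrative complete-data counts n_{kl}
alpha = np.ones_like(n_kl)           # symmetric Dirichlet prior

alpha_post = alpha + n_kl            # Dirichlet(alpha') posterior parameters
draws = np.random.dirichlet(alpha_post.ravel(), size=5000).reshape(5000, 2, 2)

print(draws.mean(axis=0))                    # posterior mean of each pi_{kl}
print(draws[:, 0, :].sum(axis=1).mean())     # posterior mean of pi_{1+}
```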
Comments on Priors
I There is no guarantee that the combination of arbitrary likelihoods and priors will lead to posteriors that are easy to work with
I Conjugate priors are distributions that lead to posteriors in the same family, as in the previous examples – typically easier to work with, but not always available
I Non-conjugate priors can be used, but we require more advanced techniques for handling them
I Lists of conjugate priors are available in multiple books and online resources
24 / 31
Example: Multivariate Normal
I Distribution of the data
Z = {Z_i}_{i=1}^n | µ, Λ  i.i.d.∼  Normal(µ, Λ^{−1})
where Z_i ∈ R^K, µ is the vector of means, Λ^{−1} is the covariance matrix, and Λ is the inverse covariance matrix (the precision matrix)
I Conjugate prior is constructed in two steps
µ | Λ ∼ Normal(µ_0, (κ_0 Λ)^{−1})
Λ ∼ Wishart(υ_0, W_0)
Joint distribution of (µ, Λ) is called Normal-Wishart. The parameterization is such that E(Λ) = υ_0 W_0.
25 / 31
Example: Multivariate Normal
Posterior is also Normal-Wishart
µ | Λ, z ∼ Normal(µ′, (κ′Λ)^{−1})
Λ | z ∼ Wishart(υ′, W′)
where
µ′ = (κ_0 µ_0 + n z̄) / κ′
κ′ = κ_0 + n
υ′ = υ_0 + n
W′ = {W_0^{−1} + n [Σ̂ + (κ_0/κ′)(z̄ − µ_0)(z̄ − µ_0)^T]}^{−1}
z̄ = ∑_{i=1}^n z_i / n
Σ̂ = ∑_{i=1}^n (z_i − z̄)(z_i − z̄)^T / n
26 / 31
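A hedged sketch of this Normal-Wishart update in code; the function name and argument order are illustrative, and `z` is assumed to be an (n, K) array of observations with prior parameters µ_0, κ_0, υ_0, W_0 supplied as NumPy objects.

```python
import numpy as np

# Hedged sketch of the Normal-Wishart posterior update above.
def normal_wishart_posterior(z, mu0, kappa0, nu0, W0):
    n, K = z.shape
    zbar = z.mean(axis=0)
    S = (z - zbar).T @ (z - zbar) / n          # Sigma-hat (MLE of the covariance)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * zbar) / kappa_n
    d = (zbar - mu0).reshape(-1, 1)
    W_n = np.linalg.inv(np.linalg.inv(W0) + n * (S + (kappa0 / kappa_n) * d @ d.T))
    return mu_n, kappa_n, nu_n, W_n
```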
Bayesian Point Estimation
I L(θ, θ′): loss of estimating a parameter to be θ′ when the true value is θ
I Bayes estimator minimizes the expected posterior loss
θ̂_Bayes = argmin_{θ′} ∫ L(θ, θ′) p(θ | z) dθ
I For univariate θ it is common to choose
I L(θ, θ′) = (θ − θ′)^2 =⇒ θ̂_Bayes is posterior mean
I L(θ, θ′) = |θ − θ′| =⇒ θ̂_Bayes is posterior median
I L(θ, θ′) = I(θ ≠ θ′) =⇒ θ̂_Bayes is posterior mode
27 / 31
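A minimal sketch of these three point estimates for the coin example, assuming z = 8 heads out of n = 10 with a flat Beta(1, 1) prior, so θ | z ∼ Beta(9, 3):

```python
from scipy import stats

# Bayes point estimates under the three losses, for theta | z ~ Beta(9, 3)
a_post, b_post = 8 + 1, 10 - 8 + 1
post = stats.beta(a_post, b_post)

mean = post.mean()                           # squared-error loss
median = post.median()                       # absolute-error loss
mode = (a_post - 1) / (a_post + b_post - 2)  # 0-1 loss (Beta mode, for a, b > 1)
print(mean, median, mode)                    # 0.75, ~0.76, 0.8
```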
Bayesian Credible Sets/Intervals
I C is a (1 − α)100% credible set if ∫_C p(θ | z) dθ ≥ 1 − α
I For univariate θ, define the (1 − α)100% credible interval C as the interval within the α/2 and 1 − α/2 quantiles of the posterior p(θ | z)
28 / 31
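For the same coin example (θ | z ∼ Beta(9, 3)), a hedged sketch of an equal-tailed credible interval built from posterior quantiles:

```python
from scipy import stats

# 95% equal-tailed credible interval for theta | z ~ Beta(9, 3)
alpha = 0.05
post = stats.beta(9, 3)
ci = post.ppf([alpha / 2, 1 - alpha / 2])   # alpha/2 and 1 - alpha/2 quantiles
print(ci)                                   # roughly (0.48, 0.94)
```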
Asymptotic Behavior of Posteriors: Bernstein - von Mises
Bernstein - von Mises theorem [1]
I Under some conditions, the posterior distribution asymptotically behaves like the sampling distribution of the MLE
I Heuristically, we say
p(θ | z) ≈ N(θ̂_MLE, I(θ̂_MLE)^{−1}/n)
I Therefore, for well-behaved models and with a good amount of data, Bayesian and frequentist inferences will be similar
[1] For details, see lecture notes of Richard Nickl: http://www.statslab.cam.ac.uk/~nickl/Site/__files/stat2013.pdf
29 / 31
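A hedged sketch of this heuristic for the binomial example: with a reasonably large sample and a flat prior, the exact Beta posterior should be close to the normal approximation built from the MLE and the Fisher information. The sample size, data, and grid below are illustrative.

```python
import numpy as np
from scipy import stats

# Bernstein - von Mises heuristic for the binomial example:
# z = 80 heads in n = 100 flips, Beta(1, 1) prior, exact posterior Beta(81, 21).
n, z = 100, 80
theta_mle = z / n
info = 1 / (theta_mle * (1 - theta_mle))          # unit Fisher information (Bernoulli)
approx = stats.norm(theta_mle, np.sqrt(1 / (info * n)))
exact = stats.beta(z + 1, n - z + 1)

grid = np.linspace(0.6, 0.95, 8)
print(np.round(exact.pdf(grid), 2))
print(np.round(approx.pdf(grid), 2))              # similar values for moderately large n
```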
Missing Data and Bayes
I With missing data, things get complicated
L_obs(θ, ψ | z_(r), r) = ∏_{i=1}^n ∫_{Z_(r̄_i)} p(r_i | z_i, ψ) p(z_i | θ) dz_{i(r̄_i)}
Under MAR,
L_obs(θ, ψ | z_(r), r) = [∏_{i=1}^n p(r_i | z_{i(r_i)}, ψ)] × [∏_{i=1}^n ∫_{Z_(r̄_i)} p(z_i | θ) dz_{i(r̄_i)}]
where the first factor can be ignored and the second factor is the likelihood for θ under MAR
I Under a Bayesian approach, we need to obtain
p(θ, ψ | z_(r), r) ∝ L_obs(θ, ψ | z_(r), r) p(θ, ψ)
I Under ignorability (MAR + separability + θ ⊥⊥ ψ a priori), we need
p(θ | z_(r)) ∝ L_obs(θ | z_(r)) p(θ)
I Integrals in L_obs(θ | z_(r)) complicate things – what to do?
30 / 31
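To make the ignorable case concrete for the 2 × 2 example, here is a hedged sketch of the unnormalized log posterior log L_obs(θ | z_(r)) + log p(θ) with a Dirichlet prior. The count arrays follow the notation from the EM recap (n11, n10, n01), fully missing cases drop out of the observed-data likelihood, and the function name is illustrative.

```python
import numpy as np

# Hedged sketch for the 2x2 example: unnormalized log-posterior of theta under
# ignorability, log Lobs(theta | z_(r)) + log p(theta), with a Dirichlet(alpha) prior.
# n11: 2x2 counts with both variables observed; n10: length-2 counts with only Z1
# observed; n01: length-2 counts with only Z2 observed.
def log_posterior_unnormalized(pi, n11, n10, n01, alpha):
    loglik = (np.sum(n11 * np.log(pi))
              + np.sum(n10 * np.log(pi.sum(axis=1)))    # marginal pi_{k+}
              + np.sum(n01 * np.log(pi.sum(axis=0))))   # marginal pi_{+l}
    logprior = np.sum((alpha - 1) * np.log(pi))         # Dirichlet kernel
    return loglik + logprior
```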
Summary
Main take-aways from today’s lecture:
I Using Bayes’ theorem doesn’t make you a Bayesian
I Bayesian inference offers an alternative framework for deriving inferences from data
I Philosophical motivation: inclusion of prior belief or knowledge, uncertainty quantification in terms of distributions for parameters
I Practical motivation: convenient in some problems, might lead to good frequentist performance
I Complex problems are computationally challenging – posterior needs to be approximated (e.g., Markov chain Monte Carlo)
Next lecture:
I Gibbs sampling
I Data augmentation
I Introduction to multiple imputation
31 / 31