TRANSCRIPT
Statistical Methods for Analysis with Missing Data
Lecture 8: Introduction to Bayesian Inference
Mauricio Sadinle
Department of Biostatistics
1 / 31
Previous Lecture
EM algorithm for maximum likelihood estimation with missing data:
I Derivation and implementation of EM algorithm for categorical data
I For two categorical variables, EM under MAR is intuitive and simple
π_{kl}^{(t+1)} = (1/n) ( n_{11kl} + (π_{kl}^{(t)} / π_{k+}^{(t)}) n_{10k+} + (π_{kl}^{(t)} / π_{+l}^{(t)}) n_{01+l} + π_{kl}^{(t)} n_{00++} ),
where
n_{11kl} = ∑_i W_{ikl} I(r_i = 11),   n_{10k+} = ∑_i W_{ik+} I(r_i = 10),
n_{01+l} = ∑_i W_{i+l} I(r_i = 01),   n_{00++} = ∑_i I(r_i = 00)
I Bootstrap confidence intervals
2 / 31
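As a reminder of how this update looks in practice, here is a minimal Python sketch of the recapped EM step for the 2 × 2 case. The array layout of the counts (n11 as a 2 × 2 matrix, n10 and n01 as length-2 vectors, n00 as a scalar) is an assumption made purely for illustration.

```python
import numpy as np

# Hedged sketch of the EM update above for two binary variables.
# Assumed inputs: n11 is a 2x2 array of complete-case counts n_{11kl};
# n10 is a length-2 array of counts n_{10k+} (only Z1 observed);
# n01 is a length-2 array of counts n_{01+l} (only Z2 observed);
# n00 is the number of fully missing cases.
def em_update(pi, n11, n10, n01, n00):
    n = n11.sum() + n10.sum() + n01.sum() + n00
    row = pi.sum(axis=1, keepdims=True)      # pi_{k+}
    col = pi.sum(axis=0, keepdims=True)      # pi_{+l}
    expected = (n11
                + pi / row * n10[:, None]    # distribute n_{10k+} across l
                + pi / col * n01[None, :]    # distribute n_{01+l} across k
                + pi * n00)                  # distribute n_{00++} across k, l
    return expected / n
```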
Today’s Lecture
I Introduction to Bayesian inference. Why?
I Technique called multiple imputation is derived from a Bayesian point of view
I Variation of multiple imputation called multiple imputation by chained equations imitates Gibbs sampling – a procedure commonly used for Bayesian inference
I Bayesian inference has its own ways of handling missing data
3 / 31
Outline
Bayes’ Theorem
Bayesian Inference
4 / 31
Bayes’ Theorem
For events A and B:
P(A | B) = P(B | A) P(A) / P(B)
I P(A | B): conditional probability of event A given that B is true
I P(B | A): conditional probability of event B given that A is true
I P(A) and P(B): unconditional/marginal probabilities of events A and B
5 / 31
Example: Using Bayes’ Theorem
I A: person has a given condition
I B: person tests positive for condition using cheap test
I P(B | A), P(B | Ac): known from experimental testing
I P(A): known from study based on gold standard
I P(A | B) = P(B | A)P(A)/P(B): probability of having the condition given positive result in cheap test
6 / 31
Example: Using Bayes’ Theorem
I A: person has a given condition
I B: person tests positive for condition using cheap test
I P(B | A) = 0.99: sensitivity of the test
I P(B^c | A^c) = 0.99: specificity of the test
I P(A) = 0.01: rare condition
I Given that a generic person tests positive, what is the probability that they have the condition?
P(A | B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | A^c)P(A^c)]
         = (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99) = 0.5
7 / 31
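As a quick sanity check, here is a minimal Python sketch of the calculation above. The function name and the way the inputs are passed are illustrative choices; the sensitivity, specificity, and prevalence values come from the slide.

```python
# Minimal sketch of the diagnostic-test calculation above
# (sensitivity = specificity = 0.99, prevalence = 0.01, as on the slide).
def posterior_prob_condition(sens, spec, prev):
    """P(condition | positive test) via Bayes' theorem."""
    p_pos_given_cond = sens                # P(B | A)
    p_pos_given_no_cond = 1.0 - spec       # P(B | A^c)
    p_pos = p_pos_given_cond * prev + p_pos_given_no_cond * (1.0 - prev)  # P(B)
    return p_pos_given_cond * prev / p_pos

print(posterior_prob_condition(0.99, 0.99, 0.01))   # 0.5
```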
Bayes’ Theorem
In the above example, we are implicitly working with two binary variables
I Y = I (having medical condition)
I X = I (testing positive)
with a joint density p(X = x ,Y = y)
I Bayes’ theorem simply relates the conditional probabilities p(X = x | Y = y) and p(Y = y | X = x)
8 / 31
Bayes’ Theorem
In general
I Random vectors X and Y
I Joint density p(x, y)
I Relationship between conditionals is given by Bayes’ theorem
p(x | y) = p(y | x) p(x) / p(y)
I This is a mathematical fact about conditional probabilities/densities
I Applying Bayes’ theorem doesn’t make you Bayesian!
I The Bayesian approach arises from applying this theorem to obtain statistical inferences on model parameters!
9 / 31
Outline
Bayes’ Theorem
Bayesian Inference
10 / 31
Philosophical Motivation
I Bayesian paradigm: model parameters are treated as random variables to represent uncertainty about their true values
I Compare with the frequentist paradigm: model parameters are unknown constants
I Bayesian analysis requires:
I Likelihood function, coming from a model for the distribution of the data
I Prior distribution on model parameters, coming from previous knowledge about the phenomenon under study
I Prior distribution is updated using the likelihood via Bayes’ theorem, resulting in a posterior distribution
I Bayesian inferences are drawn from the posterior distribution (updated knowledge)
11 / 31
Practical Motivation
Bayesian machinery might make this approach appealing
I Quantification of uncertainty in large discrete parameter spaces
I Partitions in clustering problems
I Graphs in graphical models
I Binary vectors of variable inclusion in regression model selection
I Bipartite matchings in record linkage
I Frequentist approach calls for constructing confidence sets, whereas Bayesian approach works with a posterior distribution from which we can sample to approximate summaries of interest
I Under some conditions, posteriors behave like the sampling distribution of MLEs, but Bayesian machinery might be easier to implement than deriving an estimate of the asymptotic covariance matrix of the MLE
I Implementing a Monte Carlo EM algorithm to obtain MLEs is similar to implementing Data Augmentation (coming soon), but the latter readily provides you with measures of uncertainty
I Convenient in hierarchical/multilevel models – priors just add another level to the hierarchy
12 / 31
The Likelihood Function
Same as before:
I Z = (Z_1, . . . , Z_K): generic vector of study variables
I We work under a parametric model for the distribution of Z
{p(z | θ)}_θ,   θ = (θ_1, θ_2, . . . , θ_d)
I Data from random i.i.d. vectors {Z_i}_{i=1}^n ≡ Z
I Under our parametric model, the joint distribution of {Z_i}_{i=1}^n has a density function
p(z | θ) = ∏_{i=1}^n p(z_i | θ)
I This, seen as a function of θ, is the likelihood function
L(θ | z) = ∏_{i=1}^n p(z_i | θ)
13 / 31
The Prior Distribution
I Prior to observing the realizations of Z = {Z_i}_{i=1}^n, do we have any information on the parameters θ?
I Represent this prior information in terms of a distribution
p(θ)
14 / 31
The Posterior Distribution
Now, “simply” use Bayes’ theorem
p(θ | z) = L(θ | z) p(θ) / p(z)
         = L(θ | z) p(θ) / ∫ L(θ | z) p(θ) dθ
         ∝ L(θ | z) p(θ)
“That’s it!”
15 / 31
The Posterior Distribution
For simple problems, we typically have two ways of computing the posterior p(θ | z)
I Compute the integral ∫ L(θ | z) p(θ) dθ, and then compute
p(θ | z) = L(θ | z) p(θ) / ∫ L(θ | z) p(θ) dθ
I Stare at / manipulate the expression L(θ | z) p(θ) seen as a function of θ alone
I If L(θ | z) p(θ) = a(θ, z) b(z), then p(θ | z) ∝ a(θ, z)
I If a(θ, z) looks like a known distribution except for a constant, then we have identified the posterior
16 / 31
Example: Binomial Data, Beta Prior
Let Z | θ ∼ Binom(n, θ), and θ ∼ Beta(a, b)
I L(θ | z) = (n choose z) θ^z (1 − θ)^{n−z},   p(θ) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}
I The proportionality constant is
∫ L(θ | z) p(θ) dθ = [(n choose z) / B(a, b)] ∫ θ^{z+a−1} (1 − θ)^{n−z+b−1} dθ
                   = (n choose z) B(z + a, n − z + b) / B(a, b)
I And the posterior is
p(θ | z) = L(θ | z) p(θ) / ∫ L(θ | z) p(θ) dθ
         = (1 / B(z + a, n − z + b)) θ^{z+a−1} (1 − θ)^{n−z+b−1}
I Therefore, θ | z ∼ Beta(z + a, n − z + b)
17 / 31
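As a hedged sketch of the “compute the integral” route, the following Python snippet normalizes L(θ | z) p(θ) numerically and checks that the result matches the closed-form Beta(z + a, n − z + b) density. The numbers (n = 10, z = 8, a = b = 1) are illustrative only.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

n, z, a, b = 10, 8, 1, 1   # illustrative values

def unnormalized_posterior(theta):
    # L(theta | z) * p(theta): binomial likelihood times Beta(a, b) prior density
    return stats.binom.pmf(z, n, theta) * stats.beta.pdf(theta, a, b)

norm_const, _ = quad(unnormalized_posterior, 0.0, 1.0)   # integral of L * prior

theta = 0.7
numerical = unnormalized_posterior(theta) / norm_const
closed_form = stats.beta.pdf(theta, z + a, n - z + b)    # Beta(z + a, n - z + b)
print(numerical, closed_form)                            # the two values agree
```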
Example: Binomial Data, Beta Prior
Let Z | θ ∼ Binom(n, θ) and θ ∼ Beta(a, b)
I L(θ | z) = (n choose z) θ^z (1 − θ)^{n−z},   p(θ) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}
I We could also have noticed
p(θ | z) ∝ L(θ | z) p(θ)
         ∝ θ^{z+a−1} (1 − θ)^{n−z+b−1}
I This is the non-constant part (the kernel) of the density function of a beta random variable with parameters z + a and n − z + b, therefore θ | z ∼ Beta(z + a, n − z + b)
18 / 31
Example: Binomial Data, Beta Prior
To illustrate the idea, say:
I Someone is flipping a coin n = 10 times in an independent and identical fashion
I Number of heads Z ∼ Binomial(n, θ)
I What is the value of θ?
I We use a Beta(a, b) to express our prior belief on θ
I We observe Z = 8
19 / 31
Example: Binomial Data, Beta Prior
Possible scenarios of prior information on θ; posteriors with Z = 8:
I No idea of what θ could be: Beta(1,1)
I The person flipping the coin looks like a trickster: Beta(9,1)
I Coin flipping usually has 50/50 chance of heads/tails: Beta(100,100)
[Figure: prior and posterior densities of θ on (0, 1) for each of the three scenarios]
20 / 31
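The plots themselves are not reproduced in this transcript, but a short sketch like the following recovers the posterior in each scenario and could be used to redraw them; the grid, figure layout, and printed summaries are illustrative choices rather than the original code.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

n, z = 10, 8
priors = {"Beta(1,1)": (1, 1), "Beta(9,1)": (9, 1), "Beta(100,100)": (100, 100)}
theta = np.linspace(0.001, 0.999, 500)

for label, (a, b) in priors.items():
    prior = stats.beta(a, b)
    posterior = stats.beta(z + a, n - z + b)   # conjugate update
    plt.figure()
    plt.plot(theta, prior.pdf(theta), label="Prior")
    plt.plot(theta, posterior.pdf(theta), label="Posterior")
    plt.title(f"{label} prior, Z = {z} out of n = {n}")
    plt.xlabel("θ")
    plt.legend()
    print(label, "posterior mean:", posterior.mean())   # 0.75, 0.85, ~0.51
plt.show()
```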
Comments So Far
I Bayesian approach allows you to incorporate side information based on context
I Do you know what “flipping a coin” means?
I Is the person flipping the coin someone you trust?
I It seems expressing prior information works nicely with some parametric families
I Quantification of prior information can be tricky, especially for complicated models
21 / 31
Example: Multinomial Data, Dirichlet Prior
Continuing our example from the previous class:
I Let Z_i = (Z_{i1}, Z_{i2}), Z_{i1}, Z_{i2} ∈ {1, 2}, Z_i’s are i.i.d.,
p(Z_{i1} = k, Z_{i2} = l | θ) = π_{kl}
I θ = (. . . , π_{kl}, . . . ),   W_{ikl} = I(Z_{i1} = k, Z_{i2} = l)
I The likelihood of the study variables is
L(θ | z) = ∏_i ∏_{k,l} π_{kl}^{W_{ikl}} = ∏_{k,l} π_{kl}^{n_{kl}}
where
n_{kl} = ∑_i W_{ikl},   k, l ∈ {1, 2}
22 / 31
Example: Multinomial Data, Dirichlet Prior
I Inference on multinomial parameters is convenient using the Dirichlet prior
I θ = (. . . , π_{kl}, . . . ) ∼ Dirichlet(α),   α = (. . . , α_{kl}, . . . ),
p(θ) = [Γ(∑ α_{kl}) / ∏_{k,l} Γ(α_{kl})] ∏_{k,l} π_{kl}^{α_{kl}−1}
I The posterior is given by
p(θ | z) ∝ L(θ | z) p(θ) ∝ ∏_{k,l} π_{kl}^{n_{kl}+α_{kl}−1}
I Therefore, θ | z ∼ Dirichlet(α′),   α′ = (. . . , α_{kl} + n_{kl}, . . . )
23 / 31
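A minimal sketch of posterior sampling for this example, assuming illustrative complete-data cell counts and a symmetric Dirichlet(1, . . . , 1) prior (both are made up for the sake of the example):

```python
import numpy as np

# Hedged sketch: Dirichlet posterior for the 2x2 multinomial example.
n_kl = np.array([[30., 10.],
                 [20., 40.]])        # illustrative complete-data counts n_{kl}
alpha = np.ones_like(n_kl)           # symmetric Dirichlet prior

alpha_post = alpha + n_kl            # Dirichlet(alpha') posterior parameters
draws = np.random.dirichlet(alpha_post.ravel(), size=5000).reshape(5000, 2, 2)

print(draws.mean(axis=0))                    # posterior mean of each pi_{kl}
print(draws[:, 0, :].sum(axis=1).mean())     # posterior mean of pi_{1+}
```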
Comments on Priors
I There is no guarantee that the combination of arbitrary likelihoods and priors will lead to posteriors that are easy to work with
I Conjugate priors are distributions that lead to posteriors in the same family, as in the previous examples – typically easier to work with, but not always available
I Non-conjugate priors can be used, but we require more advanced techniques for handling them
I Lists of conjugate priors are available in multiple books and online resources
24 / 31
Example: Multivariate Normal
I Distribution of the data
Z = {Z_i}_{i=1}^n | µ, Λ  i.i.d.∼  Normal(µ, Λ^{−1})
where Z_i ∈ R^K, µ is the vector of means, Λ^{−1} is the covariance matrix, and Λ is the inverse covariance matrix (the precision matrix)
I Conjugate prior is constructed in two steps
µ | Λ ∼ Normal(µ_0, (κ_0 Λ)^{−1})
Λ ∼ Wishart(υ_0, W_0)
Joint distribution of (µ, Λ) is called Normal-Wishart. The parameterization is such that E(Λ) = υ_0 W_0.
25 / 31
Example: Multivariate Normal
Posterior is also Normal-Wishart
µ | Λ, z ∼ Normal(µ′, (κ′Λ)^{−1})
Λ | z ∼ Wishart(υ′, W′)
where
µ′ = (κ_0 µ_0 + n z̄) / κ′
κ′ = κ_0 + n
υ′ = υ_0 + n
W′ = {W_0^{−1} + n [Σ̂ + (κ_0/κ′)(z̄ − µ_0)(z̄ − µ_0)^T]}^{−1}
z̄ = ∑_{i=1}^n z_i / n
Σ̂ = ∑_{i=1}^n (z_i − z̄)(z_i − z̄)^T / n
26 / 31
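A hedged sketch of this Normal-Wishart update in code; the function name and argument order are illustrative, and `z` is assumed to be an (n, K) array of observations with prior parameters µ_0, κ_0, υ_0, W_0 supplied as NumPy objects.

```python
import numpy as np

# Hedged sketch of the Normal-Wishart posterior update above.
def normal_wishart_posterior(z, mu0, kappa0, nu0, W0):
    n, K = z.shape
    zbar = z.mean(axis=0)
    S = (z - zbar).T @ (z - zbar) / n          # Sigma-hat (MLE of the covariance)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * zbar) / kappa_n
    d = (zbar - mu0).reshape(-1, 1)
    W_n = np.linalg.inv(np.linalg.inv(W0) + n * (S + (kappa0 / kappa_n) * d @ d.T))
    return mu_n, kappa_n, nu_n, W_n
```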
Bayesian Point Estimation
I L(θ, θ′): loss of estimating a parameter to be θ′ when the true value is θ
I Bayes estimator minimizes the expected posterior loss
θ̂_Bayes = argmin_{θ′} ∫ L(θ, θ′) p(θ | z) dθ
I For univariate θ it is common to choose
I L(θ, θ′) = (θ − θ′)^2 =⇒ θ̂_Bayes is posterior mean
I L(θ, θ′) = |θ − θ′| =⇒ θ̂_Bayes is posterior median
I L(θ, θ′) = I(θ ≠ θ′) =⇒ θ̂_Bayes is posterior mode
27 / 31
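A minimal sketch of these three point estimates for the coin example, assuming z = 8 heads out of n = 10 with a flat Beta(1, 1) prior, so θ | z ∼ Beta(9, 3):

```python
from scipy import stats

# Bayes point estimates under the three losses, for theta | z ~ Beta(9, 3)
a_post, b_post = 8 + 1, 10 - 8 + 1
post = stats.beta(a_post, b_post)

mean = post.mean()                           # squared-error loss
median = post.median()                       # absolute-error loss
mode = (a_post - 1) / (a_post + b_post - 2)  # 0-1 loss (Beta mode, for a, b > 1)
print(mean, median, mode)                    # 0.75, ~0.76, 0.8
```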
Bayesian Credible Sets/Intervals
I C is a (1 − α)100% credible set if ∫_C p(θ | z) dθ ≥ 1 − α
I For univariate θ, define the (1 − α)100% credible interval C as the interval within the α/2 and 1 − α/2 quantiles of the posterior p(θ | z)
28 / 31
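For the same coin example (θ | z ∼ Beta(9, 3)), a hedged sketch of an equal-tailed credible interval built from posterior quantiles:

```python
from scipy import stats

# 95% equal-tailed credible interval for theta | z ~ Beta(9, 3)
alpha = 0.05
post = stats.beta(9, 3)
ci = post.ppf([alpha / 2, 1 - alpha / 2])   # alpha/2 and 1 - alpha/2 quantiles
print(ci)                                   # roughly (0.48, 0.94)
```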
Asymptotic Behavior of Posteriors: Bernstein - von Mises
Bernstein - von Mises theorem [1]
I Under some conditions, the posterior distribution asymptotically behaves like the sampling distribution of the MLE
I Heuristically, we say
p(θ | z) ≈ N(θ̂_MLE, I(θ̂_MLE)^{−1}/n)
I Therefore, for well-behaved models and with a good amount of data, Bayesian and frequentist inferences will be similar
[1] For details, see lecture notes of Richard Nickl: http://www.statslab.cam.ac.uk/~nickl/Site/__files/stat2013.pdf
29 / 31
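A hedged sketch of this heuristic for the binomial example: with a reasonably large sample and a flat prior, the exact Beta posterior should be close to the normal approximation built from the MLE and the Fisher information. The sample size, data, and grid below are illustrative.

```python
import numpy as np
from scipy import stats

# Bernstein - von Mises heuristic for the binomial example:
# z = 80 heads in n = 100 flips, Beta(1, 1) prior, exact posterior Beta(81, 21).
n, z = 100, 80
theta_mle = z / n
info = 1 / (theta_mle * (1 - theta_mle))          # unit Fisher information (Bernoulli)
approx = stats.norm(theta_mle, np.sqrt(1 / (info * n)))
exact = stats.beta(z + 1, n - z + 1)

grid = np.linspace(0.6, 0.95, 8)
print(np.round(exact.pdf(grid), 2))
print(np.round(approx.pdf(grid), 2))              # similar values for moderately large n
```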
Missing Data and Bayes
I With missing data, things get complicated
L_obs(θ, ψ | z_(r), r) = ∏_{i=1}^n ∫_{Z_(r̄_i)} p(r_i | z_i, ψ) p(z_i | θ) dz_{i(r̄_i)}
Under MAR,
L_obs(θ, ψ | z_(r), r) = [∏_{i=1}^n p(r_i | z_{i(r_i)}, ψ)] × [∏_{i=1}^n ∫_{Z_(r̄_i)} p(z_i | θ) dz_{i(r̄_i)}]
where the first factor can be ignored and the second factor is the likelihood for θ under MAR
I Under a Bayesian approach, we need to obtain
p(θ, ψ | z_(r), r) ∝ L_obs(θ, ψ | z_(r), r) p(θ, ψ)
I Under ignorability (MAR + separability + θ ⊥⊥ ψ a priori), we need
p(θ | z_(r)) ∝ L_obs(θ | z_(r)) p(θ)
I Integrals in L_obs(θ | z_(r)) complicate things – what to do?
30 / 31
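To make the ignorable case concrete for the 2 × 2 example, here is a hedged sketch of the unnormalized log posterior log L_obs(θ | z_(r)) + log p(θ) with a Dirichlet prior. The count arrays follow the notation from the EM recap (n11, n10, n01), fully missing cases drop out of the observed-data likelihood, and the function name is illustrative.

```python
import numpy as np

# Hedged sketch for the 2x2 example: unnormalized log-posterior of theta under
# ignorability, log Lobs(theta | z_(r)) + log p(theta), with a Dirichlet(alpha) prior.
# n11: 2x2 counts with both variables observed; n10: length-2 counts with only Z1
# observed; n01: length-2 counts with only Z2 observed.
def log_posterior_unnormalized(pi, n11, n10, n01, alpha):
    loglik = (np.sum(n11 * np.log(pi))
              + np.sum(n10 * np.log(pi.sum(axis=1)))    # marginal pi_{k+}
              + np.sum(n01 * np.log(pi.sum(axis=0))))   # marginal pi_{+l}
    logprior = np.sum((alpha - 1) * np.log(pi))         # Dirichlet kernel
    return loglik + logprior
```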
Summary
Main take-aways from today’s lecture:
I Using Bayes’ theorem doesn’t make you a Bayesian
I Bayesian inference offers an alternative framework for deriving inferences from data
I Philosophical motivation: inclusion of prior belief or knowledge, uncertainty quantification in terms of distributions for parameters
I Practical motivation: convenient in some problems, might lead to good frequentist performance
I Complex problems are computationally challenging – posterior needs to be approximated (e.g., Markov chain Monte Carlo)
Next lecture:
I Gibbs sampling
I Data augmentation
I Introduction to multiple imputation
31 / 31