EC 6310:
Advanced Econometric Theory
July 2008
Slides for Lecture on
Bayesian Computation in the Nonlinear Regression Model
Gary Koop, University of Strathclyde
1 Summary
• Readings: Chapter 5 of textbook.
• Nonlinear regression model is of interest in its own right, but also will allow us to introduce some widely useful Bayesian computational tools:
• Metropolis-Hastings algorithms (a way of doing posterior simulation).
• Posterior predictive p-values (a way of comparing models which does not involve marginal likelihoods).
• Gelfand-Dey method of marginal likelihood calculation.
2 The Nonlinear Regression Model
• Researchers typically work with the linear regression model:
$$y_i = \beta_1 + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i.$$
• In some cases nonlinear models can be made linear by transformation.
• For instance:
$$y_i = \beta_1 x_{i2}^{\beta_2} \cdots x_{ik}^{\beta_k}$$
can be logged to produce a linear functional form:
$$\ln(y_i) = \alpha_1 + \beta_2 \ln(x_{i2}) + \dots + \beta_k \ln(x_{ik}) + \varepsilon_i,$$
where $\alpha_1 = \ln(\beta_1)$.
• But some functional forms are intrinsically nonlinear.
• E.g. constant elasticity of substitution (CES) production function:
$$y_i = \left( \sum_{j=1}^{k} \gamma_j x_{ij}^{\gamma_{k+1}} \right)^{\frac{1}{\gamma_{k+1}}}.$$
• No way to transform CES to make it linear.
• Nonlinear regression model:
$$y_i = \left( \sum_{j=1}^{k} \gamma_j x_{ij}^{\gamma_{k+1}} \right)^{\frac{1}{\gamma_{k+1}}} + \varepsilon_i.$$
• General form:
$$y = f(X, \gamma) + \varepsilon,$$
where $y$, $X$ and $\varepsilon$ are defined as in the linear regression model (i.e. $\varepsilon$ is $N(0_N, h^{-1} I_N)$).
• $f(X, \gamma)$ is an $N$-vector of functions with typical element $f(X_i, \gamma)$.
• Properties of the Normal distribution give us the likelihood function:
$$p(y \mid \gamma, h) = \frac{h^{N/2}}{(2\pi)^{N/2}} \exp\left[-\frac{h}{2}\,\{y - f(X,\gamma)\}'\{y - f(X,\gamma)\}\right].$$
• Prior: any can be used, so let us just call it $p(\gamma, h)$.
• Posterior is proportional to likelihood times prior:
$$p(\gamma, h \mid y) \propto p(\gamma, h)\, \frac{h^{N/2}}{(2\pi)^{N/2}} \exp\left[-\frac{h}{2}\,\{y - f(X,\gamma)\}'\{y - f(X,\gamma)\}\right].$$
• No way to simplify this expression or recognize it as having a familiar form (e.g. it is not a Normal or t-distribution, etc.).
• How to do posterior simulation? Importance sampling is one possibility, but here we introduce another: Metropolis-Hastings.
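Although the posterior has no standard form, its kernel is easy to evaluate, which is all that posterior simulation requires. A minimal sketch in Python (the function names and the improper prior $p(\gamma, h) \propto 1/h$ are illustrative choices, not the textbook's):

```python
import numpy as np

def ces_f(X, gamma):
    """CES mean function: (sum_j gamma_j * x_ij^gamma_{k+1})^(1/gamma_{k+1})."""
    k = X.shape[1]
    coefs, power = gamma[:k], gamma[k]
    return (X ** power @ coefs) ** (1.0 / power)

def log_posterior_kernel(gamma, h, y, X):
    """log p(gamma, h | y) up to an additive constant.

    Uses the improper prior p(gamma, h) = 1/h purely for illustration;
    any prior p(gamma, h) the researcher prefers can be substituted.
    """
    N = y.shape[0]
    resid = y - ces_f(X, gamma)
    log_lik = 0.5 * N * np.log(h) - 0.5 * h * (resid @ resid)
    log_prior = -np.log(h)
    return log_lik + log_prior
```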
3 The Metropolis-Hastings Algorithm
• Notation: $\theta$ is a vector of parameters and $p(y \mid \theta)$, $p(\theta)$ and $p(\theta \mid y)$ are the likelihood, prior and posterior, respectively.
• Metropolis-Hastings algorithm takes draws from a convenient candidate generating density. Let $\theta^*$ indicate a draw taken from this density, which we denote as $q(\theta^{(s-1)}; \theta)$.
• Notation: $\theta^*$ is a draw taken of the random variable $\theta$ whose density depends on $\theta^{(s-1)}$.
• Notation: like the Gibbs sampler (but unlike importance sampling), the current draw depends on the previous draw. A "chain of draws" is produced. Thus, "Markov Chain Monte Carlo (MCMC)".
• Importance sampling corrects for the fact that the importance function differs from the posterior by weighting the draws differently from one another. With Metropolis-Hastings, we weight all draws equally, but not all the candidate draws are accepted.
The Metropolis-Hastings algorithm always takes the following form:
Step 1: Choose a starting value, $\theta^{(0)}$.
Step 2: Take a candidate draw, $\theta^*$, from the candidate generating density, $q(\theta^{(s-1)}; \theta)$.
Step 3: Calculate an acceptance probability, $\alpha(\theta^{(s-1)}, \theta^*)$.
Step 4: Set $\theta^{(s)} = \theta^*$ with probability $\alpha(\theta^{(s-1)}, \theta^*)$ and set $\theta^{(s)} = \theta^{(s-1)}$ with probability $1 - \alpha(\theta^{(s-1)}, \theta^*)$.
Step 5: Repeat Steps 2, 3 and 4 $S$ times.
Step 6: Take the average of the $S$ draws $g(\theta^{(1)}), \dots, g(\theta^{(S)})$.
These steps will yield an estimate of $E[g(\theta) \mid y]$ for any function of interest.
• Note: As with Gibbs sampling, the Metropolis-Hastings algorithm usually requires the choice of a starting value, $\theta^{(0)}$. To make sure that the effect of this starting value has vanished, it is usually wise to discard $S_0$ initial draws.
• Intuition for acceptance probability, $\alpha(\theta^{(s-1)}, \theta^*)$, given in textbook (pages 93-94):
$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^* \mid y)\, q(\theta^*; \theta = \theta^{(s-1)})}{p(\theta = \theta^{(s-1)} \mid y)\, q(\theta^{(s-1)}; \theta = \theta^*)},\; 1 \right].$$
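Putting Steps 1-6 and this acceptance probability together, a generic sketch in Python; the helper names log_post, q_draw and log_q are mine, standing in for whatever posterior kernel and candidate generating density the researcher supplies, and the computation is done on the log scale for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_post, q_draw, log_q, theta0, S, S0=1000):
    """Generic Metropolis-Hastings sampler.

    log_post(theta):  log posterior kernel, log p(theta | y) + constant
    q_draw(theta):    a candidate draw theta* from q(theta; .)
    log_q(a, b):      log q(a; b), density of candidate b given current a
    Returns the S draws retained after discarding S0 burn-in draws.
    """
    theta = np.asarray(theta0, dtype=float)
    draws = []
    for s in range(S0 + S):                        # Step 5: repeat (with burn-in)
        cand = q_draw(theta)                       # Step 2: candidate draw
        # Step 3: acceptance probability, computed on the log scale
        log_alpha = min(0.0, log_post(cand) + log_q(cand, theta)
                           - log_post(theta) - log_q(theta, cand))
        if np.log(rng.uniform()) < log_alpha:      # Step 4: accept with prob alpha
            theta = cand
        if s >= S0:
            draws.append(theta.copy())             # keep post burn-in draws
    return np.array(draws)
```

$E[g(\theta) \mid y]$ is then estimated by averaging over the returned draws, e.g. np.mean([g(t) for t in draws]), which is Step 6.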
3.1 The Independence Chain Metropolis-Hastings Algorithm
• The Independence Chain Metropolis-Hastings algorithm uses a candidate generating density which is independent across draws. That is, $q(\theta^{(s-1)}; \theta) = q^*(\theta)$ and the candidate generating density does not depend on $\theta^{(s-1)}$.
• Useful in cases where a convenient approximation exists to the posterior. This convenient approximation can be used as a candidate generating density.
• Acceptance probability simplifies to:
$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^* \mid y)\, q^*(\theta = \theta^{(s-1)})}{p(\theta = \theta^{(s-1)} \mid y)\, q^*(\theta = \theta^*)},\; 1 \right].$$
• The independence chain Metropolis-Hastings algorithm is closely related to importance sampling. This can be seen by noting that, if we define weights analogous to the importance sampling weights (see Chapter 4, equation 4.38):
$$w(\theta_A) = \frac{p(\theta = \theta_A \mid y)}{q^*(\theta = \theta_A)},$$
the acceptance probability in (5.9) can be written as:
$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{w(\theta^*)}{w(\theta^{(s-1)})},\; 1 \right].$$
In words, the acceptance probability is simply the ratio of importance sampling weights evaluated at the candidate and old draws.
• Setting $q^*(\theta) = f_N\!\left(\theta \mid \hat{\theta}_{ML}, \widehat{\mathrm{var}}(\hat{\theta}_{ML})\right)$ can work well in some cases, where $ML$ denotes maximum likelihood estimates. See textbook pages 95-97 for more detail on choosing candidate generating densities.
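A sketch of this special case, assuming SciPy is available and using a Normal candidate built from hypothetical ML inputs theta_ml and V_ml (names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def independence_mh(log_post, theta_ml, V_ml, S, S0=1000):
    """Independence chain M-H with q*(theta) = f_N(theta | theta_ML, V_ml)."""
    q = multivariate_normal(mean=theta_ml, cov=V_ml)
    theta = np.asarray(theta_ml, dtype=float)
    draws = []
    for s in range(S0 + S):
        cand = np.atleast_1d(q.rvs(random_state=rng))
        # alpha = min[ p(cand|y) q*(theta) / (p(theta|y) q*(cand)), 1 ]
        log_alpha = min(0.0, log_post(cand) + q.logpdf(theta)
                           - log_post(theta) - q.logpdf(cand))
        if np.log(rng.uniform()) < log_alpha:
            theta = cand
        if s >= S0:
            draws.append(theta.copy())
    return np.array(draws)
```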
3.2 The Random Walk Chain Metropolis-Hastings Algorithm
• The Random Walk Chain Metropolis-Hastings algorithm is useful when you cannot find a good approximating density for the posterior.
• No attempt made to approximate posterior; rather, candidate generating density is chosen to wander widely, taking draws proportionately in various regions of the posterior.
• Generates candidate draws according to:
$$\theta^* = \theta^{(s-1)} + z,$$
where $z$ is called the increment random variable.
• The acceptance probability simplifies to:
$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^* \mid y)}{p(\theta = \theta^{(s-1)} \mid y)},\; 1 \right].$$
• Choice of density for $z$ determines form of candidate generating density.
• Common choice is Normal. $\theta^{(s-1)}$ is the mean and researcher must choose covariance matrix ($\Sigma$):
$$q(\theta^{(s-1)}; \theta) = f_N(\theta \mid \theta^{(s-1)}, \Sigma).$$
• Researcher must select $\Sigma$. Should be selected so that the acceptance probability tends to be neither too high nor too low.
• There is no general rule which gives the optimal acceptance rate. A rule of thumb is that the acceptance probability should be roughly 0.5.
• A common approach is to set $\Sigma = c\Omega$ where $c$ is a scalar and $\Omega$ is an estimate of the posterior covariance matrix of $\theta$. You can experiment with different values of $c$ until you find one which yields a reasonable acceptance probability.
• This approach requires finding $\Omega$, an estimate of $\mathrm{var}(\theta \mid y)$ (e.g. $\widehat{\mathrm{var}}(\hat{\theta}_{ML})$).
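A sketch of the random walk chain with $\Sigma = c\Omega$, returning the acceptance rate so that $c$ can be tuned; the names are illustrative, not from the textbook:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_walk_mh(log_post, theta0, Omega, c, S, S0=1000):
    """Random walk chain M-H: theta* = theta^(s-1) + z, z ~ N(0, c * Omega)."""
    L = np.linalg.cholesky(c * Omega)     # L @ e ~ N(0, c * Omega)
    theta = np.asarray(theta0, dtype=float)
    draws, accepts = [], 0
    for s in range(S0 + S):
        cand = theta + L @ rng.standard_normal(theta.size)
        # q is symmetric, so alpha = min[ p(cand|y) / p(theta|y), 1 ]
        accept = np.log(rng.uniform()) < min(0.0, log_post(cand) - log_post(theta))
        if accept:
            theta = cand
        if s >= S0:
            draws.append(theta.copy())
            accepts += accept
    return np.array(draws), accepts / S   # draws and acceptance rate
```

Rerunning this with different values of c until the returned acceptance rate is roughly 0.5 implements the tuning rule above.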
3.3 Metropolis-within-Gibbs
• Remember: the Gibbs sampler involved sequentially drawing from $p(\theta_{(1)} \mid y, \theta_{(2)})$ and $p(\theta_{(2)} \mid y, \theta_{(1)})$.
• Using a Metropolis-Hastings algorithm for either (or both) of the posterior conditionals used in the Gibbs sampler, $p(\theta_{(1)} \mid y, \theta_{(2)})$ and $p(\theta_{(2)} \mid y, \theta_{(1)})$, is perfectly acceptable.
• This statement is also true if the Gibbs sampler involves more than two blocks.
• Such Metropolis-within-Gibbs algorithms are common since many models have posteriors where most of the conditionals are easy to draw from, but one or two conditionals do not have convenient form.
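As an illustration of the idea (not an algorithm from the textbook), a two-block sketch where the first conditional is drawn directly and the second is handled by a random walk M-H step; draw_block1 and log_cond2 are hypothetical model-specific inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_within_gibbs(draw_block1, log_cond2, theta2, Sigma2, S):
    """Two-block Gibbs sampler with one M-H step.

    draw_block1(theta2):        direct draw from p(theta_(1) | y, theta_(2))
    log_cond2(theta2, theta1):  log p(theta_(2) | y, theta_(1)) + constant,
        sampled by a random walk M-H step since it has no convenient form
    """
    L = np.linalg.cholesky(Sigma2)
    draws = []
    for s in range(S):
        theta1 = draw_block1(theta2)               # ordinary Gibbs step
        cand = theta2 + L @ rng.standard_normal(theta2.size)
        log_alpha = min(0.0, log_cond2(cand, theta1) - log_cond2(theta2, theta1))
        if np.log(rng.uniform()) < log_alpha:      # M-H step for block 2
            theta2 = cand
        draws.append((np.copy(theta1), theta2.copy()))
    return draws
```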
4 A Measure of Model Fit: The Posterior Predictive P-Value
• Bayesians usually use marginal likelihoods/Bayes factors/posterior odds ratios to compare models.
• But these can be sensitive to choice of prior and often cannot be used with noninformative priors.
• Also, they can only be used to compare models relative to each other (e.g. "Model 1 is better than Model 2").
• Cannot be used as diagnostics of absolute model performance (e.g. cannot say "Model 1 is fitting well").
• Posterior predictive p-value is okay with noninformative priors and is an absolute measure of performance.
• Notation: $y$ is the data actually observed, and $y^\dagger$ is observable data which could be generated from the model under study.
• $g(\cdot)$ is function of interest.
• Its posterior, $p(g(y^\dagger) \mid y)$, summarizes everything our model says about $g(y^\dagger)$ after seeing the data.
• Tells us the types of data sets that our model can generate.
• Can calculate $g(y)$.
• If $g(y)$ is in extreme tails of $p(g(y^\dagger) \mid y)$, then $g(y)$ is not the sort of data characteristic that can plausibly be generated by the model.
• Formally, tail area probabilities similar to frequentist p-value calculations can be obtained.
• Posterior predictive p-value is the probability of the model yielding a data set more extreme than $g(y)$.
• To get $p(g(y^\dagger) \mid y)$, use simulation methods similar to predictive simulation.
• Draw from posterior, then simulate $y^\dagger$ at each draw.
5 Example: Posterior Predictive P-values in Nonlinear Regression Model
• Need to choose function of interest, $g(\cdot)$.
• Example:
$$y^\dagger_i = f(X_i, \gamma) + \varepsilon_i.$$
• We have assumed Normal errors. Is this a good assumption?
• Normal errors imply skewness and kurtosis measures below are zero:
$$\text{Skew} = \frac{\sqrt{N} \sum_{i=1}^{N} \varepsilon_i^3}{\left[\sum_{i=1}^{N} \varepsilon_i^2\right]^{3/2}}$$
$$\text{Kurt} = \frac{N \sum_{i=1}^{N} \varepsilon_i^4}{\left[\sum_{i=1}^{N} \varepsilon_i^2\right]^{2}} - 3.$$
• Use these as our functions of interest:
• $g(y) = E[\text{Skew} \mid y]$ or $E[\text{Kurt} \mid y]$, and $g(y^\dagger) = E[\text{Skew} \mid y^\dagger]$ or $E[\text{Kurt} \mid y^\dagger]$.
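The two statistics above translate directly into code; a minimal sketch, taking a NumPy vector of residuals as input:

```python
import numpy as np

def skew(e):
    """sqrt(N) * sum(e_i^3) / (sum(e_i^2))^(3/2); zero under Normality."""
    return np.sqrt(e.size) * np.sum(e**3) / np.sum(e**2) ** 1.5

def kurt(e):
    """N * sum(e_i^4) / (sum(e_i^2))^2 - 3; zero under Normality."""
    return e.size * np.sum(e**4) / np.sum(e**2) ** 2 - 3.0
```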
• Can show (by integrating out $h$) that
$$p(y^\dagger \mid \gamma) = f_t\!\left(y^\dagger \mid f(X, \gamma),\, s^2 I_N,\, N\right), \qquad (*)$$
where
$$s^2 = \frac{[y - f(X, \gamma)]' [y - f(X, \gamma)]}{N}.$$
• A program for doing this for Skew has the following form (Kurt is similar); a code sketch follows below.
• Step 1: Take a draw, $\gamma^{(s)}$, using the posterior simulator.
• Step 2: Generate a representative data set, $y^{\dagger(s)}$, from $p(y^\dagger \mid \gamma^{(s)})$ using (*).
• Step 3: Set $\varepsilon_i^{(s)} = y_i - f(X_i, \gamma^{(s)})$ for $i = 1, \dots, N$ and evaluate $\text{Skew}^{(s)}$.
• Step 4: Set $\varepsilon_i^{\dagger(s)} = y_i^{\dagger(s)} - f(X_i, \gamma^{(s)})$ for $i = 1, \dots, N$ and evaluate $\text{Skew}^{\dagger(s)}$.
• Step 5: Repeat Steps 1, 2, 3 and 4 $S$ times.
• Step 6: Take the average of the $S$ draws $\text{Skew}^{(1)}, \dots, \text{Skew}^{(S)}$ to get $E[\text{Skew} \mid y]$.
• Step 7: Calculate the proportion of the $S$ draws $\text{Skew}^{\dagger(1)}, \dots, \text{Skew}^{\dagger(S)}$ which are smaller than your estimate of $E[\text{Skew} \mid y]$ from Step 6.
• If the proportion in Step 7 is less than 0.5, this is the posterior predictive p-value. Otherwise the p-value is one minus this number.
• If the posterior predictive p-value is less than 0.05 (or 0.01), then this is evidence against the model (i.e. the model is unlikely to have generated data sets of the sort that was observed).
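A sketch of Steps 1-7 in Python; f and stat stand in for the mean function and statistic defined above, the multivariate t draw in Step 2 is built from a Normal draw scaled by an independent chi-square draw, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_predictive_pvalue(gamma_draws, y, X, f, stat):
    """Posterior predictive p-value for a residual statistic (skew or kurt).

    gamma_draws: posterior draws of gamma from an M-H sampler
    f(X, gamma): nonlinear mean function, e.g. the CES form
    stat(e):     function of residuals, e.g. skew or kurt above
    """
    N = y.shape[0]
    stats_obs, stats_rep = [], []
    for gam in gamma_draws:                        # Steps 1 and 5
        mu = f(X, gam)
        e = y - mu                                 # Step 3: in-sample residuals
        s2 = (e @ e) / N
        # Step 2: y-dagger ~ f_t(f(X, gamma), s^2 I_N, N), built from a
        # Normal draw and an independent chi-square draw with N d.o.f.
        z = rng.standard_normal(N)
        w = rng.chisquare(N)
        y_rep = mu + np.sqrt(s2) * z / np.sqrt(w / N)
        stats_obs.append(stat(e))                  # Skew^(s)
        stats_rep.append(stat(y_rep - mu))         # Step 4: Skew-dagger^(s)
    est = np.mean(stats_obs)                       # Step 6: E[Skew | y]
    prop = np.mean(np.array(stats_rep) < est)      # Step 7
    return min(prop, 1.0 - prop)                   # two-sided tail probability
```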
5.1 Example
• Textbook pages 107-111 have an empirical example with the nonlinear regression model (CES production function).
• For skewness it yields a posterior predictive p-value of 0.37.
• For kurtosis it yields a posterior predictive p-value of 0.38.
• Evidence that this model is fitting these features of the data well.
• See figures.
6 Calculating Marginal Likelihoods: The Gelfand-Dey Method
• Other main method of model comparison (posterior odds/Bayes factors) based on marginal likelihoods.
• Marginal likelihoods can be hard to calculate.
• Sometimes can work out analytical formula (e.g. Normal linear regression model with natural conjugate prior).
• If one model is nested inside another, Savage-Dickey density ratio can be used.
• But with nonlinear regression model, may wish to compare different choices for $f(\cdot)$: non-nested.
• There are a few methods which use posterior simulator output to calculate marginal likelihoods for general cases.
• Gelfand-Dey is one such method.
• Idea: inverse of the marginal likelihood for a model, $M_i$, which depends on parameter vector, $\theta$, can be written as $E[g(\theta) \mid y, M_i]$ for a particular choice of $g(\cdot)$.
• Posterior simulators such as Gibbs sampler or Metropolis-Hastings designed precisely to estimate such quantities.
• Theorem 5.1: The Gelfand-Dey Method of Marginal Likelihood Calculation
Let $p(\theta \mid M_i)$, $p(y \mid \theta, M_i)$ and $p(\theta \mid y, M_i)$ denote the prior, likelihood and posterior, respectively, for model $M_i$ defined on the region $\Theta$. If $f(\theta)$ is any p.d.f. with support contained in $\Theta$, then
$$E\left[ \frac{f(\theta)}{p(\theta \mid M_i)\, p(y \mid \theta, M_i)} \,\middle|\, y, M_i \right] = \frac{1}{p(y \mid M_i)}.$$
Proof: see textbook page 105.
• Theorem says for any p.d.f. $f(\theta)$, we can simply set:
$$g(\theta) = \frac{f(\theta)}{p(\theta \mid M_i)\, p(y \mid \theta, M_i)}$$
and use posterior simulator output to estimate $E[g(\theta) \mid y, M_i]$.
• Even $f(\theta) = 1$ works (in theory).
• But, to work well in practice, $f(\theta)$ must be chosen very carefully.
• Theory says it converges best if $\frac{f(\theta)}{p(\theta \mid M_i)\, p(y \mid \theta, M_i)}$ is bounded.
• In practice, $p(\theta \mid M_i)\, p(y \mid \theta, M_i)$ can be near zero in tails of posterior.
• One strategy: let $f(\cdot)$ be a Normal density similar to posterior, but with the tails chopped off.
• Let $\hat{\theta}$ and $\hat{\Sigma}$ be estimates of $E(\theta \mid y, M_i)$ and $\mathrm{var}(\theta \mid y, M_i)$ obtained from the posterior simulator.
• For some probability, $p \in (0, 1)$, let $\hat{\Theta}$ denote the support of $f(\theta)$, which is defined by
$$\hat{\Theta} = \left\{ \theta : (\hat{\theta} - \theta)' \hat{\Sigma}^{-1} (\hat{\theta} - \theta) \le \chi^2_{1-p}(k) \right\}.$$
• In words: chop off tails with $p$ probability in them.
• Let $f(\theta)$ be this Normal density truncated to the region $\hat{\Theta}$; a sketch of the resulting estimator follows below.
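Putting the pieces together, a sketch of the resulting estimator; log_prior and log_lik are hypothetical model-specific functions, and the computation is done in logs to avoid overflow:

```python
import numpy as np
from scipy.stats import chi2, multivariate_normal

def gelfand_dey(theta_draws, log_prior, log_lik, p=0.01):
    """Gelfand-Dey estimate of log p(y | M_i) from S posterior draws.

    theta_draws: (S, k) array of posterior draws of theta
    log_prior(theta), log_lik(theta): log p(theta|M_i), log p(y|theta,M_i)
    p: tail probability chopped off the Normal f(theta)
    """
    S, k = theta_draws.shape
    theta_hat = theta_draws.mean(axis=0)               # estimate of E(theta|y)
    Sigma_hat = np.cov(theta_draws, rowvar=False)      # estimate of var(theta|y)
    fN = multivariate_normal(mean=theta_hat, cov=Sigma_hat)
    Sigma_inv = np.linalg.inv(Sigma_hat)
    cutoff = chi2.ppf(1.0 - p, df=k)                   # chi^2_{1-p}(k)

    log_terms = []
    for th in theta_draws:
        d = theta_hat - th
        if d @ Sigma_inv @ d <= cutoff:                # th inside Theta-hat
            # truncated Normal density divided by prior times likelihood
            log_f = fN.logpdf(th) - np.log(1.0 - p)
            log_terms.append(log_f - log_prior(th) - log_lik(th))
    # draws outside Theta-hat contribute zero; the average of f/(prior*lik)
    # over all S draws estimates 1/p(y|M_i) (log-sum-exp for stability)
    m = max(log_terms)
    log_inv_ml = m + np.log(np.sum(np.exp(np.array(log_terms) - m)) / S)
    return -log_inv_ml
```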