EC 6310:
Advanced Econometric Theory
July 2008
Slides for Lecture on
Bayesian Computation in the Nonlinear Regression Model
Gary Koop, University of Strathclyde
1 Summary
• Readings: Chapter 5 of textbook.
• Nonlinear regression model is of interest in its own right, but also will allow us to introduce some widely useful Bayesian computational tools:
• Metropolis-Hastings algorithms (a way of doing posterior simulation).
• Posterior predictive p-values (a way of comparing models which does not involve marginal likelihoods).
• Gelfand-Dey method of marginal likelihood calculation.
2 The Nonlinear Regression Model
• Researchers typically work with the linear regression model:
$$y_i = \beta_1 + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i.$$
• In some cases nonlinear models can be made linear by transformation.
• For instance:
$$y_i = \beta_1 x_{i2}^{\beta_2} \cdots x_{ik}^{\beta_k}$$
can be logged to produce a linear functional form:
$$\ln(y_i) = \alpha_1 + \beta_2 \ln(x_{i2}) + \dots + \beta_k \ln(x_{ik}) + \varepsilon_i,$$
where $\alpha_1 = \ln(\beta_1)$.
• But some functional forms are intrinsically nonlinear.
• E.g. constant elasticity of substitution (CES) production function:
$$y_i = \left( \sum_{j=1}^{k} \gamma_j x_{ij}^{\gamma_{k+1}} \right)^{\frac{1}{\gamma_{k+1}}}.$$
• No way to transform CES to make it linear.
• Nonlinear regression model:
$$y_i = \left( \sum_{j=1}^{k} \gamma_j x_{ij}^{\gamma_{k+1}} \right)^{\frac{1}{\gamma_{k+1}}} + \varepsilon_i.$$
• General form:
$$y = f(X, \gamma) + \varepsilon,$$
where $y$, $X$ and $\varepsilon$ are defined as in the linear regression model (i.e. $\varepsilon$ is $N(0_N, h^{-1} I_N)$).
• $f(X, \gamma)$ is an $N$-vector of functions with typical element $f(X_i, \gamma)$.
• Properties of the Normal distribution give us the likelihood function:
$$p(y \mid \gamma, h) = \frac{h^{N/2}}{(2\pi)^{N/2}} \exp\left[-\frac{h}{2}\,\{y - f(X,\gamma)\}'\{y - f(X,\gamma)\}\right].$$
• Prior: any can be used, so let us just call it $p(\gamma, h)$.
• Posterior is proportional to likelihood times prior:
$$p(\gamma, h \mid y) \propto p(\gamma, h)\, \frac{h^{N/2}}{(2\pi)^{N/2}} \exp\left[-\frac{h}{2}\,\{y - f(X,\gamma)\}'\{y - f(X,\gamma)\}\right].$$
• No way to simplify this expression or recognize it as having a familiar form (e.g. it is not a Normal or t-distribution, etc.).
• How to do posterior simulation? Importance sampling is one possibility, but here we introduce another: Metropolis-Hastings.
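Although the posterior has no standard form, its kernel is easy to evaluate, which is all that posterior simulation requires. A minimal sketch in Python (the function names and the improper prior $p(\gamma, h) \propto 1/h$ are illustrative choices, not the textbook's):

```python
import numpy as np

def ces_f(X, gamma):
    """CES mean function: (sum_j gamma_j * x_ij^gamma_{k+1})^(1/gamma_{k+1})."""
    k = X.shape[1]
    coefs, power = gamma[:k], gamma[k]
    return (X ** power @ coefs) ** (1.0 / power)

def log_posterior_kernel(gamma, h, y, X):
    """log p(gamma, h | y) up to an additive constant.

    Uses the improper prior p(gamma, h) = 1/h purely for illustration;
    any prior p(gamma, h) the researcher prefers can be substituted.
    """
    N = y.shape[0]
    resid = y - ces_f(X, gamma)
    log_lik = 0.5 * N * np.log(h) - 0.5 * h * (resid @ resid)
    log_prior = -np.log(h)
    return log_lik + log_prior
```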
3 The Metropolis-Hastings Algorithm
• Notation: $\theta$ is a vector of parameters and $p(y \mid \theta)$, $p(\theta)$ and $p(\theta \mid y)$ are the likelihood, prior and posterior, respectively.
• Metropolis-Hastings algorithm takes draws from a convenient candidate generating density. Let $\theta^*$ indicate a draw taken from this density, which we denote as $q(\theta^{(s-1)}; \theta)$.
• Notation: $\theta^*$ is a draw taken of the random variable $\theta$ whose density depends on $\theta^{(s-1)}$.
• Notation: like the Gibbs sampler (but unlike importance sampling), the current draw depends on the previous draw. A "chain of draws" is produced. Thus, "Markov Chain Monte Carlo (MCMC)".
• Importance sampling corrects for the fact that the importance function differs from the posterior by weighting the draws differently from one another. With Metropolis-Hastings, we weight all draws equally, but not all the candidate draws are accepted.
The Metropolis-Hastings algorithm always takes the following form:
Step 1: Choose a starting value, $\theta^{(0)}$.
Step 2: Take a candidate draw, $\theta^*$, from the candidate generating density, $q(\theta^{(s-1)}; \theta)$.
Step 3: Calculate an acceptance probability, $\alpha(\theta^{(s-1)}, \theta^*)$.
Step 4: Set $\theta^{(s)} = \theta^*$ with probability $\alpha(\theta^{(s-1)}, \theta^*)$ and set $\theta^{(s)} = \theta^{(s-1)}$ with probability $1 - \alpha(\theta^{(s-1)}, \theta^*)$.
Step 5: Repeat Steps 2, 3 and 4 $S$ times.
Step 6: Take the average of the $S$ draws $g(\theta^{(1)}), \dots, g(\theta^{(S)})$.
These steps will yield an estimate of $E[g(\theta) \mid y]$ for any function of interest.
• Note: As with Gibbs sampling, the Metropolis-Hastings algorithm usually requires the choice of a starting value, $\theta^{(0)}$. To make sure that the effect of this starting value has vanished, it is usually wise to discard $S_0$ initial draws.
• Intuition for acceptance probability, $\alpha(\theta^{(s-1)}, \theta^*)$, given in textbook (pages 93-94):
$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^* \mid y)\, q(\theta^*; \theta = \theta^{(s-1)})}{p(\theta = \theta^{(s-1)} \mid y)\, q(\theta^{(s-1)}; \theta = \theta^*)},\; 1 \right].$$
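Putting Steps 1-6 and this acceptance probability together, a generic sketch in Python; the helper names log_post, q_draw and log_q are mine, standing in for whatever posterior kernel and candidate generating density the researcher supplies, and the computation is done on the log scale for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_post, q_draw, log_q, theta0, S, S0=1000):
    """Generic Metropolis-Hastings sampler.

    log_post(theta):  log posterior kernel, log p(theta | y) + constant
    q_draw(theta):    a candidate draw theta* from q(theta; .)
    log_q(a, b):      log q(a; b), density of candidate b given current a
    Returns the S draws retained after discarding S0 burn-in draws.
    """
    theta = np.asarray(theta0, dtype=float)
    draws = []
    for s in range(S0 + S):                        # Step 5: repeat (with burn-in)
        cand = q_draw(theta)                       # Step 2: candidate draw
        # Step 3: acceptance probability, computed on the log scale
        log_alpha = min(0.0, log_post(cand) + log_q(cand, theta)
                           - log_post(theta) - log_q(theta, cand))
        if np.log(rng.uniform()) < log_alpha:      # Step 4: accept with prob alpha
            theta = cand
        if s >= S0:
            draws.append(theta.copy())             # keep post burn-in draws
    return np.array(draws)
```

$E[g(\theta) \mid y]$ is then estimated by averaging over the returned draws, e.g. np.mean([g(t) for t in draws]), which is Step 6.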
3.1 The Independence Chain Metropolis-Hastings Algorithm
• The Independence Chain Metropolis-Hastings algorithm uses a candidate generating density which is independent across draws. That is, $q(\theta^{(s-1)}; \theta) = q^*(\theta)$ and the candidate generating density does not depend on $\theta^{(s-1)}$.
• Useful in cases where a convenient approximation exists to the posterior. This convenient approximation can be used as a candidate generating density.
• Acceptance probability simplifies to:
$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^* \mid y)\, q^*(\theta = \theta^{(s-1)})}{p(\theta = \theta^{(s-1)} \mid y)\, q^*(\theta = \theta^*)},\; 1 \right].$$
• The independence chain Metropolis-Hastings algorithm is closely related to importance sampling. This can be seen by noting that, if we define weights analogous to the importance sampling weights (see Chapter 4, equation 4.38):
$$w(\theta_A) = \frac{p(\theta = \theta_A \mid y)}{q^*(\theta = \theta_A)},$$
the acceptance probability in (5.9) can be written as:
$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{w(\theta^*)}{w(\theta^{(s-1)})},\; 1 \right].$$
In words, the acceptance probability is simply the ratio of importance sampling weights evaluated at the candidate and old draws.
• Setting $q^*(\theta) = f_N\!\left(\theta \mid \hat{\theta}_{ML}, \widehat{\mathrm{var}}(\hat{\theta}_{ML})\right)$ can work well in some cases, where $ML$ denotes maximum likelihood estimates. See textbook pages 95-97 for more detail on choosing candidate generating densities.
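A sketch of this special case, assuming SciPy is available and using a Normal candidate built from hypothetical ML inputs theta_ml and V_ml (names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def independence_mh(log_post, theta_ml, V_ml, S, S0=1000):
    """Independence chain M-H with q*(theta) = f_N(theta | theta_ML, V_ml)."""
    q = multivariate_normal(mean=theta_ml, cov=V_ml)
    theta = np.asarray(theta_ml, dtype=float)
    draws = []
    for s in range(S0 + S):
        cand = np.atleast_1d(q.rvs(random_state=rng))
        # alpha = min[ p(cand|y) q*(theta) / (p(theta|y) q*(cand)), 1 ]
        log_alpha = min(0.0, log_post(cand) + q.logpdf(theta)
                           - log_post(theta) - q.logpdf(cand))
        if np.log(rng.uniform()) < log_alpha:
            theta = cand
        if s >= S0:
            draws.append(theta.copy())
    return np.array(draws)
```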
3.2 The Random Walk Chain Metropolis-Hastings Algorithm
• The Random Walk Chain Metropolis-Hastings algorithm is useful when you cannot find a good approximating density for the posterior.
• No attempt made to approximate posterior; rather, candidate generating density is chosen to wander widely, taking draws proportionately in various regions of the posterior.
• Generates candidate draws according to:
$$\theta^* = \theta^{(s-1)} + z,$$
where $z$ is called the increment random variable.
• The acceptance probability simplifies to:
$$\alpha(\theta^{(s-1)}, \theta^*) = \min\left[ \frac{p(\theta = \theta^* \mid y)}{p(\theta = \theta^{(s-1)} \mid y)},\; 1 \right].$$
• Choice of density for $z$ determines form of candidate generating density.
• Common choice is Normal. $\theta^{(s-1)}$ is the mean and researcher must choose covariance matrix ($\Sigma$):
$$q(\theta^{(s-1)}; \theta) = f_N(\theta \mid \theta^{(s-1)}, \Sigma).$$
• Researcher must select $\Sigma$. Should be selected so that the acceptance probability tends to be neither too high nor too low.
• There is no general rule which gives the optimal acceptance rate. A rule of thumb is that the acceptance probability should be roughly 0.5.
• A common approach is to set $\Sigma = c\Omega$ where $c$ is a scalar and $\Omega$ is an estimate of the posterior covariance matrix of $\theta$. You can experiment with different values of $c$ until you find one which yields a reasonable acceptance probability.
• This approach requires finding $\Omega$, an estimate of $\mathrm{var}(\theta \mid y)$ (e.g. $\widehat{\mathrm{var}}(\hat{\theta}_{ML})$).
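A sketch of the random walk chain with $\Sigma = c\Omega$, returning the acceptance rate so that $c$ can be tuned; the names are illustrative, not from the textbook:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_walk_mh(log_post, theta0, Omega, c, S, S0=1000):
    """Random walk chain M-H: theta* = theta^(s-1) + z, z ~ N(0, c * Omega)."""
    L = np.linalg.cholesky(c * Omega)     # L @ e ~ N(0, c * Omega)
    theta = np.asarray(theta0, dtype=float)
    draws, accepts = [], 0
    for s in range(S0 + S):
        cand = theta + L @ rng.standard_normal(theta.size)
        # q is symmetric, so alpha = min[ p(cand|y) / p(theta|y), 1 ]
        accept = np.log(rng.uniform()) < min(0.0, log_post(cand) - log_post(theta))
        if accept:
            theta = cand
        if s >= S0:
            draws.append(theta.copy())
            accepts += accept
    return np.array(draws), accepts / S   # draws and acceptance rate
```

Rerunning this with different values of c until the returned acceptance rate is roughly 0.5 implements the tuning rule above.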
3.3 Metropolis-within-Gibbs
• Remember: the Gibbs sampler involved sequentially drawing from $p(\theta_{(1)} \mid y, \theta_{(2)})$ and $p(\theta_{(2)} \mid y, \theta_{(1)})$.
• Using a Metropolis-Hastings algorithm for either (or both) of the posterior conditionals used in the Gibbs sampler, $p(\theta_{(1)} \mid y, \theta_{(2)})$ and $p(\theta_{(2)} \mid y, \theta_{(1)})$, is perfectly acceptable.
• This statement is also true if the Gibbs sampler involves more than two blocks.
• Such Metropolis-within-Gibbs algorithms are common since many models have posteriors where most of the conditionals are easy to draw from, but one or two conditionals do not have convenient form.
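As an illustration of the idea (not an algorithm from the textbook), a two-block sketch where the first conditional is drawn directly and the second is handled by a random walk M-H step; draw_block1 and log_cond2 are hypothetical model-specific inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_within_gibbs(draw_block1, log_cond2, theta2, Sigma2, S):
    """Two-block Gibbs sampler with one M-H step.

    draw_block1(theta2):        direct draw from p(theta_(1) | y, theta_(2))
    log_cond2(theta2, theta1):  log p(theta_(2) | y, theta_(1)) + constant,
        sampled by a random walk M-H step since it has no convenient form
    """
    L = np.linalg.cholesky(Sigma2)
    draws = []
    for s in range(S):
        theta1 = draw_block1(theta2)               # ordinary Gibbs step
        cand = theta2 + L @ rng.standard_normal(theta2.size)
        log_alpha = min(0.0, log_cond2(cand, theta1) - log_cond2(theta2, theta1))
        if np.log(rng.uniform()) < log_alpha:      # M-H step for block 2
            theta2 = cand
        draws.append((np.copy(theta1), theta2.copy()))
    return draws
```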
4 A Measure of Model Fit: The Posterior Predictive P-Value
• Bayesians usually use marginal likelihoods/Bayes factors/posterior odds ratios to compare models.
• But these can be sensitive to choice of prior and often cannot be used with noninformative priors.
• Also, they can only be used to compare models relative to each other (e.g. "Model 1 is better than Model 2").
• Cannot be used as diagnostics of absolute model performance (e.g. cannot say "Model 1 is fitting well").
• Posterior predictive p-value is okay with noninformative priors and is an absolute measure of performance.
• Notation: $y$ is the data actually observed, and $y^\dagger$ is observable data which could be generated from the model under study.
• $g(\cdot)$ is function of interest.
• Its posterior, $p(g(y^\dagger) \mid y)$, summarizes everything our model says about $g(y^\dagger)$ after seeing the data.
• Tells us the types of data sets that our model can generate.
• Can calculate $g(y)$.
• If $g(y)$ is in extreme tails of $p(g(y^\dagger) \mid y)$, then $g(y)$ is not the sort of data characteristic that can plausibly be generated by the model.
• Formally, tail area probabilities similar to frequentist p-value calculations can be obtained.
• Posterior predictive p-value is the probability of the model yielding a data set more extreme than $g(y)$.
• To get $p(g(y^\dagger) \mid y)$, use simulation methods similar to predictive simulation.
• Draw from posterior, then simulate $y^\dagger$ at each draw.
5 Example: Posterior Predictive P-values in Nonlinear Regression Model
• Need to choose function of interest, $g(\cdot)$.
• Example:
$$y^\dagger_i = f(X_i, \gamma) + \varepsilon_i.$$
• We have assumed Normal errors. Is this a good assumption?
• Normal errors imply skewness and kurtosis measures below are zero:
$$\text{Skew} = \frac{\sqrt{N} \sum_{i=1}^{N} \varepsilon_i^3}{\left[\sum_{i=1}^{N} \varepsilon_i^2\right]^{3/2}}$$
$$\text{Kurt} = \frac{N \sum_{i=1}^{N} \varepsilon_i^4}{\left[\sum_{i=1}^{N} \varepsilon_i^2\right]^{2}} - 3.$$
• Use these as our functions of interest:
• $g(y) = E[\text{Skew} \mid y]$ or $E[\text{Kurt} \mid y]$, and $g(y^\dagger) = E[\text{Skew} \mid y^\dagger]$ or $E[\text{Kurt} \mid y^\dagger]$.
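The two statistics above translate directly into code; a minimal sketch, taking a NumPy vector of residuals as input:

```python
import numpy as np

def skew(e):
    """sqrt(N) * sum(e_i^3) / (sum(e_i^2))^(3/2); zero under Normality."""
    return np.sqrt(e.size) * np.sum(e**3) / np.sum(e**2) ** 1.5

def kurt(e):
    """N * sum(e_i^4) / (sum(e_i^2))^2 - 3; zero under Normality."""
    return e.size * np.sum(e**4) / np.sum(e**2) ** 2 - 3.0
```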
• Can show (by integrating out $h$) that
$$p(y^\dagger \mid \gamma) = f_t\!\left(y^\dagger \mid f(X, \gamma),\, s^2 I_N,\, N\right), \qquad (*)$$
where
$$s^2 = \frac{[y - f(X, \gamma)]' [y - f(X, \gamma)]}{N}.$$
• A program for doing this for Skew has the following form (Kurt is similar); a code sketch follows below.
• Step 1: Take a draw, $\gamma^{(s)}$, using the posterior simulator.
• Step 2: Generate a representative data set, $y^{\dagger(s)}$, from $p(y^\dagger \mid \gamma^{(s)})$ using (*).
• Step 3: Set $\varepsilon_i^{(s)} = y_i - f(X_i, \gamma^{(s)})$ for $i = 1, \dots, N$ and evaluate $\text{Skew}^{(s)}$.
• Step 4: Set $\varepsilon_i^{\dagger(s)} = y_i^{\dagger(s)} - f(X_i, \gamma^{(s)})$ for $i = 1, \dots, N$ and evaluate $\text{Skew}^{\dagger(s)}$.
• Step 5: Repeat Steps 1, 2, 3 and 4 $S$ times.
• Step 6: Take the average of the $S$ draws $\text{Skew}^{(1)}, \dots, \text{Skew}^{(S)}$ to get $E[\text{Skew} \mid y]$.
• Step 7: Calculate the proportion of the $S$ draws $\text{Skew}^{\dagger(1)}, \dots, \text{Skew}^{\dagger(S)}$ which are smaller than your estimate of $E[\text{Skew} \mid y]$ from Step 6.
• If the proportion in Step 7 is less than 0.5, this is the posterior predictive p-value. Otherwise the p-value is one minus this number.
• If the posterior predictive p-value is less than 0.05 (or 0.01), then this is evidence against the model (i.e. the model is unlikely to have generated data sets of the sort that was observed).
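A sketch of Steps 1-7 in Python; f and stat stand in for the mean function and statistic defined above, the multivariate t draw in Step 2 is built from a Normal draw scaled by an independent chi-square draw, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_predictive_pvalue(gamma_draws, y, X, f, stat):
    """Posterior predictive p-value for a residual statistic (skew or kurt).

    gamma_draws: posterior draws of gamma from an M-H sampler
    f(X, gamma): nonlinear mean function, e.g. the CES form
    stat(e):     function of residuals, e.g. skew or kurt above
    """
    N = y.shape[0]
    stats_obs, stats_rep = [], []
    for gam in gamma_draws:                        # Steps 1 and 5
        mu = f(X, gam)
        e = y - mu                                 # Step 3: in-sample residuals
        s2 = (e @ e) / N
        # Step 2: y-dagger ~ f_t(f(X, gamma), s^2 I_N, N), built from a
        # Normal draw and an independent chi-square draw with N d.o.f.
        z = rng.standard_normal(N)
        w = rng.chisquare(N)
        y_rep = mu + np.sqrt(s2) * z / np.sqrt(w / N)
        stats_obs.append(stat(e))                  # Skew^(s)
        stats_rep.append(stat(y_rep - mu))         # Step 4: Skew-dagger^(s)
    est = np.mean(stats_obs)                       # Step 6: E[Skew | y]
    prop = np.mean(np.array(stats_rep) < est)      # Step 7
    return min(prop, 1.0 - prop)                   # two-sided tail probability
```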
5.1 Example
• Textbook pages 107-111 have an empirical example with the nonlinear regression model (CES production function).
• For skewness it yields a posterior predictive p-value of 0.37.
• For kurtosis it yields a posterior predictive p-value of 0.38.
• Evidence that this model is fitting these features of the data well.
• See figures.
6 Calculating Marginal Likelihoods: The Gelfand-Dey Method
• Other main method of model comparison (posterior odds/Bayes factors) based on marginal likelihoods.
• Marginal likelihoods can be hard to calculate.
• Sometimes can work out analytical formula (e.g. Normal linear regression model with natural conjugate prior).
• If one model is nested inside another, Savage-Dickey density ratio can be used.
• But with nonlinear regression model, may wish to compare different choices for $f(\cdot)$: non-nested.
• There are a few methods which use posterior simulator output to calculate marginal likelihoods for general cases.
• Gelfand-Dey is one such method.
• Idea: inverse of the marginal likelihood for a model, $M_i$, which depends on parameter vector, $\theta$, can be written as $E[g(\theta) \mid y, M_i]$ for a particular choice of $g(\cdot)$.
• Posterior simulators such as Gibbs sampler or Metropolis-Hastings designed precisely to estimate such quantities.
• Theorem 5.1: The Gelfand-Dey Method of Marginal Likelihood Calculation
Let $p(\theta \mid M_i)$, $p(y \mid \theta, M_i)$ and $p(\theta \mid y, M_i)$ denote the prior, likelihood and posterior, respectively, for model $M_i$ defined on the region $\Theta$. If $f(\theta)$ is any p.d.f. with support contained in $\Theta$, then
$$E\left[ \frac{f(\theta)}{p(\theta \mid M_i)\, p(y \mid \theta, M_i)} \,\middle|\, y, M_i \right] = \frac{1}{p(y \mid M_i)}.$$
Proof: see textbook page 105.
• Theorem says for any p.d.f. $f(\theta)$, we can simply set:
$$g(\theta) = \frac{f(\theta)}{p(\theta \mid M_i)\, p(y \mid \theta, M_i)}$$
and use posterior simulator output to estimate $E[g(\theta) \mid y, M_i]$.
• Even $f(\theta) = 1$ works (in theory).
• But, to work well in practice, $f(\theta)$ must be chosen very carefully.
• Theory says it converges best if $\frac{f(\theta)}{p(\theta \mid M_i)\, p(y \mid \theta, M_i)}$ is bounded.
• In practice, $p(\theta \mid M_i)\, p(y \mid \theta, M_i)$ can be near zero in tails of posterior.
• One strategy: let $f(\cdot)$ be a Normal density similar to posterior, but with the tails chopped off.
• Let $\hat{\theta}$ and $\hat{\Sigma}$ be estimates of $E(\theta \mid y, M_i)$ and $\mathrm{var}(\theta \mid y, M_i)$ obtained from the posterior simulator.
• For some probability, $p \in (0, 1)$, let $\hat{\Theta}$ denote the support of $f(\theta)$, which is defined by
$$\hat{\Theta} = \left\{ \theta : (\hat{\theta} - \theta)' \hat{\Sigma}^{-1} (\hat{\theta} - \theta) \le \chi^2_{1-p}(k) \right\}.$$
• In words: chop off tails with $p$ probability in them.
• Let $f(\theta)$ be this Normal density truncated to the region $\hat{\Theta}$; a sketch of the resulting estimator follows below.
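Putting the pieces together, a sketch of the resulting estimator; log_prior and log_lik are hypothetical model-specific functions, and the computation is done in logs to avoid overflow:

```python
import numpy as np
from scipy.stats import chi2, multivariate_normal

def gelfand_dey(theta_draws, log_prior, log_lik, p=0.01):
    """Gelfand-Dey estimate of log p(y | M_i) from S posterior draws.

    theta_draws: (S, k) array of posterior draws of theta
    log_prior(theta), log_lik(theta): log p(theta|M_i), log p(y|theta,M_i)
    p: tail probability chopped off the Normal f(theta)
    """
    S, k = theta_draws.shape
    theta_hat = theta_draws.mean(axis=0)               # estimate of E(theta|y)
    Sigma_hat = np.cov(theta_draws, rowvar=False)      # estimate of var(theta|y)
    fN = multivariate_normal(mean=theta_hat, cov=Sigma_hat)
    Sigma_inv = np.linalg.inv(Sigma_hat)
    cutoff = chi2.ppf(1.0 - p, df=k)                   # chi^2_{1-p}(k)

    log_terms = []
    for th in theta_draws:
        d = theta_hat - th
        if d @ Sigma_inv @ d <= cutoff:                # th inside Theta-hat
            # truncated Normal density divided by prior times likelihood
            log_f = fN.logpdf(th) - np.log(1.0 - p)
            log_terms.append(log_f - log_prior(th) - log_lik(th))
    # draws outside Theta-hat contribute zero; the average of f/(prior*lik)
    # over all S draws estimates 1/p(y|M_i) (log-sum-exp for stability)
    m = max(log_terms)
    log_inv_ml = m + np.log(np.sum(np.exp(np.array(log_terms) - m)) / S)
    return -log_inv_ml
```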