Download - [Smith, Kohn, 2000] Nonparametric Seemingly Unrelated Regression

8/13/2019 [Smith, Kohn, 2000] Nonparametric Seemingly Unrelated Regression

1/25

*Corresponding author. Tel.:#612-9351-2787; fax:#612-9351-6409.

E-mail address:[email protected] (M. Smith).

Journal of Econometrics 98 (2000) 257}281

Nonparametric seemingly unrelated regression

Michael Smith*, Robert Kohn

Econometrics and Business Statistics, University of Sydney, Sydney, Australia

Australian Graduate School of Management, University of New South Wales, Kensington, New SouthWales, Australia

Received 19 March 1998; received in revised form 9 February 2000; accepted 13 March 2000

Abstract

A method is presented for simultaneously estimating a system of nonparametric

regressions which may seem unrelated, but where the errors are potentially correlatedbetween equations. We show that the advantage of estimating such a &seemingly unre-

lated' system of nonparametric regressions is that less observations can be required to

obtain reliable function estimates than if each of the regression equations is estimated

separately and the correlation ignored. This increase in e$ciency is investigated empiric-

ally using both simulated and real data. The method uses a Bayesian hierarchical

framework where each regression function is represented as a linear combination of

a large number of basis terms. All the regression coe$cients, and the variance matrix of

the errors, are estimated simultaneously by their posterior means. The computation is

carried out using a Markov chain Monte Carlo sampling scheme that employs a &focused

sampling' step to combat the high-dimensional representation of the unknown regression

functions. The methodology extends easily to other nonparametric multivariate regres-

sion models. 2000 Elsevier Science S.A. All rights reserved.

JEL classixcation: C11; C14; C15; C31

Keywords: Nonparametric multivariate regression; Bayesian hierarchical SUR model;

Multivariate subset selection; Markov Chain Monte Carlo

0304-4076/00/$ - see front matter 2000 Elsevier Science S.A. All rights reserved.

PII: S 0 3 0 4 - 4 0 7 6 ( 0 0 ) 0 0 0 1 8 - X


2/25

1. Introduction

The aim of nonparametric regression is to estimate regression functions

without assuming a priori knowledge of their functional forms. The price for this#exibility is that appreciably larger sample sizes are required to obtain reliable

nonparametric estimators than for parametric estimators. In this paper, we

consider a system of regression equations that can seem unrelated, but actually

are because their errors are correlated. Such a system of equations is called a set

of&seemingly unrelated'regressions, or a SUR model (Zellner, 1962). This paper

provides a Bayesian framework for reliably estimating the regression functions

in a nonparametric manner, even for moderate sample sizes, by taking advant-

age of the correlation structure in the errors. The most important consequenceof this work is to show that if the errors are correlated, better nonparametric

estimators are obtained by taking advantage of this correlation structure com-

pared to ignoring the correlation and estimating the equations one at a time.

Speci"cally, we consider the system ofm regression equations

y"f(x)#e, for i"1,2,2,m. (1.1)

Here, the superscript denotes that this is the ith ofmpossible regressions, y is

the dependent variable,xis a vector of independent variables and f

,2,f

arefunctions that require estimating in a nonparametric manner. As in the linear

Gaussian SUR model, the regressions are related through the correlation

structure of the Gaussian errors e. That is,

e&N(0, I

), (1.2)

wheree"(e, e,2, e), eis the vector of errors for thenobservations of theith regression and is a positive-de"nite (mm) matrix that also requires

estimation. This paper provides a data-driven procedure for estimating theunknown functions f (for i"1,2,m) and covariance matrix in this model.

Such systems of regressions are frequently used in econometric, "nancial and

sociological modeling because taking into account the correlation structure in

the errors results in more e$cient estimates than ignoring the correlation and

estimating the equations one at a time. Most of the literature on estimating

a system of equations assumes that the f are linear functions. For recent

examples, see Bartels et al. (1996), Min and Zellner (1993) and Mandy and

Martins-Filho (1993). However, in practice the functional forms of the f in

many regression applications are unknown a priori, so that an approach that

estimates their form is preferable. We examine two such cases here. The "rst

concerns print advertisements in a women's magazine and estimates the rela-

tionship between three measures of advertising exposure and the physical

positioning of advertisements in the magazine. The second involves estimating

an intra-day model for average electricity load in two adjacent Australian states.

In this example, we estimate the daily and weekly periodic components of load,

258 M. Smith, R. Kohn/Journal of Econometrics 98 (2000) 257}281


3/25

along with a temperature e!ect. In both examples, signi"cant nonlinear relation-

ships are identi"ed that are di$cult to discern using a parametric SUR ap-

proach. In addition, substantial correlation is estimated between the regressions

and the function estimates di!er substantially from those obtained by estimatingeach of the nonparametric regressions separately and ignoring the correlation

between the equations.

Our approach for estimating the system of equations de"ned at (1.1) and (1.2)

models each of the functions f as a linear combination of basis terms. We

develop a Bayesian hierarchical model to explicitly parameterize the possibility

that these terms may be super#uous and have corresponding coe$cient values

that are exactly zero. A wide variety of bases can be used, including many with

a desired structure, such as periodicity or additivity, a point which is demon-strated in the empirical examples. The unknown regression functions are esti-

mated by their posterior means which attach the proper posterior probability to

each subset of the basis elements, providing a nonparametric estimate that is

both#exible and smooth. We develop a Markov chain Monte Carlo (MCMC)

sampling scheme to calculate the posterior means because direct evaluation is

intractable. This sampling scheme is a correction of the &focused sampler'

discussed in Wong et al. (1997) and our empirical work shows it to be reliable

and much more e$cient than the Gibbs sampling alternative. We prove that theiterates of the focused sampler converge to the correct posterior distribution.

The performance of the new estimator is investigated empirically with a set of

simulation experiments that cover a range of potential regression curves. These

demonstrate the improvement that can be obtained by exploiting the correla-

tion structure in a system of regressions. We note that the solution to the

nonparametric SUR model discussed in this paper is easily extended to other

nonparametric multivariate (or vector) regression models.

Zellner (1962, 1963) provides the seminal analysis of a system of regressionswhen the unknown functionsfare assumed linear in the coe$cients. Srivastava

and Giles (1987) summarize much of the literature dealing with this linear SUR

model. However, recent advances in Markov chain Monte Carlo methods

enable Bayesian analyses of more complex variations of the SUR model. For

example, Chib and Greenberg (1995a) develop sampling schemes that estimate

a hierarchical linear SUR model with"rst-order vector autoregressive or vector

moving average errors and extend the analysis to a time varying parameter

model. Markov chain Monte Carlo methods also provide a solution to estima-

ting reliably nonparametric regressions in a variety of hitherto di$cult situ-

ations. For example, Smith and Kohn (1996) develop nonparametric regression

estimators for regression models where a data transformation may be required

and/or outliers may exist in the data. Yee and Wild (1996) use smoothing

splines to estimate a system of equations in a nonparametric manner, but they

do not have data-driven estimators for the smoothing parameters. In the

example in their paper they use values of the smoothing parameters based on the

M. Smith, R. Kohn/Journal of Econometrics 98 (2000) 257}281 259


4/25

independent variables, but not the dependent variable. Such an approach is an

unsatisfactory way of estimating the smoothing parameters because it does not

take into account the curvature exhibited by the dependent variable. Nor is it

fully automatic because the e!ective degrees of freedom is required as an inputfrom the user.

The paper is organized as follows. Section 2.1 discusses how to model the

unknown functions and why they are estimated using a hierarchical model. The

rest of Section 2 introduces the Bayesian hierarchical SUR model and develops

an e$cient MCMC sampling scheme to enable its estimation. Section 3 uses the

methodology to"t the print advertising and electricity load datasets. Section 4

contains simulation examples which investigate the improvements that can be

made using this estimation procedure over a series of separate nonparametricregressions. Appendix A provides the conditional posterior distributions em-

ployed in the sampling scheme, while Appendix B proves that the focused

sampling step provides an iterate from the correct invariant distribution.

2. Methodology

2.1. Basis representation of functions

Each regression function is modeled as a linear combination of basis func-

tions, so that for a function f,

f(x)"

b(x). (2.1)

Here, B"b

,2, b is a basis of p functions, while the

's are regression

parameters.A large number of authors use such linear decompositions with a variety of

univariate and higher dimensional bases in the single equation case. For

example, Friedman (1991), Smith and Kohn (1996) and Denison et al. (1998) use

regression splines, Luo and Wahba (1997) use several reproducing kernel bases

and Wahba (1990) uses natural splines. In particular, orthonormal bases, such as

wavelet (Donoho and Johnstone, 1994) or Fourier bases have been used.

However, the computational advantage provided by such orthonormal bases

does not easily extend to the case where the errors are correlated, such as in theSUR model. In the case of multiple regressors in an equation, additive models of

univariate bases or radial bases (Powell, 1987; Holmes and Mallick, 1998) can be

used.

Given a choice of a particular basis for the approximation at (2.1), the ith

regression at (1.1) can be written as the linear model

y"X#e. (2.2)



5/25

Here,yis the vector of then observations of the dependent variable, the design

matrix X"[bb2b], b is a vector of the values of the basis function

b

evaluated at the n observations and are the regression coe$cients. The

errors eare correlated with those from the other regressions, as speci"ed in (1.2),and we denote the number of basis terms in the ith equation as p. Note that

many basis expansions employ p*n basis terms and it is inappropriate to

estimate the regression coe$cients using existing SUR methodology because the

function estimates fK, i"1,2, m, would interpolate the data (rather than pro-

duce smooth estimates that account for the existence of noise in the regression).

Therefore, we estimate the regression parameters using a Bayesian hierarchical

SUR model described below that explicitly accounts for the possibility that

many of these terms may be redundant.

2.2. A Bayesian hierarchical SUR model

Consider theith regression of a linear SUR model given at Eq. (2.2), where the

design matrixXis (np) and the coe$cient vector is of lengthp. To accountexplicitly for the notion that variables in this regression can be redundant, we

introduce a vector of binary indicator variables "(

,

,2, ). Here,

corresponds to thekth element of the coe$cient vector of theith regression,say

, with

"0 if

"0 and

"1 if

O0. By dropping the redundant

terms with zero coe$cients, the ith regression can be rewritten, conditional on ,

as

y"X#e. (2.3)

Ifq"

, then the design matrixX is of size (nq) and is a vector of

q elements.

By stacking the linear models for them

regressions, the SUR model can bewritten, conditional on "(, ,2, ), as

y"X#e. (2.4)

Here,y"(y, y,2, y), X"diag(X , X ,2,X) and"( ,2, ).Ifq"q , thenX is an (nq ) matrix and a vector ofq elements. Tocomplete this Bayesian hierarchical model, we introduce the following priors on

the parameters.

(i) Following O'Hagan (1995) we construct a conditional prior for by setting

p( ,)Jp(y ,,)

so that , &N(( , n(XAX )), where A"I and( "(XAX)XAy. This data-based fractional prior contains much lessinformation about than the likelihood. The results obtained using thisprior are similar to those obtained using the prior , &N(0, nI), which



6/25

does not depend on the data. However, we prefer the data-based prior

because the conditional posterior mean of is unbiased.(ii) The prior for is taken as independent of and is the commonly used

non-informative prior discussed in Zellner (1971), wherep()J.(iii) The indicator variables

, k"1,2, p, i"1,2, m, are taken a priori

independently distributed, with p("1)".

(iv) The hyperparameters, i"1,2, m, are taken as independent and givena non-informative uniform prior on (0, 1).

We integrate the hyperparameters"(,2,) out of our analysis, so that

p()"p()p() d"

(

)

(1!)

d"

B(q

#1, p

!q

#1), where B is the beta function. This is a di!erent prior on than that

suggested in Smith and Kohn (1996). Note that the model here is a hierarchical

SUR model as, conditional on , it is simply a linear SUR model.

2.3. Markov chain Monte Carlo sampling

To estimate this model we use the following Markov chain Monte Carlo

sampling scheme with the hyperparameter integrated out.

(1) Generate from , , y(2) Generate from, , y" , , y(3) For i"1, 2,2, m

For j"1, 2,2, p

Generate from,

, yusing the sampling step described below.

In this sampling scheme is generated from a multivariate normal. Generationof the matrix directly from the posterior at step (2) is di$cult because the

fractional prior , is centered at ( , which is a function of. Conse-quently, we use a Metropolis}Hastings step where the proposal Wishart density

is the posterior under a #at conditional prior for . This works well withbetween 60% and 90% of those iterates that are generated being accepted.

Details of how to generate from the distributions at steps (1) and (2) are given in

Appendix A. It is important to note that care has been taken to generate

without conditioning on

at step (3), otherwise the sampling scheme would

be reducible because

is known exactly given.

Step (3) generates an iterate of one element at a time. As discussed in Wong

et al. (1997), using a Gibbs sampler is computationally demanding because has

p elements, with generation from each conditional posterior density

a computationally intensive exercise. To speed up the generation we use the

following &sampling' step, which is an application of the Metropolis}Hasting



7/25

algorithm. Let()"p(

,

, y) be the conditional posterior density of

ands(

)"p(

) be its conditional prior density, with integrated out in

both cases. Then, if is the previous value of, a new value can be

generated as follows.

Sampling Step

(a) Proposal

If"1 then generate from the proposal density

Q("1P"0)"s("0)min

1,(

"0)

s("0)

If"0 then generate from the proposal density

Q("0P"1)"s("1)min1,

("1)

s("1)

(b) Metropolis}Hastings acceptance probabilities

If "0 and "1, then accept with probability"min1, s(

"0)/(

"0); otherwise set"0.

If "1 and "0, then accept with probability"min1, s(

"1)/(

"1); otherwise set"1.

Appendix A calculates the posterior and prior s, with the latter beinga trivial calculation. Generating from the proposal in part (a) can be undertakene$ciently as follows.

First generate u from a uniform distribution on (0, 1), then

(i) if "1 and u(s("1), set"1.

(ii) if "1 and u's("1), generate from the density

p("0)"min(1,("0)/s(

"0)).

(iii) if"0 and u(s("0), then set"0.

(iv) if "0 and u's("0), generate from the density

p("1)"min(1,("1)/s(

"1)).

Generation is e$cient because in the nonparametric regression problem most

of the indicators will be zero ands("0) will be close to one. Hence, most of the

time"0 and u(s("0), so that case (iii) is undertaken most frequently

and is calculated infrequently.



8/25

Appendix B proves that the sampling method described above is a correct

application of the Metropolis}Hastings method and therefore that

()"p(

,

, y) is the invariant distribution of each step. The appendix

also includes a lemma demonstrating that the Metropolis}Hastings ratios atpart (b) of the sampling step will usually be one, or close to one. Because steps

(1)}(3) of the sampling scheme either generate directly from the conditional

posterior distributions, or use a Metropolis}Hastings step, the scheme con-

verges to its invariant distribution, which is the posterior distribution

, , y (Tierney, 1994).The sampling scheme is much faster because step (3) involves a much smaller

number of complex calculations than the full Gibbs sampler. Moreover, focus-

ing the generations is especially important in nonparametric SUR modelscompared to a single equation regression, because there are more basis terms.

We have found this sampler to have strong empirical convergence properties,

a point that is demonstrated in the examples in Sections 3 and 4. These sections

also compare the sampler to the Gibbs alternative and demonstrate that it is

more computationally e$cient.

A sampler that generates solely from the parameter space of is not con-

sidered as it is di$cult to calculate the posterior distribution

, y. Similarly,

samplers that generate from either the parameter space of (,

) or (, ) arenot considered because it is di$cult to recognize the conditional posterior

distribution, y, or calculate

, , y.

2.4. Parameter estimation

Given an initial state for the Markov chain and a &warmup period', after

which the sampler is assumed to have converged to the joint posterior distribu-

tion, we can collect iterates (

,

,

),2, (

,

,

) which forma Monte Carlo sample from the joint posterior distribution. It is this sample that

is used for inference.

The posterior mean E[y] is estimated by the histogram estimate K"((1/J)

). We do not use a mixture estimate because the distribution

of , , y is di$cult to identify (which is also the reason a Metropolis}Hastings step is used at step (2) of the sampler).

The posterior mean of the regression parameters, E[ y], is estimated using

the mixture estimate

K"1

J

E[, , y] (2.5)

Each of the conditional expectations in the sum is simple to calculate because

E[ , , y]"( , while elements ofthat are not common to are set tozero.



9/25


10/25

3.1. Print advertising data

We demonstrate our procedure using n"457 observations of data from six

issues of an Australian monthly women's magazine collected by Starch INRAHooper. Each observation corresponds to an advertisement placed in the

magazine and the following three advertisement exposure scores, which are

recorded from an experimental audience, are used as measures of the various

levels of e!ectiveness of the print advertisement.

y (noted score): Proportion of respondents who claim to recognize the ad as

having been seen by them in that issue.

y

(associated score): Proportion of the respondents who claim to havenoticed the advertiser's brand or company name or logo.

y (read-most score): Proportion of respondents who claim to have read

half or more of the copy.

These scores are considered to measure advertisement exposure at increasing

levels of depth. It has long been thought that the positioning of an advertisement

within an issue a!ects its exposure to an audience (Hanssens and Weitz, 1980).

To quantify this we constructed the variable P"(page number)/(number ofpages in issue) to represent the position in the issue in which each advertisement

appeared.

To estimate how the exposure of a print advertisement is a!ected by its

position in the magazine, we considered the three nonparametric regressions

y"f(P)#e for i"1, 2, 3. (3.1)

To model each unknown function we use a thin plate spline basis (Powell, 1987),with basis terms b(x)"x!k

log(x!k

), j"1,2, n, where k ,2, k are

the so-called &knots'which we set equal to the nobservations of the independent

variable. Following previous authors (Wahba, 1990) we also augment these

basis terms with an intercept and linear term, so that p"p"p"459.

Expected features in the functionsfinclude high casual attention to advertise-

ments placed in the front (and to a lesser extent back) of the magazine, while the

pre-editorial slots (where P is about 0.7) are thought to attract more indepth

attention. The three scoresy, yandyare highly positively correlated and it is

likely that the errors are correlated, so a SUR model appears appropriate.

In a parametric SUR model with the same independent variables the general-

ized and ordinary least-squares estimate of the regression coe$cients are the

same. However, this result does not extend to the case where variable selection is

undertaken on the terms in the regressions. In particular, this applies to the

nonparametric function estimators used in this paper as they are based on

variable selection and model averaging applied to linear basis decompositions.



11/25

Fig. 1. (a)}(c) Function estimates of f, f and f, respectively. Bold lines are the nonparametric

SUR estimates, while the dashed lines are the single equation estimates. Panels (d)}(f) contain scatterplots ofPversus the standardized uncorrelated residuals resulting from the nonparametric SUR "t.

The equations at (3.1) were estimated both as a SUR system and individually

with the analogous single equation estimator that ignores any correlation

between the equations. The resulting function estimates are plotted in

Figs. 1(a)}(c) and it can be seen that the SUR and equation-by-equationestimates di!er substantially. The estimate K from the SUR model is given

below (correlations in italics)

K"2.16610 2.10310 1.205100.956 2.23710 1.275100.822 0.856 0.99210

con"rming the existence of high correlation in the errors. The SUR function

estimates suggest that the front (and to a lesser extent) back of the magazine are

areas in which advertisements achieve higher average exposure; though this is

more prominent for the noted and associated scores, y, y, than for the read-

most score y. The pre-editorial slots also result in increased exposure, with

a particularly positive e!ect on indepth exposure, as measured by y. The

relationships are distinctly nonlinear and would be hard to discern a priori using

parametric SUR estimation. To help con"rm that the nonparameteric SUR



12/25

Table 1

Ratio of variances for the proposed sampler over the Gibbs sampler for the function estimates off,

f and f calculated at four points on the domain ofP

Eq. (1) Eq. (2) Eq. (3)

P"0.2 2.028 3.027 2.195

P"0.4 6.564 4.684 6.484

P"0.6 2.546 3.407 7.102

P"0.8 4.174 7.634 6.660

(NSUR) estimates correctly capture the nonlinear relationships between

y, y, y and P, we calculated the standardized uncorrelated residuals

r( , i"1, 2, 3. Figs. 1(d)}(f) plot these residuals againstP, and these con"rm that

there is no signi"cant nonlinear relationship between the residuals and P.

We also estimated the NSUR model using the sampling scheme outlined in

Section 2.3, but using a Gibbs sampler where the elementswere generated at

step (3) directly from their posteriors. With both sampling schemes, we used

a warmup period of 2000 iterations and a further 1000 iterates for inference,resulting in 45933000 generations from the conditional posterior at (A.1). Incomparison, the proposed sampler took only 91,511 computationally equivalent

generations, 2.2% of the number required by the Gibbs sampler. Because the

Gibbs sampler involves more generations, it would be expected to result in more

random iterates than the proposed sampler. To measure this, we calculated the

variances of the function estimates. Table 1 provides the ratio of the variance for

the proposed sampler over that of the Gibbs sampler. It does so for function

estimates calculated atP"

0.2, 0.4, 0.6, 0.8 for all three regression equations. Itreveals that the proposed sampler needs to be run for two to seven times as

many iterations as the Gibbs sampler to obtain function estimates with the same

variances. However, this increased number of iterations still takes only 4}14%

of the time required by the Gibbs sampler and is therefore substantially more

e$cient overall. We note that function estimates are averages of terms in a

stationary sequence whose variances are computed as in Box et al. (1994, p. 34).

3.2. Intra-day electricity load model

Short-term forecasting of regional electricity load based on intra-day data is

an important problem for electricity utilities and electricity regulatory authori-

ties. Load forecasts are required both to determine electricity dispatch schedules

and to price electricity in wholesale markets. Harvey and Koopman (1993) and

Smith (2000) consider nonparametric regressions with daily, weekly and possible

temperature e!ects as models for electricity load. We consider estimating such



13/25

a model for three weeks of half-hourly electricity load data from the adjacent

Australian states of New South Wales (NSW) and Victoria (VIC). Let and

be the half-hourly average electricity load of NSW and VIC, respectively,

OD and

O=

be the time of day and week of the half-hourly observationsnormalized to [0, 1) and be the temperature in the NSW state capital of

Sydney. We estimate the additive nonparametric SUR:

"f

(OD)#f

(O=)#f

()#e,

"f

(OD)#f

(O=)#e, (3.2)

where the periodic pro"le of electricity load has been decomposed into periodicdaily (f

) and weekly (f

) e!ects for both states. An additive temperature e!ect is

considered for NSW, but we do not include one for VIC because half-hourly

temperature data were not available to us for this state.

We model both f

and f

with a periodic quadratic reproducing basis (Luo

and Wahba, 1997) where b(x)"((x!k

!

)!1/12)/2. We place the knots

k

at all the possible observations, so that for the time of day e!ect

k"i/48, i"1,2, 48, and for the weekly e!ectk

"i/336, i"1,2, 336. This

basis was chosen because it ensures that the functions f , f , f and f are allperiodic on [0, 1). The temperature e!ect was modeled using a thin plate spline

with knots at all observations.

The estimated covariance is (correlation in italics)

K"29088.13 26093.55

0.5208 86299.88,

where var(e)(var(e) because we control for the variation in temperature in

the NSW equation. The positive correlation between the states is likely to be

due, in part, to common weather e!ects and television programming. Figs. 2(a)

and (b) plot the sum of the estimated daily and weekly e!ects for both states

againstO=, while the estimated additive Sydney temperature e!ect is given in

Fig. 3. Because the regressions are additive, we present the functions normalized

so thatf(0)"0. The data were collected during March 1998 (late Summer) and

higher temperatures result in increased load due to increased usage of air

conditioning. Estimates of the periodic pro"le are more detailed for NSW as the

model controls for temperature, while the periodic pro"les are more smooth for

VIC. The functions are highly nonlinear and, as Harvey and Koopman (1993)

highlight, are extremely di$cult to model using parametric nonlinear models.

The two equations at (3.2) were also estimated using the analogous single

equation estimator. Fig. 4 plots the di!erence between the SUR and single

equation estimates of the sum of the daily and weekly periodic e!ects. The range



14/25

Fig. 2. Sum of the estimated daily and weekly periodic e!ects for VIC (panel (a)) and NSW (panel

(b)). The dashed curves are those resulting from estimation using the proposed sampler and the bold

curves are those resulting from estimation using a full Gibbs sampler. Both estimates are plotted

together for comparison, although they are almost identical and are hard to visually distinguish.

of the di!erence is about 250 and 400 megawatts for VIC and NSW, respectively,

which is substantial compared to the estimates of

var(e) and

var(e).

Because the load pro"les devolve slowly during the year and are e!ectively only

static over periods of around two to three weeks, larger sample sizes cannot be

used and it is useful to exploit the correlation between equations.

To investigate the performance of the sampling scheme in Section 2.3 we also

estimated the nonparametric SUR model for the electricity load data using a full

Gibbs sampler which generated directly from each of the

p posterior

distributions of the binary indicatorsat step (3) of the sampling scheme. The



15/25

Fig. 3. Same as in Fig. 2, but for the estimated Sydney additive temperature e!ect fK

.

full Gibbs sampler was run using the same number of iterations for the warmup

and sampling periods as the proposed sampler. Figs. 2 and 3 compare theresulting curve estimates for the Gibbs sampler (bold lines) and proposed

sampler (dashed lines). The curve estimates are virtually identical and sometimes

the two curve estimates are so similar that the dashed lines cannot be made out.

It is di$cult to see how this could occur if the proposed sampler was not

converging to the same distribution as the Gibbs sampler and the sampling

scheme was not mixing well with respect to the indicator variables . The results

are identical regardless of initial starting state for either sampler.

4. Simulation experiments

4.1. Positively correlated univariate regressions

This simulation is concerned with the case where the errors are highly

correlated between regressions. It highlights the improvement in the quality of



16/25

Fig. 4. The di!erence between the SUR and single equation estimates of the sum of the daily and

weekly periodic e!ects. Panel (a) is for the Victorian equation and panel (b) is for the New South

Wales equation.

the estimates that can be obtained when such correlation is modeled rather thanignored. There are m"4 univariate regressions, with the covariance matrix

speci"ed below at (4.1). Note that the standard deviation of the errors

(var(e))"1 is high compared to the range of the functions.

Four true functions were carefully chosen to represent a wide variety of

possible relationships. These are f(x)"sin(8x) (which is highly oscillatory),f(x)"((x, 0.2, 0.25)#(x, 0.6, 0.2))/4, with (x, a, b) being a normal density

of meanaand standard deviation b, (which requires a locally adaptive estimator

as there are di!erent degrees of smoothness on the left and right of the function),

f(x)"1.5x (which was chosen as many relationships are often thought to

be linear) and f(x)"cos(2x) (which is a smooth nonlinear function). Theindependent variables for the four univariate regressions were x&U(0, 1),

x&U(0, 1) and

x

x&N0.5

0.5, 0.31 0.6

0.6 1



17/25

We generated n"100 data points from this true SUR model and applied the

nonparametric SUR estimator to this data. We use a quartic reproducing kernel

basis (Luo and Wahba, 1997) with basis terms

b(x)"

1

24x!x !1

2

!1

2x!x !1

2#

7

240 for i"1,2, n,augmented with an intercept and linear term.

To assess the quality of the resulting estimates of the four functions, we

calculated the mean squared di!erence between the function estimates and thetrue functions. This measure of distance between the two is de"ned as

MSD"

1

200

(fK(z

)!f(z

)),

where min(x)"z(z

(2(z

"max(x) is an evenly spaced grid over

the domain ofx. For the same data we also "t an analogous single equation

univariate nonparametric estimator to the four regressions. The mean squareddi!erence was also calculated for each of these four function estimates.

The entire process was repeated one hundred times. Figs. 5(a)}(d) give box-

plots of the one hundred resulting values of log(MSD) for each of the four

functions (i"1, 2, 3, 4) and for both the nonparametric SUR estimator

(NSUR) and individual nonparametric estimators (NR). Fig. 5 shows that

taking into account the correlation between the errors has substantially and

consistently improved the resulting estimates of all the regression functions.

To examine the qualitative improvement that occurs, we focus on the singledata set corresponding to the 50th sorted value of

MSD

for the non-

parametric SUR estimator. This data set can be regarded as providing a &typical'

example of the procedure and is plotted as four scatter plots in Figs. 5(e)}(h) and

again in Figs. 5(i)}(l). The nonparametric SUR estimates of the four functions for

this data appear in Figs. 5(e)}(h) and the estimates for the separate nonparamet-

ric regressions appear in Figs. 5(i)}(l). These "gures show that the nonparametric

SUR estimator signi"cantly outperforms the separate nonparametric estimators

which ignore the correlation between the separate regressions. The variance ofthe errors and its estimate K for this data set are given below.

"1 0.96 0.64 0.93

1 0.98 0.90

1 0.85

1 K"0.824 0.707 0.489 0.741

0.845 0.849 0.674 0.755

0.570 0.773 0.895 0.656

0.850 0.854 0.722 0.922 (4.1)



18/25

Fig. 5. (a)}(d) Boxplots of the log(MSD) fori"1, 2, 3 and 4, respectively. The left-hand boxplot is

for the NSUR estimator, while the right-hand boxplot is for the NR estimation procedure. Panels

(e)}(h) contain scatter plots ofxagainsty, along with the function estimatesfK(bold line) and true

functions f (dotted line) for i"1, 2, 3, 4 that result from applying the NSUR estimator. Panels

(i)}(l) plot the function estimatesfK(bold line) and true functionsf(dotted line) fori"1, 2, 3, 4 that

result from applying the NR procedure to the same data.



19/25

Table 2

Time taken by the NSUR procedure to complete a "t to data generated form the model in Section

4.1 for four di!erent sample sizes. In the sampling procedure 2000 iterates were used for the warmup

and the Monte Carlo sample consisted of a subsequent 1000 iterates. The machine used was a low

end workstation

Sample size n"100 n"200 n"400 n"800

Time (s) 43 58 213 3850

The estimate compares favorably to the&best possible'estimate K

that arises

from the sample variance of the true errors themselves, which are knownbecause this is a simulated example.

K"varY((e, e, e, e))"

0.842 0.702 0.449 0.698

0.821 0.869 0.673 0.732

0.522 0.770 0.879 0.603

0.823 0.850 0.697 0.854.To demonstrate that the NSUR estimator is practical to implement, we reportthe time required to"t this four equation system for a variety of sample sizes in

Table 2. The computer used was a low-end modern workstation while the code

was written e$ciently in FORTRAN. Although these timings are implementa-

tion dependent, they do indicate that this computationally intensive Markov

chain Monte Carlo procedure runs in a reasonable time.

4.2. Unrelated regressions

The previous simulation considers a related set of regressions and demon-

strates the improvements in the regression function estimates that can occur

when correlation between the regressions is modeled and estimated, rather than

ignored. However, is there a risk of degrading the function estimates by

modeling correlation that does not exist?

To investigate this case, we repeat the simulation undertaken above, except

that the regressions are unrelated, with"0.5I

. Fig. 6 provides the equivalent

output for this case as was produced in Fig. 5. It can be seen from the boxplots in

Figs. 6(a)}(d) that, in general, there is a slight deterioration in the log(MSD) for

the NSUR estimator compared to the NR estimation procedure. This is ex-

pected as the regressions are actually not related and the NSUR procedure also

estimates. However, the loss in the e$ciency of the function estimates is very

small and the function estimates from the NSUR estimator (Figs. 6(e)}(h)) are

almost identical to those from the NR procedure (Figs. 6(i)}(l)) for the median

dataset.



20/25

Fig. 6. As in Fig. 5, but for the uncorrelated model in Section 4.2.



21/25

Nevertheless, in examples whenm is large, reliably estimating them(m#1)/2

free parameters ofin addition to the unknown functions may prove di$cult.

In such cases, a more parsimonious representaton for can be used and

estimated in combination with the unknown functions using MCMC. Forexample, Shephard and Pitt (1998) discuss the MCMC estimation of factor

decompositions for , while Smith and Kohn (1999) examine parsimonious

representations for based on the Cholesky decomposition of.

Acknowledgements

The authors would like to thank Denzil Fiebig, Paul Kofman, Steve Marron,Gael Martin and Tom Smith for useful comments. They would also like to

thank two anonymous referees for improving the clarity of the presentation and

for pointing out an error in a previous implementation of the sampling scheme.

Both Michael Smith and Robert Kohn are grateful for the support of Australian

Research Council grants.

Appendix A

A.1. Generating from , , y

This conditional distribution can be calculated exactly, as

p( , , y)Jp(y , , )p( , )

Jexp!1

2n#1

n (!( )XAX (!( ),so that

, , y&N( , n

n#1(XAX ).

Here,( and A are de"ned in Section 2.2.

A.2. Generating from , , y

This conditional distribution is di$cult to recognize as is embedded

in the conditional prior for . Therefore, to obtain an iterate we use aMetropolis}Hastings step; see Chib and Greenberg (1995b) for an introduction



22/25

to this method. The proposal density from which we generate a candidate iterate

is given by

q(

)J

p(y ,

, )p(

)

Jexp!1

2tr()

which is a Wishart(, n, m) density. Here, is an (mm) matrix with ijthelement

"(y!X)(y!X). A newly generated iterate is ac-

cepted over the old value

with probability

"minp(

, , y)q( )

p(

, , y)q(), 1"min

p( , )p( , )

, 1.High acceptance rates of 60}90% are obtained because the proposal densityq( ) )

is equal to the correct conditional density except for the factor p( , ).

A.3. Calculating,

, y

This density requires calculation to enable generation of

in the sampling

scheme.

p(,

, y)Jp(y, , )p( , ) dp()

J(n#

1)

exp

!

1

2S(,

)

p(

) (A.1)

where S(, )"yAy!yAX(XAX )XAyand in Eq. (A.1) the regres-sion coe$cient is integrated out using

&N( , n

n#1(XAX).

The conditional posterior can then be calculated by evaluating (A.1) for "1and

"0 and normalizing.

A.4. Calculatingp(

)

The conditional prior can be calculated by integrating out the hyperprior,with p(

)J

()(1!) d"B(q#1, p!q#1), so that



23/25

p("1

)"1/(1#h), where h"(p!a)/(a#1) and a"

is the

number of elements of

that are one.

Appendix B

This appendix provides a proof that the invariant distribution of the proposed

sampling step in Section 2.3 is()"p(

,

, y).

Proof is as follows. As is a binary variable, the proof is undertaken bycalculating the Metropolis}Hastings ratio for the two transitions

("0P"1) and ("1P"0) and showing that it is applied

correctly in step (b) of the sampling step. Let Q(P) be the proposaldensity, then

Q(1P0)"s(0)min1,(0)/s(0) and Q(0P1)"s(1)min1,(1)/s(1)

(B.1)

The Metropolis}Hastings ratios are

"min1,

(1)Q(1P0)(0)Q(0P1), "min1,

(0)Q(0P1)(1)Q(1P0).

Because (1)"1!(0) and s(1)"1!s(0), (1)/s(1)'1 i! (0)/s(0)(1 and(1)/s(1)(1 i!(0)/s(0)'1. Using these inequalities and substituting the pro-posal densities at (B.1) into the ratios, it can be seen that

"min(1, s(0)/(0))

and"min(1, s(1)/(1)). These correspond to the ratios applied in part (b) of

the sampling step. Hence the sampling step is reversible and its invariant

distribution is().

Lemma.Let(0)"p(y"0,

,)and(1)"p(y

"1,

,)be the likeli-

hoods for"0 and

"1. Then,

"s(0)#s(1)min(1,(1)/ (0)) and

"s(1)#s(0)min(1,(0)/ (1))

(B.2)

ProofNote that

(1)"s(1)(1)

s(1)(1)#s(0)(0) and (0)"

s(0)(0)

s(1)(1)#s(0)(0).

Substituting these into the expressions for

and

found in part (b) results in

the expressions at (B.2).



24/25

The importance of this lemma is it reveals that the Metropolis}Hastings

ratios will usually be one, or close to one. Because s(1)#s(0)"1, "1 if

(1)/ (0)*1, while"1 if(0)/ (1)*1. Also, even if(1)/ (0)(1, because

in the nonparametric regression problem s(0) is usually very close to 1,'s(0) is close to 1.

References

Bartels, R., Fiebig, D., Plumb, M., 1996. Gas or Electricity, which is cheaper?: an econometric

approach with an application to Australian expendicture data. Energy Journal 17, 33}58.

Box, G., Jenkins, G., Reinsel, G., 1994. Time Series Analysis, 3rd Edition. Prentice-Hall, New Jersey.Chib, S., Greenberg, E., 1995a. Hierarchical analysis of SUR models with extensions to correlated

serial errors and time varying parameter models. Journal of Econometrics 68, 339}360.

Chib, S., Greenberg, E., 1995b. Understanding the Metropolis}Hastings algorithm. The American

Statisitician 49, 327}335.

Denison, D., Mallick, B., Smith, A., 1998. Automatic bayesian curve "tting. Journal of the Royal

Statistical Society, Series B 60, 333}350.

Donoho, D., Johnstone, I., 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81,

425}455.

Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics (with dis-

cussion) 19, 1}141.Hanssens, D., Weitz, B., 1980. The e!ectiveness of industrial print advertisements across product

categories. Journal of Marketing Research 17, 294}306.

Harvey, A., Koopman, S., 1993. Forecasting hourly electricity demand using time-varying splines.

Journal of the American Statistical Association 88 (424), 1228}1253.

O'Hagan, A., 1995. Fractional Bayes factors for model comparison (with discussion). Journal of

Royal Statistical Society Series B 57, 99}138.

Holmes, C., Mallick, B., 1998. Bayesian radial basis functions of variable dimension. Neural

Computation 10, 1217}1233.

Luo, Z., Wahba, G., 1997. Hybrid adaptive splines. Journal of the American Statistical Association

92, 107}116.

Mandy, D., Martins-Filho, C., 1993. Seemingly unrelated regressions under additive heteroskedas-

ticity: theory and share equation applications. Journal of Econometrics 58, 315}346.

Min, C., Zellner, A., 1993. Bayesian and non-Bayesian methods for combining models and forecasts

with applications to forecasting international growth rates. Journal of Econometrics 56, 89}118.

Powell, M., 1987. Radial basis functions for multivariate interpolation: a review. In: Mason, J., Cox,

M. (Eds.), Algorithms for Approximation.

Shephard, N., Pitt, M., 1998. Analysis of time varying covariances: a factor stochastic volatility

approach. Preprint.

Smith, M., 2000. Modeling and short term forecasting of new south Wales electricity system load.Journal of Business and Economic Statistics, to appear.

Smith, M., Kohn, R., 1996. Nonparametric regression via Bayesian variable selection. Journal of

Econometrics 75 (2), 317}344.

Smith, M., Kohn, R., 1999. Bayesian parsimonious covariance matrix estimation. Preprint.

Srivastava, V., Giles, D., 1987. Seemingly Unrelated Regression Equations Models. Marcel Dekker,

New York.

Tierney, L., 1994. Markov chains for exploring posterior distributions. The Annals of Statistics 22,

1701}1762.



25/25

Wahba, G., 1990. Spline Models for Observational Data. SIAM, Philadelphia.

Wong, F., Hansen, M., Kohn, R., Smith, M., 1997. Focused sampling and its application to

nonparametric and robust regression. Preprint.

Yee, T., Wild, C., 1996. Vector generalised additive models. Journal of the Royal Statistical Society,

Series B 58, 481}493.Zellner, A., 1962. An e$cient method of estimating seemingly unrelated regression equations and

tests for aggregation bias. Journal of the American Statistical Association 57, 500}509.

Zellner, A., 1963. Estimators for seemingly unrelated regression equations: some exact "nite sample

results. Journal of the American Statistical Association 58, 977}992.

Zellner, A., 1971. An Introduction to Bayesian Inference in Econometrics. Wiley, New York.


Download - [Smith, Kohn, 2000] Nonparametric Seemingly Unrelated Regression

Top Related