
Journal of Econometrics 107 (2002) 313–326
www.elsevier.com/locate/econbase

Uses of entropy and divergence measures for evaluating econometric approximations and inference

Aman Ullah

Department of Economics, University of California, Riverside, CA 92521, USA

E-mail address: [email protected] (A. Ullah).

Abstract

This paper provides uses of the Kullback and Leibler divergence measure in two directions. First, a new result on the asymptotic expansion of the Kullback and Leibler divergence is provided which can be used for comparing two distributions. An application comparing two alternative approximate distributions with the unknown true exact distribution of a Student t-statistic in a dynamic regression is given. Second, we explore the application of the Kullback and Leibler divergence measure to non-parametric estimation and specification testing in regression models. A non-parametric specification test is proposed. © 2002 Elsevier Science B.V. All rights reserved.

JEL classification: C12; C13; C14

Keywords: Entropy; Divergence; Non-parametric; Inference

1. Introduction

Since the seminal works of Shannon (1948), Kullback and Leibler (1951), and Jaynes (1957), an increasing number of applications of entropy and Kullback–Leibler divergence measures have appeared in various applied sciences; for example, see the books by Kapur and Kesavan (1992) and Cover and Thomas (1991). For the applications in econometrics, see the recent surveys by Maasoumi (1993) and Ullah (1996) and the book by Golan et al. (1996). However, it is clear from these works that in many areas of econometric research the usefulness of divergence measures has not been explored fully.

The modest aim of this paper is to look into the application of the Kullback–Leibler divergence measure in two directions, the first of which is in the area of evaluating the accuracy of approximations in econometrics.




For example, in the extensive literature on finite sample econometrics the exact densities of econometric estimators and of test statistics are usually unknown due to the difficulty in deriving them. In view of this, researchers have developed approximations of these unknown exact densities by the Nagar (1959) large-n and Kadane (1971) small-σ approximation techniques. The question of comparing the accuracy of these two alternative approximations has been explored in Srivastava et al. (1980), Ullah (1980) and Kiviet and Philips (1993) in the case where the exact results are known, and in the important work by Atukorala (1999), where the Monte Carlo evaluation of the Kullback–Leibler divergence measure is done by specifying a known distribution. In this paper, we explore the above question by providing a very general analytical result on the expectation of a real-valued function of a non-i.i.d. random vector whose first four moments are finite. Then we provide, as a special case of this, the analytical result for the Kullback–Leibler divergence, which can be used to evaluate approximations and to perform other econometric analyses. We note that our general result can also be used to evaluate the moments of a large class of econometric estimators and test statistics. The second direction of the application of the Kullback–Leibler divergence is towards non-parametric estimation and hypothesis testing. Here, we merely look into the non-parametric kernel regression estimators using the principle of minimum local Kullback–Leibler divergence and we derive an F-type non-parametric test statistic for various specification testing problems in econometrics.

The plan of the paper is as follows. In Section 2 we present the entropy and divergence measures. Then in Section 3 we develop our main results. Section 4 provides the results on non-parametric estimation and specification testing. Finally, in Section 5 we present our conclusion.

2. Entropy and divergence functions

In this section, we present the basic entropy and divergence measures used in statistics and econometrics.

The entropy of an n × 1 continuous random vector y = (y_1, ..., y_n)′ is given as

$$H(g) = -\int_{-\infty}^{\infty} g(y)\,\log g(y)\,dy = -E\log g(y), \tag{1}$$

where g(y) is the joint probability density function for the continuous distribution G. The entropy provides a measure of information (uncertainty) in the sense that it describes the difficulty of predicting an outcome of a random variable. A highly concentrated distribution has a smaller entropy (easier to predict its outcomes and more informative) than a less concentrated distribution. In the literature, therefore, −H(g) has often been used as an information measure for the random variable y with the density g(y). In general, for independent random variables

$$H(g) = \sum_{i=1}^{n} H(g_i), \quad \text{where } g_i = g(y_i).$$


As an example, if y ∼ N(μ, Σ), where Ey = μ is an n × 1 mean vector and V(y) = Σ = ((σ_{ij})), i, j = 1, ..., n, is the covariance matrix, then

$$H(g) = \frac{n}{2}\,(1 + \log 2\pi) + \frac{1}{2}\,\log|\Sigma|. \tag{2}$$

Now we turn to the divergence measure. Let g(y) and f(y) be two densities of the random vector y. Then the Kullback–Leibler (1951) divergence (discrepancy) or information measure of f from g is given by

$$I(g, f) = \int_{-\infty}^{\infty} g(y)\,\log\frac{g(y)}{f(y)}\,dy = E\log\!\left(\frac{g(y)}{f(y)}\right) = -H(g) - E\log f(y). \tag{3}$$

Some of the properties of I(g, f) are as follows:
(i) I(g, f) ≥ 0; the equality holds if and only if g(y) = f(y) almost everywhere.
(ii) I(g, f) is convex in g and f.
(iii) For mutually independent random variables,

$$I(g, f) = \sum_{i=1}^{n} I(g(y_i), f(y_i)).$$

The smaller the value of I(g, f), the closer f(y) is to g(y).

The Kullback–Leibler information measure can also be used to compare two density functions f_1(y) and f_2(y) relative to g(y). This is given by the difference

$$D(f_1, f_2) = I(g, f_1) - I(g, f_2) = \int \log\!\left(\frac{f_2(y)}{f_1(y)}\right) g(y)\,dy = E\!\left[\log\frac{f_2(y)}{f_1(y)}\right]. \tag{4}$$

If D > 0, f_2 is closer to g, and when D < 0, f_1 is closer to g.

As an example of (3), the discrepancy between two n-dimensional normal distributions g = N(μ_g, Σ_g) and f = N(μ_f, Σ_f) is

$$I(g, f) = \tfrac{1}{2}\left[\operatorname{tr}(\Sigma_g\Sigma_f^{-1}) - \log|\Sigma_g\Sigma_f^{-1}| - n\right] + \tfrac{1}{2}\,(\mu_f - \mu_g)'\,\Sigma_f^{-1}\,(\mu_f - \mu_g), \tag{5}$$

where tr represents the trace of a matrix. Thus, in the normal distribution case, the discrepancy between densities is explained by the discrepancy in the covariance matrices and the discrepancy in the means.
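
As a concrete illustration of (2) and (5), the following sketch evaluates both closed forms with NumPy. The function names and the example covariance matrix are our own choices for illustration, not part of the paper.

```python
import numpy as np

def normal_entropy(Sigma):
    """Entropy (2) of an n-variate normal N(mu, Sigma)."""
    n = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * n * (1.0 + np.log(2.0 * np.pi)) + 0.5 * logdet

def normal_kl(mu_g, Sigma_g, mu_f, Sigma_f):
    """Divergence (5) of f = N(mu_f, Sigma_f) from g = N(mu_g, Sigma_g)."""
    n = len(mu_g)
    Sf_inv = np.linalg.inv(Sigma_f)
    A = Sigma_g @ Sf_inv
    _, logdetA = np.linalg.slogdet(A)
    d = np.asarray(mu_f) - np.asarray(mu_g)
    return 0.5 * (np.trace(A) - logdetA - n) + 0.5 * d @ Sf_inv @ d

# I(g, f) = 0 when the two normals coincide, and grows as f moves away from g.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(normal_entropy(Sigma))                        # H(g) for g = N(mu, Sigma)
print(normal_kl(mu, Sigma, mu, Sigma))              # 0.0
print(normal_kl(mu, Sigma, mu + 1.0, 2.0 * Sigma))  # > 0
```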

We note that H, I and D above depend on an unknown g, and hence their values are not determined. In practice, therefore, one can calculate their estimated values based on the data y_i, i = 1, ..., m, from g.


For example,

$$\hat{H}(g) = -\frac{1}{m}\sum_{i=1}^{m}\log\hat{g}(y_i), \tag{6}$$

where

$$\hat{g}(y_i) = \frac{1}{m a^{n}}\sum_{j=1}^{m} K\!\left(\frac{y_j - y_i}{a}\right) \tag{7}$$

is the non-parametric kernel density estimator with window width a and kernel K; see Silverman (1986) and Pagan and Ullah (1999). Similarly, when f(y) is known,

$$\hat{I}(g, f) = -\hat{H}(g) - \frac{1}{m}\sum_{i=1}^{m}\log f(y_i) \tag{8}$$

and, when f(y) is unknown, $\hat{I}(g, f) = -\hat{H}(g) - \frac{1}{m}\sum\log\hat{f}(y_i)$, where $\hat{f}$ is the kernel density estimator. Further,

$$\hat{D}(f_1, f_2) = \frac{1}{m}\sum_{i=1}^{m}\log\frac{f_2(y_i)}{f_1(y_i)}. \tag{9}$$
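
A minimal sketch of the plug-in estimators (6)–(9), assuming a Gaussian product kernel and a user-chosen window width a; the data matrix holds m draws of the n-dimensional vector y, and all function names are ours.

```python
import numpy as np

def kernel_density(y, a):
    """Kernel density estimate (7) at each sample point; y is (m, n), a is the window width."""
    m, n = y.shape
    u = (y[None, :, :] - y[:, None, :]) / a              # (y_j - y_i)/a, shape (m, m, n)
    K = np.exp(-0.5 * (u ** 2).sum(axis=2)) / (2 * np.pi) ** (n / 2)
    return K.sum(axis=1) / (m * a ** n)                  # g_hat(y_i), i = 1, ..., m

def entropy_hat(y, a):
    """Plug-in entropy estimate (6)."""
    return -np.mean(np.log(kernel_density(y, a)))

def kl_hat(y, a, logf):
    """Plug-in divergence estimate (8) when f is known; logf returns log f(y_i)."""
    return -entropy_hat(y, a) - np.mean(logf(y))

def d_hat(y, logf1, logf2):
    """Plug-in relative divergence estimate (9)."""
    return np.mean(logf2(y) - logf1(y))

# Example with scalar data from g = N(0, 1) and f also taken to be N(0, 1),
# so I(g, f) should be close to zero (up to estimation error in g_hat).
rng = np.random.default_rng(0)
y = rng.normal(size=(500, 1))
logf = lambda y: -0.5 * np.log(2 * np.pi) - 0.5 * y[:, 0] ** 2
print(entropy_hat(y, a=0.5), kl_hat(y, a=0.5, logf=logf))
```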

Alternative estimators of H, I and D are, respectively, $\tilde{H} = -\int \hat{g}(y)\log\hat{g}(y)\,dy$, $\tilde{I} = \int \hat{g}(y)\log(\hat{g}(y)/f(y))\,dy$ and $\tilde{D} = \int \hat{g}(y)\log(f_2(y)/f_1(y))\,dy$, where $\hat{g}(y)$ is given by (7). But these would involve tedious calculations. Under the i.i.d. assumption,

$$E\hat{D}(f_1, f_2) = D(f_1, f_2), \qquad V(\hat{D}(f_1, f_2)) = \frac{1}{m}\,V\!\left[\log\frac{f_2}{f_1}\right]. \tag{10}$$

In many inferential problems, however, where the subject of study is an econometric estimator or test statistic (y = t or F or an estimator), the above estimators may not be useful unless Monte Carlo samples are drawn from a known g. This is the case, for example, in studying the maximum entropy of a t or F ratio whose density g is not known, in studying the discrepancy of an approximate distribution from an exact distribution which is not known, and in comparing two approximating densities f_1 and f_2 of an unknown exact density g. For such situations the analytical results in Section 3 can be useful. Section 4 looks into the applications of (8) and (9) for non-parametric estimation and hypothesis testing.

3. Asymptotic expansions of H, I and D

3.1. Main results

It was noted above that the evaluation of the entropy H, and hence of the discrepancy measures I and D, may be quite difficult for an unknown g. Further, an analytical expression for them may be difficult to obtain if the random vector y under consideration is an econometric estimator or test statistic whose density function is usually unknown.


In what follows, assuming the disturbances to be small, we present a small-σ approximation for the expectation of a general real-valued function of the random vector y,

$$q = h(y), \tag{11}$$

where y has the density g(y) and is such that

$$y = \mu + \sigma u, \tag{12}$$

where μ = Ey and u is an n × 1 disturbance vector. We assume that, for i = 1, ..., n, the u_i are non-i.i.d. such that

$$Eu_i = 0, \quad Eu_iu_j = \sigma_{ij}, \quad Eu_iu_ju_k = \sigma_{ijk}, \quad Eu_iu_ju_ku_l = \sigma_{ijkl}. \tag{13}$$

For the special case of i.i.d. disturbances, Eu_i = 0, Eu_i² = 1, Eu_i³ = γ₁ and Eu_i⁴ = γ₂ + 3, where γ₁ and γ₂ are Pearson's measures of skewness and kurtosis of the distribution. For the normal distribution γ₁ and γ₂ are zero, while for a symmetric distribution only γ₁ is zero. Thus, non-zero values of γ₁ and γ₂ indicate a departure from normality. We also assume that the fourth-order derivatives of h(y) exist in the neighborhood of Ey = μ; that is,

$$\left|\frac{\partial^{s} h(y)}{\partial y_1^{s_1}\cdots\partial y_n^{s_n}}\right| < \infty, \tag{14}$$

where s = s_1 + ··· + s_n = 1, 2, 3, 4.

Our objective is to present, under (12) and (13), the small-σ expansion of the expected value of h(y), that is,

$$Eq = Eh(y); \tag{15}$$

for details on the small-σ approximation, see Kadane (1971) and Ullah and Srivastava (1994). Our results here hold for fairly general h(y) and for any density g(y) having finite moments of at least order four. Once the result (15) is obtained, the results for H, I and D follow as special cases of h(y).

Theorem 1. If the disturbances follow (13) and (14) holds, then the expectation in (15) to O(σ⁴) is

$$Eh(y) = h(\mu) + \sigma^2\theta_2 + \sigma^3\theta_3 + \sigma^4\theta_4, \tag{16}$$

where

$$\theta_2 = \frac{1}{2}\sum_i\sum_j \sigma_{ij}\,\psi_{ij}, \qquad \theta_3 = \frac{1}{6}\sum_i\sum_j\sum_k \sigma_{ijk}\,\psi_{ijk}, \qquad \theta_4 = \frac{1}{24}\sum_i\sum_j\sum_k\sum_l \sigma_{ijkl}\,\psi_{ijkl},$$

$$\psi_{ij} = \frac{\partial^2 h(y)}{\partial y_i\,\partial y_j}, \qquad \psi_{ijk} = \frac{\partial^3 h(y)}{\partial y_i\,\partial y_j\,\partial y_k} \qquad \text{and} \qquad \psi_{ijkl} = \frac{\partial^4 h(y)}{\partial y_i\,\partial y_j\,\partial y_k\,\partial y_l} \tag{17}$$

are all evaluated at y = μ, and i, j, k, l = 1, ..., n.

The derivation of Theorem 1 is straightforward and is given in Section 3.4.


Corollary 1. Under the i.i.d. assumption on the disturbance vector in (12), and under (14), the expectation in (15) to O(σ⁴) is

$$Eh(y) = h(\mu) + \sigma^2\lambda_2 + \sigma^3\gamma_1\lambda_3 + \sigma^4\,[\lambda_4\gamma_2 + 3\lambda_{22}], \tag{18}$$

where, for s = 2, 3, 4,

$$\lambda_s = \frac{1}{s!}\sum_i\left[\frac{\partial^s h(y)}{\partial y_i^s}\right]_{y=\mu}, \qquad \lambda_{22} = \frac{1}{4!}\sum_i\sum_j\left[\frac{\partial^4 h(y)}{\partial y_i^2\,\partial y_j^2}\right]_{y=\mu}. \tag{19}$$

This result follows from Theorem 1; see also Ullah et al. (1995).
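
A quick numerical check of Corollary 1 in the scalar i.i.d. normal case (so γ₁ = γ₂ = 0) can be run as below. The choice h(y) = e^y and the parameter values are ours; for this h the exact expectation of h(μ + σu) is available, so the O(σ⁴) expansion can be compared with it.

```python
import numpy as np

# Scalar case of Corollary 1 with normal disturbances (gamma_1 = gamma_2 = 0):
#   E h(mu + sigma*u)  ~  h(mu) + sigma^2 h''(mu)/2 + 3 sigma^4 h''''(mu)/24.
# For h(y) = exp(y) all derivatives equal h(mu) and the exact mean is exp(mu + sigma^2/2).
mu, sigma = 0.3, 0.2
expansion = np.exp(mu) * (1 + sigma ** 2 / 2 + 3 * sigma ** 4 / 24)
exact = np.exp(mu + sigma ** 2 / 2)
print(expansion, exact)   # the two agree up to a remainder of order sigma^6
```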

Corollary 2. If the disturbances follow (13), then the expectation in Theorem 1 gives the entropy H(g) in (1), the Kullback–Leibler divergence I(g, f) in (3), and the relative difference D(f_1, f_2) in (4), respectively, to O(σ⁴), as

$$H(g) = -[\log g(\mu) + \sigma^2\theta_{2g} + \sigma^3\theta_{3g} + \sigma^4\theta_{4g}], \tag{20}$$

$$I(g, f) = \log\!\left(\frac{g(\mu)}{f(\mu)}\right) + \sigma^2(\theta_{2g} - \theta_{2f}) + \sigma^3(\theta_{3g} - \theta_{3f}) + \sigma^4(\theta_{4g} - \theta_{4f}), \tag{21}$$

$$D(f_1, f_2) = \log\!\left(\frac{f_2(\mu)}{f_1(\mu)}\right) + \sigma^2(\theta_{2f_2} - \theta_{2f_1}) + \sigma^3(\theta_{3f_2} - \theta_{3f_1}) + \sigma^4(\theta_{4f_2} - \theta_{4f_1}), \tag{22}$$

where θ_{pg}, θ_{pf}, θ_{pf_1} and θ_{pf_2}, for p = 2, 3, 4, are as given in (17) with h(y) = log g(y), log f(y), log f_1(y) and log f_2(y), respectively.

This result follows from Theorem 1 by using appropriate substitutions. For example, in the case of entropy, we note that H(g) in (20) is obtained by writing h(y) = log g(y) in (16) and (17). Similarly, the results in (21) and (22) follow by using h(y) as log f(y), log f_1(y) and log f_2(y). This completes the derivation of Corollary 2. In practice, the following results on the derivatives are useful. Defining d_i = ∂g(y)/∂y_i, with d_{ij}, d_{ijk} and d_{ijkl} corresponding to ψ_{ij}, ψ_{ijk} and ψ_{ijkl} in (17) with log g(y) replaced by g(y), we can verify that

$$\psi_{ij} = (g(y))^{-1}d_{ij} - (g(y))^{-2}d_id_j,$$

$$\psi_{ijk} = (g(y))^{-1}d_{ijk} - (g(y))^{-2}(d_id_{jk} + d_jd_{ik} + d_kd_{ij}) + 2(g(y))^{-3}d_id_jd_k,$$

$$\psi_{ijkl} = (g(y))^{-1}d_{ijkl} - (g(y))^{-2}(d_ld_{ijk} + d_id_{jkl} + d_jd_{ikl} + d_kd_{ijl} + d_{il}d_{jk} + d_{jl}d_{ik} + d_{kl}d_{ij})$$
$$\qquad + 2(g(y))^{-3}(d_ld_id_{jk} + d_ld_jd_{ik} + d_ld_kd_{ij} + d_id_kd_{jl} + d_id_jd_{kl} + d_{il}d_jd_k) - 6(g(y))^{-4}d_id_jd_kd_l.$$

All these terms are to be evaluated at y = μ.
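
The scalar versions of these chain-rule identities can be checked symbolically. The sketch below, which is our own illustration rather than part of the paper, differentiates log g(y) for a generic density g and prints the second, third and fourth derivatives; with all indices set equal, they match the ψ expressions above.

```python
import sympy as sp

y = sp.symbols('y')
g = sp.Function('g')

# d^s/dy^s log g(y) for s = 2, 3, 4: the scalar analogues of psi_ij, psi_ijk, psi_ijkl.
for s in (2, 3, 4):
    print(s, sp.simplify(sp.diff(sp.log(g(y)), y, s)))
```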

3.2. Remarks on the results

The following observations may be deduced from the results given above. First, the result in Theorem 1 provides the asymptotic expansion of the mean of a general function h(y) of the non-normal and non-i.i.d. vector y.


The results for the entropy and Kullback–Leibler divergence measures follow as special cases of this general result, and they are given in Corollary 2.

Further, since a large number of econometric estimators and test statistics can be expressed as functions of the vector y, their moments can be derived from this general result. For example, the coefficient of variation (cv) of an economic variable is s/ȳ, where s² = Σ(y_i − ȳ)²/n and ȳ = Σy_i/n, so that cv = s/ȳ = h(y). Taking the second to fourth order derivatives of cv and substituting them in (16), we get E(cv). Similarly, the goodness-of-fit measure R² in a linear regression, the Durbin–Watson statistic for testing serial correlation, the least squares estimator of the lag coefficient (see (26)) in a dynamic AR(1) model, and the two stage least squares estimator in a structural equation, among others, can be written as h(y) which is a ratio of quadratic forms in y. Although not done here, their moments easily follow from Theorem 1. For the i.i.d. case, see Ullah and Srivastava (1994) and Ullah et al. (1995) for some of these examples. The higher order moments also follow from the direct use of Theorem 1. For example, for the rth moment of h(y) we need to evaluate E h^r(y), and this expectation can be obtained by applying Theorem 1 with h(y) replaced by h^r(y). The result in Theorem 1 is also an important new result in the sense that it provides a unified way of developing finite sample econometrics results for errors which are non-i.i.d. and also non-normal. This includes errors which are dependent and/or heteroskedastic. In this sense, Theorem 1 provides a significant improvement over the finite sample literature often developed for i.i.d. normal errors and over the recent results of Ullah et al. (1995) for i.i.d. non-normal errors.

Second, from Theorem 1 and Corollaries 1 and 2, we observe that, up to O(σ²), the results for both the normal and non-normal cases are the same. This is interesting since the first moment of econometric estimators is often analyzed by the terms up to O(σ²). However, up to O(σ⁴), the behavior of the results in the non-normal case can be quite different from those in the normal case.

Third, we observe from Theorem 1 and Corollary 1 that the results in general depend crucially on the derivatives of h(y). Similarly, from Corollary 2 it clearly follows that the magnitudes of entropy and divergence depend on the curvature properties of the densities. Indeed, for flat densities the terms of O(σ²) to O(σ⁴) are 0. Also, if f_1 is a flat density and f_2 is convex then, up to O(σ²), D(f_1, f_2) > 0; that is, f_2 is closer to g than f_1.

3.3. Example

To see a more practical application of the results (20)–(22) in Corollary 2, we consider the comparison of the closeness of the large-n approximate density (f_1) and the small-σ approximate density (f_2) with the unknown exact density of the t-ratio of the coefficient of the lagged dependent variable in a dynamic model. For this purpose, let us consider the dynamic model, for i = 1, ..., n,

$$y_i = \rho y_{i-1} + x_i'\beta + \sigma\varepsilon_i, \tag{23}$$


where |ρ| < 1, y_i is the ith observation on y, x_i is the ith observation on the q × 1 vector of regressors, ρ and β are parameters, and ε_i is the disturbance term. We can write it as $y_i = \rho^i y_0 + \sum_{s=0}^{i-1}\rho^s x_{i-s}'\beta + \sigma\sum_{s=0}^{i-1}\rho^s\varepsilon_{i-s}$, or in matrix notation

$$y = a y_0 + LX\beta + \sigma L\varepsilon = \mu^* + \sigma u, \tag{24}$$

where y = (y_1, ..., y_n)′ is an n × 1 vector, $\mu^* = ay_0 + LX\beta = \rho\mu^*_{-1} + X\beta$, u = Lε, X is an n × q matrix, ε is an n × 1 vector, and a and L are n × 1 and n × n, respectively, given by

$$a = \begin{pmatrix}\rho\\ \rho^2\\ \vdots\\ \rho^{n}\end{pmatrix}, \qquad L = \begin{pmatrix}1 & 0 & 0 & \cdots & 0\\ \rho & 1 & 0 & \cdots & 0\\ \rho^2 & \rho & 1 & \cdots & 0\\ \vdots & \vdots & \vdots & & \vdots\\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1\end{pmatrix}. \tag{25}$$

We assume that ε is normally distributed with mean vector zero and covariance matrix Σ. Thus, from (24), y ∼ N(μ*, Ω) with Ω = σ²LΣL′. The least squares estimator of ρ in (23) is

$$\hat{\rho} = \frac{y_{-1}'My}{y_{-1}'My_{-1}}, \tag{26}$$

where M = I − X(X′X)⁻¹X′. The t-ratio for testing H₀: ρ = ρ₀ against H₁: ρ ≠ ρ₀ is given by

$$t = \frac{\hat{\rho} - \rho_0}{\sqrt{AV(\hat{\rho})}}, \tag{27}$$

where $AV(\hat{\rho}) = s^2(y_{-1}'My_{-1})^{-1}$ and s² is the sum of squares of the least squares residuals divided by n. It is well known that as n → ∞ the distribution of t is normal, N(0, 1), given by

$$f_1(t) = \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}. \tag{28}$$

Further, Nankervis and Savin (1988) have shown that as σ → 0 the distribution of t is the Student t with n − (q + 1) = r degrees of freedom, that is,

$$f_2(t) = \frac{\Gamma\!\left(\frac{r+1}{2}\right)}{\sqrt{r\pi}\,\Gamma\!\left(\frac{r}{2}\right)}\left(1 + \frac{t^2}{r}\right)^{-(r+1)/2}. \tag{29}$$

Our objective here is to see which of these two asymptotic densities is closer to the unknown density g(t) of the variable t whose first two moments are, say, E(t) = μ and V(t) = σ², so that t = μ + σu is as in (12) with y replaced by the single variable t.


Table 1
Values of D(f_1, f_2) for different values of μ, σ² and degrees of freedom r

                  r = 10                          r = 30
  μ       σ² = 5   σ² = 15   σ² = 35     σ² = 5   σ² = 15   σ² = 35
  0        −0.26     −0.76     −1.26      −0.09     −0.25     −0.59
 10        43.30     48.71     59.52      26.35     31.99     43.28
 20       143.98    149.42    160.31     132.50    137.98    148.96
 40       728.53    733.67    743.96     714.02    719.19    729.53

                  r = 50                          r = 70
  μ       σ² = 5   σ² = 15   σ² = 35     σ² = 5   σ² = 15   σ² = 35
  0        −0.05     −0.15     −0.35      −0.04     −0.12     −0.29
 10         9.24     14.80     25.94       2.65      6.13     17.08
 20       189.88    195.01    205.27     166.95    172.26    182.88
 40       786.54    791.57    801.63     757.54    762.63    772.81

Theorem 1 and Corollaries 1 and 2 can then be written with y replaced by t. For our objective here we consider the relative divergence result D(f_1, f_2) in (22), up to O(σ²), for the variable t. This can be written as

$$D(f_1, f_2) = \log\!\left(\frac{c_r}{c}\right) + \frac{\mu^2}{2} - \frac{(r+1)}{2}\,\log\!\left(1 + \frac{\mu^2}{r}\right) + \frac{\sigma^2}{2}\left(1 - \frac{(r+1)(r-\mu^2)}{(r+\mu^2)^2}\right), \tag{30}$$

where

$$c_r = \frac{\Gamma\!\left(\frac{r+1}{2}\right)}{\sqrt{r\pi}\;\Gamma\!\left(\frac{r}{2}\right)} \quad \text{and} \quad c = \frac{1}{\sqrt{2\pi}}. \tag{31}$$

This D(f_1, f_2) is evaluated in Table 1 for various values of μ, σ² and the degrees of freedom r. The results in Table 1 show that D(f_1, f_2) > 0 for non-zero values of μ, and D(f_1, f_2) < 0 for μ = 0. It is well known that ρ̂ in (26) is a biased estimator in small to moderately large samples, hence Et = μ will not be zero unless n → ∞. In view of this, the density f_2, the asymptotic t distribution, is closer to the unknown density g(t) of t than the asymptotic normal f_1. This result suggests using the asymptotic t distribution f_2 for hypothesis testing based on (27). However, if μ is close to zero then the asymptotic normal density f_1 will be better than f_2.
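
For reference, (30)–(31) are easy to evaluate directly; the sketch below does so using SciPy's log-gamma function. The function name and the sample arguments are ours, and the two calls only illustrate the sign pattern discussed above (D < 0 at μ = 0, D > 0 once μ moves away from zero); no attempt is made to reproduce Table 1 itself.

```python
import numpy as np
from scipy.special import gammaln

def d_f1_f2(mu, sigma2, r):
    """Relative divergence (30) of f1 = N(0, 1) and f2 = Student t_r, to O(sigma^2)."""
    log_cr = gammaln((r + 1) / 2) - gammaln(r / 2) - 0.5 * np.log(r * np.pi)   # log c_r in (31)
    log_c = -0.5 * np.log(2 * np.pi)                                           # log c in (31)
    mean_part = log_cr - log_c + mu ** 2 / 2 - (r + 1) / 2 * np.log(1 + mu ** 2 / r)
    curv_part = sigma2 / 2 * (1 - (r + 1) * (r - mu ** 2) / (r + mu ** 2) ** 2)
    return mean_part + curv_part

print(d_f1_f2(mu=0.0, sigma2=5.0, r=10))    # negative: f1 (normal) closer to g
print(d_f1_f2(mu=10.0, sigma2=5.0, r=10))   # positive: f2 (Student t) closer to g
```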


3.4. Derivation of results

Let us substitute (12) in (11) and write h(y) = h(μ + σu). Further, expanding h(μ + σu) around μ for small σ and retaining terms to O(σ⁴), we get

$$h(y) = h(\mu) + \sigma\xi_1 + \sigma^2\xi_2 + \sigma^3\xi_3 + \sigma^4\xi_4, \tag{32}$$

where

$$\xi_1 = \sum_i u_i\left[\frac{\partial h(y)}{\partial y_i}\right]_{y=\mu}, \qquad \xi_2 = \frac{1}{2}\sum_i\sum_j u_iu_j\left[\frac{\partial^2 h(y)}{\partial y_i\,\partial y_j}\right]_{y=\mu}, \tag{33}$$

$$\xi_3 = \frac{1}{6}\sum_i\sum_j\sum_k u_iu_ju_k\left[\frac{\partial^3 h(y)}{\partial y_i\,\partial y_j\,\partial y_k}\right]_{y=\mu}, \tag{34}$$

$$\xi_4 = \frac{1}{24}\sum_i\sum_j\sum_k\sum_l u_iu_ju_ku_l\left[\frac{\partial^4 h(y)}{\partial y_i\,\partial y_j\,\partial y_k\,\partial y_l}\right]_{y=\mu}. \tag{35}$$

Now, using (13), we note that

$$E\xi_1 = 0, \qquad E\xi_2 = \frac{1}{2}\sum_i\sum_j\sigma_{ij}\psi_{ij}, \qquad E\xi_3 = \frac{1}{6}\sum_i\sum_j\sum_k\sigma_{ijk}\psi_{ijk} \tag{36}$$

and

$$E\xi_4 = \frac{1}{24}\sum_i\sum_j\sum_k\sum_l\sigma_{ijkl}\psi_{ijkl}, \tag{37}$$

where ψ_{ij}, ψ_{ijk} and ψ_{ijkl} are as in (17). Finally, taking expectations on both sides of (32) and using (36) and (37), we get the result in Theorem 1.

4. Kullback–Leibler divergence based non-parametric estimation and testing

4.1. Estimation

Suppose that we have n data points {y_i, x_i}, i = 1, ..., n, sampled from the non-parametric regression model

$$y_i = m(x_i) + u_i, \tag{38}$$

where y_i is the dependent variable, x_i is a 1 × q vector of regressors, m(x_i) = E(y_i | x_i) is the true but unknown regression function, and the u_i are a sequence of i.i.d. random variables from N(0, σ²).

If m(x_i) = m(x_i, δ) is an a priori specified parametric regression, then one can construct the estimator of m(x_i) as $\hat{m}(x_i) = m(x_i, \hat{\delta})$, where $\hat{\delta}$ is the maximum likelihood (ML) or minimum Kullback–Leibler divergence estimator of δ obtained by maximizing the average conditional log-density

$$\frac{1}{n}\sum_{i=1}^{n}\log f(y_i) - \int f(y)\,dy = \int\log f(y)\,dF(y) - \int f(y)\,dy, \tag{39}$$


where $\log f(y_i) = \log f(y_i, \delta) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - m(x_i, \delta))^2$ and $F(y) = n^{-1}\sum I(y_i \le y)$ is the empirical distribution function. Since ∫ f(y) dy = 1, maximization of (39) is the same as maximization of the log-likelihood Σ log f(y_i). Writing the objective function as in (39) also shows that the maximization of the average log-likelihood is the same as the maximization of Shannon's entropy or the minimization of the Kullback–Leibler divergence measure of f(y) from a uniform density. Thus, $\hat{\delta}$ is also the minimum Kullback–Leibler divergence estimator.

We note that if m(x_i, δ) is not a correctly specified model, then the estimator $\hat{m}(x_i)$ will be an inconsistent estimator of m(x_i). In this case, an alternative is to estimate m(x_i) by a consistent non-parametric regression estimator. One such estimator can be obtained by considering m(x_i) = m(x_i, δ) = m(x), a constant at the point x, and maximizing the local average log-density

$$\frac{1}{nh^q}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)\log f(y_i) = -\frac{1}{2nh^q}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)\left(\log(2\pi\sigma^2) + \frac{1}{\sigma^2}(y_i - m(x))^2\right) \tag{40}$$

with respect to m(x), where K((x_i − x)/h) is a decreasing function of the distance of the regressor vector x_i from the point x = (x_1, ..., x_q), and h → 0 as n → ∞ is the window width which determines how rapidly the weights decrease as the distance of x_i from x increases. The local log-density is essentially the log-density weighted by the kernel function. We note that the maximization of the local average density is the same as the maximization of the local Shannon entropy or the minimization of the local Kullback–Leibler divergence from the uniform density. Furthermore, the maximization of the local log-density is the same as the minimization of the local squared errors

$$\sum_{i=1}^{n}(y_i - m(x))^2\,K\!\left(\frac{x_i - x}{h}\right) \tag{41}$$

with respect to m(x). The local Kullback–Leibler divergence estimator so obtained is

$$\hat{m}(x) = (\iota'K(x)\iota)^{-1}\iota'K(x)y, \tag{42}$$

where K(x) is an n × n diagonal matrix with diagonal elements K((x_i − x)/h), i = 1, ..., n, y is an n × 1 vector with elements y_i, and ι is an n × 1 vector of unit elements. The estimator $\hat{m}(x)$ is the well known Nadaraya–Watson kernel regression estimator or local constant least squares (LCLS) estimator. For details on the asymptotic and small sample properties of $\hat{m}(x)$, see Pagan and Ullah (1999).

A better alternative to the estimator in (42), in the smaller mean squared error sense, is the local linear Kullback–Leibler divergence estimator which fits local lines m(x_i) = m(x_i, δ(x)) = α(x) + x_iβ(x) = X_iδ(x) to the data y_i, x_i around the point x, where X_i = [1  x_i] and δ(x) = [α(x)  β′(x)]′. Such an estimator is obtained by maximizing

$$\frac{1}{nh^q}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)\left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - X_i\delta(x))^2\right] \tag{43}$$


or by minimizing $\sum K((x_i - x)/h)(y_i - X_i\delta(x))^2$ with respect to δ(x). This gives

$$\hat{\delta}(x) = (X'K(x)X)^{-1}X'K(x)y, \tag{44}$$

where X is an n × (q + 1) matrix with ith row X_i. The estimator of m(x) is now given by

$$\tilde{m}(x) = \hat{\alpha}(x) + x\hat{\beta}(x), \tag{45}$$

where $\hat{\alpha}(x) = (1\ \ 0)\hat{\delta}(x)$ and $\hat{\beta}(x) = (0\ \ 1)\hat{\delta}(x)$. For the properties of $\hat{\delta}(x)$, see Fan and Gijbels (1996) and Pagan and Ullah (1999).

We note that when h = ∞, K((x_i − x)/h) = K(0) is a constant, so that the local linear estimator in (44) becomes the global LS estimator of δ. Furthermore, the local estimator can be interpreted as the estimator of the varying coefficient model X_iδ(x_i), whereas the global estimator provides the estimation of the constant coefficient model X_iδ. An important point is that the local model X_iδ(x) does not specify any parametric form of the variation of δ(x). Examples of varying coefficient models where a parametric form of δ(x) is used include the functional coefficient autoregressive model (Chen and Tsay, 1993), the random coefficient models (see Raj and Ullah (1981)), and the threshold autoregressive model (Tong, 1990), among others.
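
A compact sketch of the local constant estimator (42) and the local linear estimator (44)–(45) is given below, assuming a Gaussian kernel and a single window width h; the simulated design and all names are our own.

```python
import numpy as np

def gauss_kernel(u):
    # product Gaussian kernel (normalizing constants cancel in the ratios below)
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1))

def lcls(x0, x, y, h):
    """Local constant (Nadaraya-Watson / LCLS) estimator (42) at the point x0."""
    k = gauss_kernel((x - x0) / h)                 # diagonal of K(x0)
    return k @ y / k.sum()

def local_linear(x0, x, y, h):
    """Local linear estimator (44)-(45) at the point x0."""
    k = gauss_kernel((x - x0) / h)
    X = np.column_stack([np.ones(len(x)), x])      # rows X_i = [1, x_i]
    XtK = X.T * k
    delta = np.linalg.solve(XtK @ X, XtK @ y)      # delta_hat(x0) = [alpha(x0), beta(x0)']'
    return delta[0] + x0 @ delta[1:]

# Example: m(x) = sin(x) with q = 1 regressor; both estimators evaluated at x0 = 0.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=200)
x0 = np.zeros(1)
print(lcls(x0, x, y, h=0.3), local_linear(x0, x, y, h=0.3))
```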

4.2. Hypothesis testing

Now let us consider the testing problem

$$H_0\colon\ E(y_i\,|\,x_i) = m(x_i, \delta), \qquad H_1\colon\ E(y_i\,|\,x_i) = m(x_i), \tag{46}$$

where, in the special case of linearity, m(x_i, δ) = X_iδ. For the above testing problem the conditional log-density functions of y under H₀ and H₁, respectively, are

$$\frac{1}{n}\sum_{i=1}^{n}\log f_0(y_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2n\sigma^2}\sum_{i=1}^{n}(y_i - m(x_i, \delta))^2 \tag{47}$$

and

$$\frac{1}{n}\sum_{i=1}^{n}\log f_1(y_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2n\sigma^2}\sum_{i=1}^{n}(y_i - m(x_i))^2. \tag{48}$$

Further, substituting the minimum Kullback–Leibler divergence estimator $m(x_i, \hat{\delta})$ and the local minimum Kullback–Leibler estimator $\tilde{m}(x_i)$ from (39) and (45), respectively, and defining the residual sums of squares RSS₀ and RSS₁ as

$$RSS_0 = \sum_{i=1}^{n}(y_i - m(x_i, \hat{\delta}))^2, \qquad RSS_1 = \sum_{i=1}^{n}(y_i - \tilde{m}(x_i))^2, \tag{49}$$

we can write the $\hat{D}(f_0, f_1)$ statistic for the testing problem in (46) as

$$\hat{D}(f_0, f_1) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{\hat{f}_1(y_i)}{\hat{f}_0(y_i)} = \frac{1}{2}\log\frac{RSS_0}{RSS_1} \simeq \frac{1}{2}\left[\frac{RSS_0 - RSS_1}{RSS_1}\right]. \tag{50}$$


The null hypothesis is rejected when $\hat{D}$ is large. We note that $\hat{D}$ is an F or LR type test statistic, and it compares the parametric residual sum of squares with the non-parametric residual sum of squares. The critical values of this statistic can be obtained by the bootstrap procedure in Li and Wang (1998). Further, the test statistic can be used for testing the exclusion of regressors, varying coefficients versus constant coefficients, and other specification problems. For example, in the case of testing for the exclusion of one of the regressors, both RSS₀ and RSS₁ will be non-parametric residual sums of squares: RSS₁ will be the non-parametric residual sum of squares based on the q + 1 variables in X_i, whereas RSS₀ will be the non-parametric residual sum of squares based on q variables in X_i. For the statistical properties of the mean and variance of RSS₁, see Ullah and Zindewalsh (1992).

We also note that the test statistic in (50) is essentially looking at the difference RSS₀ − RSS₁ (Ullah (1985)), which implies a test statistic for $Eu_{i0}^2 - Eu_{i1}^2 = 0$, where $u_{i0} = y_i - m(x_i, \delta)$ and $u_{i1} = y_i - m(x_i)$ are the parametric and non-parametric errors, respectively. But $Eu_{i1}^2 = Eu_{i0}^2 - E(m(x_i, \delta) - m(x_i))^2$, where we use the result that $E(u_{i0}(m(x_i) - m(x_i, \delta))) = E(u_{i0}E(u_{i0}\,|\,x_i)) = E[(E(u_{i0}\,|\,x_i))^2] = E(m(x_i, \delta) - m(x_i))^2$. Thus, we can also write $Eu_{i0}^2 - Eu_{i1}^2 = E(m(x_i, \delta) - m(x_i))^2 = E(u_{i0}E(u_{i0}\,|\,x_i))$. While our test statistic in (50) can be regarded as testing for $Eu_{i0}^2 - Eu_{i1}^2 = 0$, we note here that the alternative test statistic by Ait-Sahalia et al. (1994) is based on testing $E(m(x_i, \delta) - m(x_i))^2 = 0$, and the conditional moment test statistic by Li and Wang (1998) is obtained by testing $E(u_{i0}E(u_{i0}\,|\,x_i)) = 0$. It will be interesting to extensively analyze the properties of these alternative tests in a future study, although see an attempt in Lee and Ullah (2001).
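
The following sketch indicates how the statistic (50) might be computed in practice: RSS₀ comes from a global linear fit under H₀ and RSS₁ from the local linear estimator (the function local_linear from the sketch in Section 4.1 is reused). Bootstrap critical values along the lines of Li and Wang (1998) are not implemented here, and all names are ours.

```python
import numpy as np

def d_statistic(rss0, rss1):
    """Test statistic (50): 0.5*log(RSS0/RSS1), approximately 0.5*(RSS0 - RSS1)/RSS1."""
    return 0.5 * np.log(rss0 / rss1)

def rss_parametric(x, y):
    """RSS0 in (49) under H0: E(y|x) linear in x (fitted by least squares)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

def rss_nonparametric(x, y, h):
    """RSS1 in (49) from the local linear fit m_tilde(x_i); uses local_linear() sketched above."""
    fitted = np.array([local_linear(xi, x, y, h) for xi in x])
    return np.sum((y - fitted) ** 2)

# The statistic is large under misspecification; its null distribution would be
# obtained by bootstrapping along the lines of Li and Wang (1998), e.g.
# print(d_statistic(rss_parametric(x, y), rss_nonparametric(x, y, h=0.3)))
```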

5. Conclusion

In this study, we have provided a new result on the asymptotic expansion of the expectation of a general class of functions of a data vector which is non-normal and non-i.i.d. We indicate that this result can be used to develop the moments of a large class of econometric estimators and test statistics. Another outcome of our result is that it provides the asymptotic expansions of entropy and divergence functions. This result is used to provide a comparison between two approximate distributions of a t-test statistic in a dynamic model. The second part of the paper deals with results on non-parametric estimation and specification testing using the Kullback and Leibler divergence measure. A new specification test is proposed. It will be an interesting future study to compare the power and size of the proposed test with the alternative non-parametric tests in the literature.

Acknowledgements

The author acknowledges the financial support from the Academic Senate, UCR. He is grateful to Amos Golan and two referees for their constructive and helpful comments. He is also thankful to J. Galbraith, V. Zindewalsh, R. Atukorala and S. Mishra for discussions and help related to the subject matter of this paper.


References

Ait-Sahalia, Y., Bickel, P.J., Stoker, T.M., 1994. Goodness-of-fit tests for regression using kernel methods. Manuscript, University of Chicago.
Atukorala, R., 1999. The case of an information criterion for assessing asymptotic approximation in econometrics. Ph.D. Thesis, Monash University.
Chen, R., Tsay, R.S., 1993. Functional-coefficient autoregressive models. Journal of the American Statistical Association 88, 298–308.
Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley, New York.
Fan, J., Gijbels, I., 1996. Local Polynomial Modeling and its Applications. Chapman and Hall, London.
Golan, A., Judge, G.G., Miller, D.J., 1996. Maximum Entropy Econometrics: Robust Estimation with Limited Data. Wiley, New York.
Jaynes, E.T., 1957. Information theory and statistical mechanics. Physical Review 106, 620–630.
Kadane, J.B., 1971. Comparison of k-class estimators when the disturbances are small. Econometrica 39, 723–737.
Kapur, J.N., Kesavan, H.K., 1992. Entropy Optimization Principles with Applications. Academic Press, San Diego.
Kiviet, J.F., Philips, D.A., 1993. Alternative bias approximations in regressions with a lagged-dependent variable. Econometric Theory 9, 62–80.
Kullback, S., Leibler, R.A., 1951. On information and sufficiency. Annals of Mathematical Statistics 22, 79–86.
Lee, T., Ullah, A., 2001. Nonparametric bootstrap specification testing in econometric models. Manuscript, University of California, Riverside.
Li, Q., Wang, S., 1998. A simple consistent bootstrap test for a parametric regression function. Journal of Econometrics 87, 147–165.
Maasoumi, E., 1993. A compendium to information theory in economics and econometrics. Econometric Reviews 12, 137–181.
Nagar, A.L., 1959. The bias and moments matrix of the general k-class estimators of the parameters in the structural equations. Econometrica 27, 575–595.
Nankervis, J.C., Savin, N.E., 1988. The Student's t approximation in a stationary first order autoregressive model. Econometrica 56, 119–145.
Pagan, A., Ullah, A., 1999. Nonparametric Econometrics. Cambridge University Press, New York.
Raj, B., Ullah, A., 1981. Econometrics: A Varying Coefficient Approach. Croom Helm, London.
Shannon, C.E., 1948. The mathematical theory of communication. Bell System Technical Journal 27, 379–423.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Srivastava, V.K., Dwivedi, T.D., Belinsky, M., Tiwari, R., 1980. A numerical comparison of exact, large sample and small disturbance approximations of properties of k-class estimators. International Economic Review 21, 249–252.
Tong, H., 1990. Nonlinear Time Series: A Dynamic System Approach. Clarendon Press, Oxford.
Ullah, A., 1980. The exact, large-sample and small disturbance conditions of dominance of biased estimators in linear models. Economics Letters 6, 339–344.
Ullah, A., 1985. Specification analysis of econometric models. Journal of Quantitative Economics 1, 187–209.
Ullah, A., 1996. Entropy, divergence and distance measures with econometric applications. Journal of Statistical Planning and Inference 49, 137–162.
Ullah, A., Srivastava, V.K., 1994. Moments of the ratio of quadratic forms in nonnormal variables with econometric examples. Journal of Econometrics 62, 129–141.
Ullah, A., Zindewalsh, V., 1992. On the estimation of residual variance in nonparametric regression. Journal of Nonparametric Statistics 3, 263–265.
Ullah, A., Srivastava, V.K., Roy, N., 1995. Moments of the function of nonnormal vectors with econometric estimators and test statistics. Econometric Reviews 14, 459–471.