online supplementary material for grf proposal 2018: “new ...personal.cb.cityu.edu.hk › msawan...

49
Online Supplementary Material for GRF Proposal 2018: “New Developments on Frequentist Model Averaging in Statistics” Alan WAN (P.I.), Xinyu ZHANG (Co-I.), Guohua ZOU (Co-I.) Summary This document provides the proofs of the preliminary theoretical results of the captioned GRF Proposal. Note that the notations used in each part of this document pertain only to that part and may carry a different meaning from the same notations used in other parts of the document. Part 1: Frequentist model averaging for high-dimensional quantile regression pp. 1-15 Part 2: Statistical inference after model averaging pp. 15-27 Part 3: Frequentist model averaging under inequality constraints pp. 27-34 Part 4: Frequentist model averaging over kernel regression estimators pp. 34-48 1 Part 1: Frequentist model averaging for high- dimensional quantile regression 1.1 Model framework and covariate dimension reduction Let y be a scalar dependent variable, x =(x 1 ,x 2 ,...,x p ) T be a p-dimensional vector of covariates, where p can be infinite. Let D n = {(x i ,y i ),i =1,...,n} be independent and identically distributed (IID) copies of (x,y), where x i =(x i1 ,x i2 ,...,x ip ) T . Assume that the τ th (0 <τ< 1) conditional quantile of y is given by Q τ (y|x)= μ i (τ )= p X j =1 θ j (τ )x ij , (1.1)

Upload: others

Post on 07-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

Online Supplementary Material forGRF Proposal 2018:

“New Developments on Frequentist ModelAveraging in Statistics”

Alan WAN (P.I.),Xinyu ZHANG (Co-I.),

Guohua ZOU (Co-I.)

SummaryThis document provides the proofs of the preliminary theoretical results of the captioned GRFProposal. Note that the notations used in each part of this document pertain only to that partand may carry a different meaning from the same notations used in other parts of the document.

Part 1: Frequentist model averaging for high-dimensional quantile regression pp. 1-15Part 2: Statistical inference after model averaging pp. 15-27Part 3: Frequentist model averaging under inequality constraints pp. 27-34Part 4: Frequentist model averaging over kernel regression estimators pp. 34-48

1 Part 1: Frequentist model averaging for high-dimensional quantile regression

1.1 Model framework and covariate dimension reduction

Let y be a scalar dependent variable, x = (x1, x2, . . . , xp)T be a p-dimensional vector of

covariates, where p can be infinite. Let Dn = (xi, yi), i = 1, . . . , n be independent andidentically distributed (IID) copies of (x, y), where xi = (xi1, xi2, . . . , xip)

T . Assume that theτ th (0 < τ < 1) conditional quantile of y is given by

Qτ (y|x) = µi(τ) =

p∑j=1

θj(τ)xij, (1.1)

Page 2: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

where θj(τ) is an unknown coefficient vector. This leads to the following linear QR model:

yi =

p∑j=1

θj(τ)xij + εi(τ), (1.2)

where εi(τ) ≡ yi −∑p

j=1 θj(τ)xij is an unobservable IID error term such that P (εi(τ) ≤0|xi) = τ . For simplicity, we will write µi = µi(τ), εi = εi(τ) and θj = θj(τ).

Clearly, not all of xj’s are useful for predicting y. Model selection attempts to selecta subset of covariates in x and use them as the covariates in a single working model forall subsequent analysis and inference. Model averaging, in contrast, attempts to combineresults obtained from different models, each based on a distinct subset of covariates. Asdiscussed in the proposal, model averaging has the important advantage over model selectionin that it shields against the choice of a very poor model, and there exists ample evidencesuggesting that model averaging is the preferred approach from a sampling-theoretic pointof view. The situation in hand is complicated by the colossal number of plausible modelsdue to the high dimensionality of x. When estimating the conditional mean of y, Ando & Li(2014) encountered a similar problem, and proposed a covariate dimension reduction methodwhereby the covariates are clustered according to their magnitudes of marginal correlationswith y intoM+1 groups, and the group containing covariates with near zero correlations withy is dropped. This results in M candidate models, each being formed by regressing y on oneof the remaining M clusters of covariates that survive the screening step.

When the interest is in regression quantiles, Ando & Li’s (2014) ranking by marginalcorrelation approach does not provide the flexibility for different covariates to be selectedat different quantiles. Here, we adopt an alternative framework based on marginal quantileutility proposed by He et al. (2013) for covariate ranking and screening. Let Mτ = j :Qτ (y|x) functionlly depends on xj be the set of informative covariates at quantile level τ ,Qτ (y|xj) the τ th conditional quantile of y given xj , and Qτ (y) the τ th unconditional quantileof y. He et al.’s (2013) approach is based on the premise that qτ (y|xj) = Qτ (y|xj)−Qτ (y) = 0iff y and xj are independent. Thus, qτ (y|xj) may be taken as a measure of the utility ofxj , and the closer is qτ (y|xj) to zero, the less useful is xj in explaining y at quantile levelτ . Let F be the distribution function of of y. The τ th unconditional quantile of y is thusξ0(τ) = infy : F (y) ≥ τ or F−1(τ). As well, let the sample analog of F be Fn for asample of yi’s, i = 1, · · · , n. Correspondingly, the sample analog of F−1(τ) is F−1

n (τ), theτ th quantile of Fn. We let F−1

n (τ) be the estimator of ξ0(τ), and denote it as ξ(τ).

For the estimation of Qτ (y|xj), He et al. (2013) suggested a B-spline approach. Todescribe it, assume that xj takes on values within the interval [u, v], and Qτ (y|xj) belongsto the class of functions F defined under Condition 1.1 in Subsection 1.4.1. Let u = s0 <s1 < . . . < sk = v be a partition of [u, v], and use si’s as knots for constructing N = k + lnormalized B-spline basis functions B1(t), . . . , BN(t), such that ‖Bk(·)‖∞ ≤ 1, where ‖ · ‖∞denotes the sup norm. Write πππ(t) = (B1(t), . . . , BN(t))T , and assume that fj(t) ∈ F. Thenfj(t) can be approximated by a linear combination of the basis functions πππ(t)Tβββ for someβββ ∈ RN . Define βββj = argminβββ∈RN

∑ni=1 ρτ (yi − πππ(xij)

Tβββ), and use

fj(xj) = πππ(xj)T βββj − ξ(τ). (1.3)

as an estimator of qτ (y|xj). We expect fj(xj) to be close to zero if xj is independent of y.

2

Page 3: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

Thus, it makes sense to use

‖fj(xj)‖1n = n−1

n∑i=1

∣∣∣fj(xij)∣∣∣ = n−1

n∑i=1

∣∣∣πππ(xij)T βββj − ξ(τ)

∣∣∣as the basis for covariate screening and ranking. Hereafter, we refer to ‖fj(xj)‖1

n as themarginal quantile utility of xj .

Recall that the informative covariates are those that are useful for explaining y. Assumethat there are s (< p) informative covariates and let xT and xF be s × 1 and (p − s) × 1vectors containing the informative and non-informative covariates respectively. Without lossof generality, let xF correspond to the first s covariates in x. Thus, x = (xTT , xTF ).

Theorem 1.1. Under Conditions 1.1-1.8 in Subsection 1.4.1, we have

P

(maxj∈T c

∥∥∥fj(xj)∥∥∥1

n≥ min

j∈T

∥∥∥fj(xj)∥∥∥1

n

)→ 0 as n→∞,

where T = 1, . . . , s and T c = s+1, . . . , p are index sets corresponding to the informativeand non-informative covariates respectively.

Proof: See Subsection 1.4.2.

Remark 1.1. He et al. (2013) used ‖fj(xj)‖2n = n−1

∑ni=1 fj(xij)

2 as the basis for covariatescreening. Let νn be a predefined threshold value. The covariates that survive He et al.’s

(2013) screening procedure is contained in the subset Mτ = j ≥ 1 :∥∥∥fj(xj)∥∥∥2

n≥ νn.

He et al. (2013) proved that this procedure achieves the sure screening property (Fan & Lv,2008), i.e., at each τ ; P (T ⊂ Mτ ) → 1 as n → ∞, or the important covariates are almost

certainly retained as n increases to infinity. However, note that∥∥∥fj1(xj1)∥∥∥2

n≥∥∥∥fj2(xj2)∥∥∥2

n

does not always imply∥∥∥fj1(xj1)∥∥∥1

n≥∥∥∥fj2(xj2)∥∥∥1

n. Thus, while He et al.’s (2013) procedure is

useful for screening the covariates, the same cannot be said when the objective is to rank thecovariates.

Remark 1.2. Theorem 1.1 shows that based on our procedure, a more informative covariateis always ranked higher than a less informative covariate asymptotically. When the procedureis used for covariate screening, Theorem 1.1 implies that a more informative covariate cannotbe dropped while a less informative covariate is selected simultaneously.

He et al. (2013) recommend keeping the covariates corresponding to the top [n/log(n)]

values of ‖fj(xj)‖2n and dropping the rest, where [a] rounds a down to the nearest integer. We

adopt an analogous approach here by retaining only the subset of covariates that result in the[n/log(n)] highest values of ‖fj(xj)‖1

n.

1.2 Model averaging and a delete-one CV criterion

Having reduced the dimension of the covariates, our next step entails constructing the can-didate models. For each quantile level τ , we divide the [n/log(n)] covariates that survive

3

Page 4: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

the above screening step into M approximately equal-size clusters based on their estimatedmarginal quantile utilities, placing the most influential covariates in the first cluster, the leastinfluential in the M th cluster, and so on. This results in M candidates models, where the mth

model includes the mth cluster of covariates, m = 1, · · · ,M . This idea of grouping covari-ates follows that of Ando & Li (2014), with the only major differences being that Ando &Li (2014) focused on mean estimation and their ranking of covariates is based on marginalcorrelations.

Write the mth candidate model as

yi = xTi(m)Θ(m) + εi =km∑j=1

θj(m)xij(m) + εi, (1.4)

where km denotes the number of covariates, Θ(m) = (θ1(m), . . . , θkm(m))T ,xi(m) = (xi1(m), . . . , xikm(m))

T ,xij(m) is a covariate and θj(m) is the corresponding coefficient, j = 1, . . . , km. The τ th QRestimator of Θ(m) in the above model is

Θ(m) ≡ argminΘ(m)∈Rkm

n∑i=1

ρτ (yi − xTi(m)Θ(m)). (1.5)

Let εi(m) ≡ yi− ΘT(m)xi(m), w ≡ (w1, . . . , wM)T be a weight vector in the unit simplex of RM

andW ≡ w ∈ [0, 1]M : 0 ≤ wk ≤ 1. The model averaging estimator of µi is thus

µi(w) =M∑m=1

wmxTi(m)Θ(m). (1.6)

Unlike the weights in Lu & Su (2015), our weights are not subject to the conventionalrestriction of

∑Mm=1 wm = 1. As in Ando & Li (2014), the removal of this restriction in our

context is justified by the fact our M models are not equally competitive, given the manner inwhich the covariates are included in the different models.

We propose selecting w by the jackknife or leave-one-out cross validation criteriondescribed as follows.

For m = 1, . . . ,M , let Θi(m) be the jackknife estimator of Θ(m) in model m with theith observation deleted. Consider the leave-one-out CV criterion

CVn(w) =1

n

n∑i=1

ρτ

(yi −

M∑m=1

wmxTi(m)Θi(m)

). (1.7)

The jackknife weight vector w = (w1, . . . , wM)T is obtained by choosing w ∈ W such that

w = argminw∈WCVn(w). (1.8)

Substituting w for w in (1.6) results in the following jackknife model averaging (JMA) esti-mator of µi for high-dimensional quantile regression:

µi(w) =M∑m=1

wmxTi(m)Θ(m). (1.9)

The calculation of the weight vector w is a straightforward linear programming problem.

4

Page 5: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

1.3 Asymptotic property of estimator

This section is devoted to an investigation of the theoretical properties of the JMA. Specifi-cally, we show show that jackknife weight vector w is asymptotically optimal in the sense ofminimizing the following out-of-sample final prediction error:

FPEn(w) = E

[ρτ (y

∗ −M∑m=1

wmx∗Tm Θ(m)) | Dn

], (1.10)

where (y∗,x∗) is an independent copy of (y,x), x∗m = (x∗1(m), . . . , x∗km(m))

T and Dn =

(xi, yi), i = 1, . . . , n, x∗1(m), . . . , x∗km(m) are variables in x∗ that correspond to the km re-

gressors in the mth model.

Following the notations of Lu & Su (2015), let f(· | xi) and F (· | xi) denote theconditional probability density function (PDF) and cumulative distribution function (CDF) ofεi given xi respectively, and fy|x(· | xi) the conditional PDF of yi given xi. Consider thefollowing pseudo-true parameter

Θ∗(m) = argminΘ(m)∈RkmE[ρτ (yi − xTi(m)Θ(m))

]. (1.11)

Under Conditions 1.9-1.11 in Subsection 1.4.1, we can show that Θ∗(m) exists and is uniquedue to the convexity of the objective function. For m = 1, . . . ,M , define

A(m) = E[f(−ui(m) | xi)xi(m)x

Ti(m)

]= E

[fy|x(xTi(m)Θ

∗(m) | xi)xi(m)x

Ti(m)

], (1.12)

B(m) = E[ψτ (εi + ui(m))

2xi(m)xTi(m)

], (1.13)

andV(m) = A−1

m B(m)A−1m , (1.14)

where ui(m) = µi − xTi(m)Θ∗m is the approximation bias for the mth candidate model and

ψτ (εi) = τ − 1εi ≤ 0. Let k = max1≤m≤M km.

Theorem 1.2. Suppose Conditions 1.9-1.11 in Subsection 1.4.1 hold. Then w is asymptoti-cally optimal in the sense that

FPE(w)

infw∈W

FPE(w)= 1 + op(1). (1.15)

Proof: See Subsection 1.4.3.

Remark 1.3. Theorem 1.2 shows that the FMA estimator due to the jackknife weight vectoryields an out-of-sample final prediction error that is asymptotically identical to that of theFMA estimator that uses the infeasible optimal weight vector. This is similar to the result ofLu & Su (2015) who considered a low dimensional covariate setup. Note that Condition 1.11allows

∑Mm=1 km to go to infinity.

5

Page 6: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

1.4 Proofs of Theorems 1.1 and 1.2

1.4.1 Assumptions and Lemmas useful for proving Theorems 1.1 and 1.2

Our derivation of results requires the following regularity conditions from He et al. (2013).

Condition 1.1. Let F be the class of function defined on [0, 1] whose lth derivative satisfiesthe following Lipschitz condition of order c∗ : |f (l)(s) − f (l)(t)| ≤ c0|s − t|c∗ , for somepositive constant c0 and s, t ∈ [u, v], where l is a nonnegative integer and c∗ ∈ [0, 1] satisfiesd = l + c∗ > 1

2.

Condition 1.2. minj∈MτE(Qτ (y|xj) − Qτ (y))2 ≥ c1n−τ for some 0 ≤ τ < 2d

2d+1and some

positive constant c1.

Condition 1.3. The conditional density fy|xj(·) is bounded away from 0 and∞ on [Qτ (y|xj)−ξ, Qτ (y|xj) + ξ], for some ξ > 0, uniformly in xj.

Condition 1.4. The marginal density function gj of xj , 1 ≤ j ≤ p, are uniformly boundedaway from 0 and∞.

Condition 1.5. The number of basis functions N satisfies N−dnτ = o(1) and Nn2τ−1 = o(1)as n→∞.

See He et al. (2013) for an explanation of these assumptions. In addition, we need thefollowing assumptions.

Condition 1.6. minj∈T

1

n

∑ni=1

∣∣πππ(xij)Tβββ0j − ξ0(τ)

∣∣−maxj∈T c

1

n

∑ni=1

∣∣πππ(xij)Tβββ0j − ξ0(τ)

∣∣ ≥ c >

0, where c is a constant, and T c = s + 1, . . . , p and T = 1, . . . , s are non-signal indexset and signal index set, respectively.

Condition 1.7. (i) logp = o(n1−4τ ). (ii) 0 < τ < 1/4. (iii) N = o(nτ ).

Condition 1.8. (i) E[max1≤j≤p

∥∥∥βj − β0j

∥∥∥]4

<∞.

(ii) E[max1≤j≤p

(λmax

(1n

∑ni=1πππ(xij)πππ(xij)

T))]

<∞.

Condition 1.6 facilitates the construction of the approximate true structures of T andT c based on marginal quantile utilities. It is similar to the condition that underlies Lemma3.2 of Ando & Li (2014). Condition 1.7 imposes constraints on the number of samples, thedimension of the models and the number of B-spline basis. Part (i) of Condition 1.7 suggeststhat the order of p is smaller than the exponent of n1−4τ . Part (ii) of Condition 1.7 can be foundin Theorem 3.3 of He et al. (2013), which allows the handling of ultra-high dimensionality,that is, p is allowed to grow at the exponential rate with respect to n1−4τ . Condition 1.8 placesrestrictions on βj and B-spline basis.

We also require the following uniformly integrable assumptions:

Condition 1.9. (i) (yi,xi), i = 1, . . . , n, are IID such that (1.2) holds.

(ii) P (εi(τ) ≤ 0 | xi) = τ a.s.

(iii)E(µ4i ) <∞ and supj≥1E(x8

ij) ≤ cx for some cx <∞.

6

Page 7: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

Condition 1.10. (i)fy|x(· | xi) is bounded above by a finite constant cf and continuous overits support a.s.

(ii) There exist constants cA(m)and cA(m)

that may depend on km such that 0 < cAm ≤λmin(A(m)) ≤ λmax(A(m)) ≤ cfλmax(E[xi(m)x

Ti(m)]) ≤ cA(m)

<∞.

(iii) There exist constants cB(m)and cB(m)

that may depend on km such that 0 < cB(m)≤

λmin(B(m)) ≤ λmax(B(m)) ≤ cB(m)<∞.

(iv) (cA(m)+ cB(m)

)/km = O(c2A(m)

).

Condition 1.11. Let cA = min1≤m≤M cA(m), cB = min1≤m≤M cB(m)

, cA = max1≤m≤M cA(m),

cA = max1≤m≤M cB(m).

(i) As n→∞,M2k2logn/(n0.5cB)→ 0 and k

4(logn)4/(ncB)2)→ 0.

(ii) nMn−0.5L2kc3A/(cAcB) = o(1) for a sufficiently large constant L.

Conditions 1.9-1.11 are regular conditions on quantile regression taken from Lu & Su(2015), except for the assumption M2k

2logn/(n0.5cB)→ 0 (n→∞), which is similar to the

condition (6) in Ando & Li (2014) and can be reduced to Mk

n0.25/logn → 0 (n → ∞). It allowsM to go to infinity.

A main difficulty of our proof of theorems is that βββj , being the solution to a non-smoothobjective function, has no closed form expression. We make use of the following lemmas toovercome this difficulty.

Following He et al. (2013), let us define, for a given (y0, x0), where x0 = (x01, . . . , x0p),

βββ0j = argminβββ∈RNE[ρτ (y0 − πππ(x0j)

Tβββ)− ρτ (y0)].

Lemma 1.1. This is the same lemma as Lemma 3.2 of He et al. (2013). Let Conditions 1.1-1.5be satisfied. For any C > 0, there exist positive constants c2 and c3 such that

P ( max1≤j≤p

∥∥∥βj − β0j

∥∥∥ ≥ CN12n−τ ) ≤ 2p exp(−c2n

1−4τ ) + p exp(−c3N−2n1−2τ ).

Proof: See the proof of Lemma 3.2 in He et al. (2013).

Lemma 1.2. This Lemma is taken from Section 2.3.1 in Serfling (1980). Let 0 < τ < 1, ifξ0(τ) is the unique solution of T c(y−) ≤ τ ≤ T c(y), then

ξ(τ)→ ξ0(τ) with probability 1.

Proof: See the proof of theorem of Section 2.3.1 in Serfling (1980).

Remark 1.4. The above result asserts that ξ(τ) is strongly consistent for estimation of ξ0(τ),under mild restrictions on T c in the neighborhood of ξ0(τ).

Lemma 1.3. This lemma is taken from Lu & Su (2015). Suppose that Conditions 1.9-1.11hold. Let C(m) denote an lm × km matrix such that C0 = limn→∞C(m)C

T(m) exists and is

positive definite, where lm ∈ [1, km] is a fixed integer. Then

7

Page 8: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

(i) ‖Θ(m) −Θ∗(m)‖ = Op(√

kmn

);

(ii)√nC(m)V

−1/2(m) [Θ(m) −Θ∗(m)]

d→ N(0, C0).

Proof: See the proof of Theorem 3.1 in Lu & Su (2015).

Lemma 1.4. This lemma is taken from Lu & Su (2015). Suppose that Conditions 1.9-1.11hold. Then

(i) max1≤i≤nmax1≤m≤M‖Θi(m) −Θ∗(m)‖ = Op(√n−1klogn);

(ii) max1≤m≤M‖Θ(m) −Θ∗(m)‖ = Op(√n−1klogn).

Proof: See the proof of Theorem 3.2 in Lu & Su (2015).

1.4.2 Proof of Theorem 1.1

By Lemma 1.1 and Condition 1.7, for ∀ε > 0, we have

N−2n4τP

(max1≤j≤p

∥∥∥βj − β0j

∥∥∥N− 12nτ ≥ ε

)≤ N−2n4τp(2 exp(−c2n

1−4τ ) + exp(−c3N−2n1−2τ ))→ 0 as n→∞. (1.16)

Also, by Cauchy-Schwarz’s inequality,

E

[max1≤j≤p

∥∥∥βj − β0j

∥∥∥N− 12nτ]2

= E

[max1≤j≤p

∥∥∥βj − β0j

∥∥∥N− 12nτ1

(max1≤j≤p

∥∥∥βj − β0j

∥∥∥N− 12nτ ≥ ε

)]2

+E

[max1≤j≤p

∥∥∥βj − β0j

∥∥∥N− 12nτ1

(max1≤j≤p

∥∥∥βj − β0j

∥∥∥N− 12nτ < ε

)]2

[E

(max1≤j≤p

∥∥∥βj − β0j

∥∥∥)4] 1

2 [N−2n4τP

(max1≤j≤p

∥∥∥βj − β0j

∥∥∥N− 12nτ ≥ ε

)] 12

+ε2P

(max1≤j≤p

∥∥∥βj − β0j

∥∥∥N− 12nτ < ε

)→ 0 as n→∞. (1.17)

8

Page 9: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

Furthermore, recognising that ‖fj(xj)‖1n = n−1

∑ni=1

∣∣∣πππ(xij)T βββj − ξ(τ)

∣∣∣, we have

P

(maxj∈T c‖fj(xj)‖1

n ≥ minj∈T‖fj(xj)‖1

n

)=P

(maxj∈T c

1

n

n∑i=1

∣∣∣πππ(xij)T βββj − ξ(τ)

∣∣∣ ≥ minj∈T

1

n

n∑i=1

∣∣∣πππ(xij)T βββj − ξ(τ)

∣∣∣)

=P

(maxj∈T c

1

n

n∑i=1

∣∣∣πππ(xij)T (βββj − βββ0j) + πππ(xij)

Tβββ0j − ξ0(τ) + ξ0(τ)− ξ(τ)∣∣∣

≥ minj∈T

1

n

n∑i=1

∣∣∣πππ(xij)T (βββj − βββ0j) + πππ(xij)

Tβββ0j − ξ0(τ) + ξ0(τ)− ξ(τ)∣∣∣)

≤P

(maxj∈T c

1

n

n∑i=1

∣∣∣πππ(xij)T (βββj − βββ0j)

∣∣∣+ maxj∈T c

1

n

n∑i=1

∣∣πππ(xij)T (βββ0j − ξ0(τ))

∣∣+∣∣∣ξ0(τ)− ξ(τ)

∣∣∣≥ −min

j∈T

1

n

n∑i=1

∣∣∣πππ(xij)T (βββj − βββ0j)

∣∣∣+ minj∈T

1

n

n∑i=1

∣∣πππ(xij)T (βββ0j − ξ0(τ))

∣∣+∣∣∣ξ0(τ)− ξ(τ)

∣∣∣) .By triangle inequality and Markov’s inequality, we have

P

(maxj∈T c‖fj(xj)‖1

n ≥ minj∈T‖fj(xj)‖1

n

)≤P

(maxj∈T c

1

n

n∑i=1

∣∣∣πππ(xij)T (βββj − βββ0j)

∣∣∣+ minj∈T

1

n

n∑i=1

∣∣∣πππ(xij)T (βββj − βββ0j)

∣∣∣+ 2∣∣∣ξ0(τ)− ξ(τ)

∣∣∣≥ min

j∈T

1

n

n∑i=1

∣∣πππ(xij)T (βββ0j − ξ0(τ))

∣∣−maxj∈T c

1

n

n∑i=1

∣∣πππ(xij)T (βββ0j − ξ0(τ))

∣∣)

≤E[maxj∈T c

1n

∑ni=1

∣∣∣πππ(xij)T (βββj − βββ0j)

∣∣∣+ minj∈T1n

∑ni=1

∣∣∣πππ(xij)T (βββj − βββ0j)

∣∣∣+ 2∣∣∣ξ0(τ)− ξ(τ)

∣∣∣]minj∈T

1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))| −maxj∈T c

1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))|

≤E[2 max1≤j≤p

1n

∑ni=1

∣∣∣πππ(xij)T (βββj − βββ0j)

∣∣∣+ 2∣∣∣ξ0(τ)− ξ(τ)

∣∣∣]minj∈T

1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))| −maxj∈T c

1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))|

.

By Lemma 1.2 and Theorem 2(c) of Ferguson (1996),

E∣∣∣ξ0(τ)− ξ(τ)

∣∣∣ = o(1) as n→∞.

Also, noting that[1

n

n∑i=1

∣∣∣πππ(xij)T (βββj − βββ0j)

∣∣∣]2× 12

≤∥∥∥βββj − βββ0j

∥∥∥λ 12max

(1

n

n∑i=1

πππ(xij)πππ(xij)T

),

9

Page 10: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

then we obtain

P

(maxj∈T c‖fj(xj)‖1

n ≥ minj∈T‖fj(xj)‖1

n

)

≤E[2 max1≤j≤p

∥∥∥βββj − βββ0j

∥∥∥ ·max1≤j≤p λ1/2max

(1n

∑ni=1πππ(xij)πππ(xij)

T)

+∣∣∣ξ0(τ)− ξ(τ)

∣∣∣]minj∈T

1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))| −maxj∈T c

1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))|

=E[2 max1≤j≤p

∥∥∥βββj − βββ0j

∥∥∥ ·max1≤j≤p λ1/2max

(1n

∑ni=1πππ(xij)πππ(xij)

T)]

+ o(1)

minj∈T1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))| −maxj∈T c

1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))|

≤2

[E(

max1≤j≤p

∥∥∥βββj − βββ0j

∥∥∥)2]1/2

·[E(max1≤j≤p λmax

(1n

∑ni=1πππ(xij)πππ(xij)

T))]1/2

+ o(1)

minj∈T1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))| −maxj∈T c

1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))|

Finally, by (1.17) and Conditions 1.6, 1.8 and 1.7 (iii), we have

P

(maxj∈T c‖fj(xj)‖1

n ≥ minj∈T‖fj(xj)‖1

n

)=

o(N1/2n−τ ) ·[E(max1≤j≤p λmax

(1n

∑ni=1πππ(xij)πππ(xij)

T))]1/2

+ o(1)

minj∈T1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))| −maxj∈T c

1n

∑ni=1 |πππ(xij)T (βββ0j − ξ0(τ))|

→0 as n→∞.

This completes the proof of Theorem 1.1.

1.4.3 Proof of Theorem 1.2

Similar to the proof of Theorem 3.3 in Lu & Su (2015), we have to show

supw∈W

∣∣∣∣CVn(w)− FPEn(w)

FPEn(w)

∣∣∣∣ = op(1) (1.18)

in the weight spaceW .

Based on the proof of Theorem 3.3 in Lu & Su (2015), we have

CVn(w)− FPEn(w)

= CV1n(w) + CV2n(w) + CV3n(w) + CV4n(w) + CV5n,

10

Page 11: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

where

CV1n(w) =1

n

n∑i=1

[µi −

M∑m=1

wmxTi(m)Θi(m)

]ψτ (εi),

CV2n(w) =1

n

n∑i=1

∫ ∑Mm=1 wmxT

i(m)Θi(m)−µi

0

[1εi ≤ s − 1εi ≤ 0 − F (s|xi) + F (0|xi)]ds,

CV3n(w) =1

n

n∑i=1

∫ ∑Mm=1 wmxT

i(m)Θi(m)−µi

0

[F (s|xi) + F (0|xi)]ds

−Exi

[∫ ∑Mm=1 wmxT

i(m)Θi(m)−µi

0

[F (s|xi) + F (0|xi)]ds

],

CV4n(w) =1

n

n∑i=1

Exi

∫ ∑Mm=1 wmxT

i(m)Θi(m)−µi

0

[F (s|xi) + F (0|xi)]ds

−Exi

[∫ ∑Mm=1 wmxT

i(m)Θ(m)−µi

0

[F (s|xi) + F (0|xi)]ds

],

and

CV5n =1

n

n∑i=1

ρτ(εi) − E[ρτ(εi)].

To prove (1.18), we need to prove

(i)minw∈WFPEn(w) ≥ E[ρτ (ε)]− op(1);

(ii)supw∈W |CV1n(w)| = op(1);

(iii)supw∈W |CV2n(w)| = op(1);

(iv)supw∈W |CV3n(w)| = op(1);

(v)supw∈W |CV4n(w)| = op(1);

(vi)CV5n = op(1).

The proofs of (i) and (vi) can be found in Lu & Su (2015). In the following, we prove (ii)-(v).From Lu & Su (2015), it is shown that

CV1n(w) = CV1n,1(w)− CV1n,2(w),

where

CV1n,1(w) =1

n

n∑i=1

[µi −

M∑m=1

wmxTi(m)Θ∗(m)

]ψτ (εi)

and

CV1n,2(w) =1

n

n∑i=1

µi −

M∑m=1

wmxTi(m)

[Θi(m) −Θ∗(m)

]ψτ (εi).

To prove (ii), it suffices to show that supw∈W |CV1n,1(w)| = op(1) and supw∈W |CV1n,2(w)| =op(1).Recognising that E[CV1n,1(w)] = 0 and Var(CV1n,1(w)) = O(k/n), we haveCV1n,1(w) =

11

Page 12: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

op(1) for each w ∈ W . When both M and k = max1≤m≤Mkm are finite, based on the proof ofTheorem 3.3 in Lu & Su (2015), we can apply the Glivenko-Cantelli Theorem (e.g., Theorem2.4.1 in van der Vaart & Wellner (1996)) to show that supw∈W |CV1n,1(w)| = op(1). FollowingLu & Su (2015), define the class of functions

G = g(·, ·; w) : w ∈ W,

where g(·, ·; w) : R× Rdx → R is

g(εi,xi; w) =1

n

n∑i=1

[µi −

M∑m=1

wmxTi(m)Θ∗(m)

]ψτ (εi).

Define the metric |·|1 onW where |w−w|1 =∑M

m=1 |wm−wm|, for any w = (w1, . . . , wM) ∈W and w = (w1, . . . , wM) ∈ W . The ε-covering number ofW in high-dimensional quantilemodel averaging isN (ε,W , | · |1) = O(1/εM). Lu & Su (2015) has proved that |g(εi,xi; w)−g(εi,xi; w)| ≤ cΘ|w −w|1 max1≤m≤M ‖xi(m)‖, where cΘ = max1≤m≤M ‖Θ∗(m)‖ = O(k

1/2),

and that E max1≤m≤M ‖xi(m)‖ < ∞ when M and k are finite implies that the ε-bracketingnumber of G with respect to the L1(P )-norm is given by N[ ](ε,G, L1(P )) ≤ C/εM for somefinite C. Thus, applying Theorem 2.4.1 in van der Vaart & Wellner (1996), we can concludethat G is Glivenko-Cantelli.

When either M → ∞ or k → ∞ as n → ∞, let hn = 1/(klogn) and introducegrids using regions of the form Wj = w : |w − wj|1 ≤ hn such that W is covered withN = O(1/hMn ) regions Wj, j = 1, . . . , N . Lu & Su (2015) proved that

supw∈W |CV1n,1(w)| ≤ max1≤j≤N

|CV1n,1(wj)|+ op(1)

and for any ε > 0,

P

(max

1≤j≤NCV1n,1(wj) ≥ 2ε

)≤ P

(max

1≤m≤M

1

n

n∑i=1

|bi(m)| · 1bi(m) ≤ en ≥ ε

)

+ P

(max

1≤m≤M

1

n

n∑i=1

|bi(m)| · 1|bi(m)| ≥ en ≥ ε

)= Tn1 + Tn2(say),

where en = (Mnk2)1/4 and bi(m) = µi − xTi(m)Θ

∗(m).

From Lu & Su (2015), Tn2 = o(1) and

Tn1 ≤ 2 exp

(− nε2

2kσ2 + 2ε(Mnk2)1/4/3

+ logM

),

where σ2 <∞ is a constant.

12

Page 13: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

Thus, we only have to establish

exp

(− nε2

2kσ2 + 2ε(Mnk2)1/4/3

+ logM

)= o(1). (1.19)

Now, using Part (i) of Condition 1.11, we can obtain n3/[k2M(logM)4]→∞. In the follow-

ing, we let c and c′ be constants. By

6kσ2 2ε(Mnk2)1/4,[

(3nε2)

(4ε(Mnk2)1/2)4

]4

=34n3ε2

44Mk2 ≥

cn3

k2M(logM)4

→∞,[6kσ2logM

4ε(Mnk2)1/4

]≤[c′(kM)1/2

(Mn)1/4

]2

=c′2kM1/2

n1/2≤ c′2M

2k2logn

n1/2→ 0,

logM < logn n1/2/(M2k2) n3/(Mk

2)→∞,

we have

exp

(− nε2

2kσ2 + 2ε(Mnk2)1/4/3

+ logM

)

= exp

(−3nε2 − 6kσ2logM − 2ε(Mnk

2)1/4logM

6kσ2 + 2ε(Mnk2)1/4

)

= exp

[3nε2

4ε(Mnk2)1/4− 6kσ2logM

4ε(Mnk2)1/4− 2ε(Mnk

2)1/4logM

4ε(Mnk2)1/4

]→ 0. (1.20)

By Part (i) of Condition 1.11, Lemma 1.4 in Subsection 1.4.1 and the triangle inequality,

supw∈W|CV1n,2(w)| ≤ sup

w∈W

M∑m=1

wm1

n

n∑i=1

|xTi(m)[Θi(m) −Θ∗(m)]ψτ (εi)|

≤ M max1≤i≤n

‖Θi(m) −Θ∗(m)‖ max1≤m≤M

1

n

n∑i=1

‖xTi(m)‖

≤ Op(n−1/2Mk

1/2(logn)1/2)Op(k

1/2) = op(1). (1.21)

Hence supw∈W |CV1n,2(w)| = op(1).

Next, we prove (iii). Lu & Su (2015) have proved that

CV2n(w) = CV2n,1(w) + CV2n,2(w),

where

CV2n,1(w) =1

n

n∑i=1

∫ ∑Mm=1 wmxT

i(m)Θ∗

(m)−µi

0

[1εi ≤ s − 1εi ≤ 0 − F (s|xi) + F (0|xi)]ds

13

Page 14: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

and

CV2n,2(w) =1

n

n∑i=1

∫ ∑Mm=1 wmxT

i(m)Θi(m)−µi

∑Mm=1 wmxT

i(m)Θ∗

(m)−µi

[1εi ≤ s − 1εi ≤ 0 − F (s|xi) + F (0|xi)]ds.

We have to show that supw∈W |CV2n,1(w)| = op(1) and supw∈W |CV2n,2(w)| = op(1).Recog-nising the fact that 1εi ≤ s − 1εi ≤ 0 − F (s|xi) + F (0|xi) ≤ 2, Lemma 1.4 and Part (i)of Condition 2.3, we have

CV2n,2(w) ≤ 2

n

n∑i=1

|M∑m=1

wmxTi(m)[Θi(m) −Θ∗(m)]|

≤ 2M max1≤i≤n

‖Θi(m) −Θ∗(m)‖ max1≤m≤M

1

n

n∑i=1

‖xTi(m)‖

= Op(Mn−1/2k1/2

(logn)1/2)Op(k1/2

) = op(1). (1.22)

The proof of supw∈W |CV2n,1(w)| = op(1) is analogous to that of supw∈W |CV1n,1(w)| =op(1) and thus omitted.

Now, let us establish (iv). Lu & Su (2015) have proved that

CV3n(w) = CV3n,1(w) + CV3n,2(w)

andCV3n,2(w) ≤ CV3n,21(w) + CV3n,22(w),

where

CV3n,1(w) =1

n

n∑i=1

∫ ∑Mm=1 wmxT

i(m)Θ∗

(m)−µi

0

[F (s|xi) + F (0|xi)]ds

−E

[∫ ∑Mm=1 wmxT

i(m)Θ∗

(m)−µi

0

[F (s|xi) + F (0|xi)]ds

],

CV3n,2(w) =1

n

n∑i=1

∫ ∑Mm=1 wmxT

i(m)Θi(m)−µi

∑Mm=1 wmxT

i(m)Θ∗

(m)−µi

[F (s|xi) + F (0|xi)]ds ,

−E

[∫ ∑Mm=1 wmxT

i(m)Θi(m)−µi

∑Mm=1 wmxT

i(m)Θ∗

(m)−µi

[F (s|xi) + F (0|xi)]ds

],

CV3n,21(w) =1

n

n∑i=1

∣∣∣∣∣M∑m=1

wmxTi(m)

[Θi(m) −Θ∗(m)

]∣∣∣∣∣ , and

CV3n,22(w) =1

n

n∑i=1

Exi

∣∣∣∣∣M∑m=1

wmxTi(m)

[Θi(m) −Θ∗(m)

]∣∣∣∣∣ .We have to show

supw∈W|CV3n,1(w)| = op(1),

supw∈W|CV3n,21(w)| = op(1),

14

Page 15: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

andsupw∈W|CV3n,22(w)| = op(1).

we can prove that supw∈W |CV3n,1(w)| = op(1), analogous to the proof of supw∈W |CV1n,1(w)| =op(1). Also, from (1.22), supw∈W |CV3n,21(w)| = op(1). By the triangle and Cauchy-Schwarzinequalities, the fact that ATBA ≤ λmax(B)ATA for any symmetric matrix B, Lemma 1.4and Condition 2.3(i), we have

supw∈W

CV3n,22(w) ≤ supw∈W

1

n

n∑i=1

|M∑m=1

wmExi |xTi(m)[Θi(m) −Θ∗(m)]|

≤ supw∈W

1

n

n∑i=1

|M∑m=1

wm[Θi(m) −Θ∗(m)]TE[xi(m)x

Ti(m)][Θi(m) −Θ∗(m)]

12

≤ M max1≤m≤M

[λmax(Exi |xTi(m))]1/2 max

1≤i≤nmax

1≤m≤M‖Θi(m) −Θ∗(m)‖

= op(1) (1.23)

To prove (v), noting that |F (s|xi)−F (0|xi)| ≤ 1 and by equation (1.23), we can showthat

supw∈W

CV4n(w) ≤ supw∈W

1

n

n∑i=1

E[xi

∣∣∣∣∣M∑m=1

wmxTi(m)[Θi(m) − Θ(m)]

∣∣∣∣∣ = op(1).

This completes the proof.

2 Part 2: Statistical inference after model averaging

2.1 Model framework and notations

Let y be generated from the densityf(y,β,γ) (2.1)

on Ω, a measurable Euclidean space, where β and γ are q1× 1 and q2× 1 vectors of unknownparameters. We consider inference on µ = µ(f) when the unknowns are estimated by modelaveraging. Hjort & Claeskens (2003) considered the local misspecification framework

ftrue(y) = fn(y) = f(y,β0,γ0 + δ/√n),

where β0 is the true value of β, γ0 is fixed and known, and δ = (δ1, . . . , δq2) contains pa-rameters that signify the degrees of the model departures in directions 1, ..., q2. As discussedin the proposal, the local misspecification framework has the advantage of simplifying theasymptotic analysis but its realism has been subject to criticism. Here, we study the asymp-totic distribution of model averaging estimators under the general fixed parameter setup (2.1)without invoking local misspecification.

15

Page 16: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

Let θ = (θ1, . . . , θq)T = (βT ,γT )T and θ0 denote the true value of θ, with θ ⊂ Θ ⊂

Rq. Define the likelihood function

Ln(θ) =n∏t=1

f(yt,θ),

and the log likelihood function ln(θ) = lnLn(θ). It is assumed that the first and second partialderivatives of f(y,θ) with respect to θ exist. Let

Ψ(y,θ) = ∂ ln f(y,θ)/∂θ,

andΨ(y,θ) = ∂2 ln f(y,θ)/∂θ∂θT .

Then the first and second partial derivatives of ln(θ) with respect to θ are ln(θ) =∑n

t=1 Ψ(yt,θ)

and ln(θ) =∑n

t=1 Ψ(yt,θ) respectively, and the Fisher Information matrix is

F(θ) = EθΨ(y,θ)Ψ(y,θ)T .

We combine M sub-models within (2.1). Denote km as the number of parameters inthemth sub-model. It is assumed that a sub-model contains all q1 elements in β and some of theq2 elements in γ. If all possible combinations of elements in γ are considered, then M = 2q2;if only nested candidate models are considered, then M = q2 + 1. Let S = 1, . . . , q2,Sm = i1, · · · , ikm−q1 ⊂ S, Scm = ikm−q1+1, · · · , iq2, the complement of Sm, and γj bethe j th element of γ. Write γSm = (γi1 , · · · , γikm−q1 )T , γScm = (γikm−q1+1

, · · · , γiq2 )T , andθm = (βT ,γTm)T , where γm = (γ1, γ2, · · · , γq2)T such that γScm = 0. Define the permutationmatrix

Πm = (eT1 , · · · , eTq1 , eTq1+i1

, · · · , eTq1+ikm−q1, eTq1+ikm−q1+1

, · · · , eTq1+iq2)T ,

where ej is a unit vector with the j th element being 1 and all other elements zero. Then we canwrite θ′m = Πmθm = (βT ,γTSm ,0

T )T = (θTSm ,0T )T , with θSm = (βT ,γTSm)T . Similarly, we

have θ0,m = Πmθ0 = (θT0,Sm ,γT0,Scm

)T , where θ0,Sm contains the first km elements of θ0,m andγ0,Scm contains the remaining q − km elements.

Denote

Fm(θ0,m) = Eθ0,mΨ(y,θ0,m)Ψ(y,θ0,m)T = ΠmF(θ0)ΠTm.

Assuming that partial derivatives exist, write

Ψm(y,θSm) = ∂ ln f(y,θm)/∂θSm ,

Ψm(y,θSm) = ∂2 ln f(y,θm)/∂θSmθTSm ,

Am,n(θSm) = n−1

n∑t=1

Ψm(yt,θSm),

Bm,n(θSm) = n−1

n∑t=1

Ψm(yt,θSm)Ψm(yt,θSm)T ,

16

Page 17: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

lm,n(θSm) =n∑t=1

Ψm(yt,θSm), and

lm,n(θSm) =n∑t=1

Ψm(yt,θSm).

Assuming that expectations exist, write

Am(θSm) = E(

Ψm(y,θSm)), and

Bm(θSm) = E(Ψm(y,θSm)Ψm(y,θSm)T

).

Assuming that appropriate inverses exist, write

Cm,n(θSm) = Am,n(θSm)−1Bm,n(θSm)Am,n(θSm)−1, andCm(θSm) = Am(θSm)−1Bm(θSm)Am(θSm)−1.

The Fisher Information matrix of θ0,Sm is given by

Fm(θ0,Sm) = Eθ0,SmΨm(y,θ0,Sm)Ψm(y,θ0,Sm)T .

Denote θSm as the Maximum Likelihood estimator of θSm under the mth sub-model,which is the solution of the log-likelihood equation

∑nt=1 Ψm(y,θSm) = 0. Define θ′m =

(θTSm ,0T )T , where θSm is an estimator of θ∗Sm , the parameter vector which minimizes the

Kullback-Leibler Information Criterion (KLIC),

I (f(y,θ0) : f(y,θm),θ′m) = E

(ln

[f(y,θ0)

f(y,θm)

]). (2.2)

By writing θ∗′m = Πmθ∗m = (θ∗TSm ,0

T )T , we can obtain θ∗m. The parameters in θ∗m follow thesame order as in θ0. Taking expectations with respect to the true distribution, we have

I (f(y,θ0) : f(y,θm),θ′m) =

∫f(y,θ0) ln f(y,θ0)dν −

∫f(y,θ0) ln f(y,θm)dν. (2.3)

Define θm as the Maximum Likelihood estimator of θm under themth sub-model, so the modelaveraging estimator of θ is θ(w) =

∑Mm=1wmθm, where wm is weight for the mth sub-model

and w = (w1, . . . , wM)T , belonging to weight setW = w ∈ [0, 1]M :∑M

m=1wm = 1.Now, the AIC and BIC scores under the mth candidate model are:

AICm = −2∑n

t=1 ln f(yt, θm) + 2km and BICm = −2∑n

t=1 ln f(yt, θm) + km lnn

respectively. The model weights based on these scores are given by

wxIC,m = exp(−xICm/2)/M∑m=1

exp(−xICm/2), m = 1, . . . ,M, (2.4)

17

Page 18: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

where xICm is the AIC or BIC score from the mth sub-model. The model average estimatorsresulting from these weights are commonly referred to as the S-AIC or S-BIC estimators,defined as

θ(wAIC) =M∑m=1

wAIC,mθm, and

θ(wBIC) =M∑m=1

wBIC,mθm (2.5)

respectively. The regularity conditions required for the asymptotic results are as follows. Alllimiting processes presented here are with respect to n → ∞ and the notations d−→, a.s.−→and

p−→ denote convergence in distribution, almost surely and in probability, respectively.We denote θj(wAIC), θj(wBIC) and θj,0 as the j th component of θ(wAIC), θ(wBIC) and θ0

respectively.

Condition 2.1. The density f(y,θ) is measurable in y for every θ in Θ, a compact subset ofRq, continuous in θ for every y in Ω, a measurable Euclidean space, and the true parameterpoint θ0 is identifiable.

Condition 2.2. (a) E(ln f(yt)) exists and |ln f(y,θm)| ≤ Km(y) for all θ in Θ, where Km(y)is integrable with respect to dν;

(b) I(f(y,θ0,m) : f(y,θm),θ′m) has a unique minimum at θ∗m in Θm.

Condition 2.3. (a) ∂ ln f(y,θm)/∂θi, i = 1, . . . , km, are bounded in absolute value by afunction integrable with respect to dν uniformly in some neighborhood of θ0.

(b) The second partial derivative of f(y,θm) with respect to θm exists and is continuous forall y, and may be passed under the integral sign in

∫f(y,θm)dν.

Condition 2.4. |∂2 ln f(y,θm)/∂θi∂θj| and |∂ ln f(y,θm)/∂θi · ∂ ln f(y,θm)/∂θj| , i, j =1, . . . , km are dominated by functions integrable with respect to dν for all y in Ω and θm inΘm.

Condition 2.5. (a) θ∗Sm is in the interior of Θm; (b) B(θ∗Sm) is nonsingular; (c) θ∗Sm is aregular point of Am(θSm), defined as the value for θSm such that Am(θSm) has a constantrank in some open neighborhood of θSm .

Condition 2.6. The Fisher Information F(θ0) is a positive definite matrix.

Condition 2.7. The derivatives |∂[∂f(y,θ)/θi · f(y,θ)]/∂θj|, i, j = 1, · · · , q, are dominatedby functions integrable with respect to ν for all θ in Θ, and the minimal support of f(y,θ)does not depend on θ.

Remark 2.1. Condition 2.2 ensures that the KLIC is well-defined. Condition 2.3(a) allowsus to apply the Uniform Law of Large Numbers. Condition 2.3(b) ensures that the first twoderivatives with respect to θSm exist. This condition allows us to apply Taylor’s Theoremand Mean Value Theorem for random functions. Condition 2.4 ensures that the derivativesare appropriately dominated by functions integrable with respect to dν. This in turns ensuresthat Am(θSm) and Bm(θSm) are continuous in θSm , and that we can apply the Uniform Lawof Large Numbers to Am,n(θSm) and Bm,n(θSm). These assumptions are adopted from White(1982) and Ferguson (1996).

18

Page 19: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

2.2 Main results

Assume that the mtho model is the true model, i.e., θmo = θ0. Denote θ0 and θ0,m as the Max-

imum Likelihood estimator of θ0 and θ0,m under the true model respectively. Any sub-modelcontaining all regressors of the true model is referred to as an overfitted model, contained bythe setO. The remaining sub-models are referred to as underfitted models, contained by the setU . Let dm = exp

(κTPmoκ− κTPmκ

)/2 + kmo − km

, where κTPmoκ ∼ X 2(q−kmo) and

κTPmκ ∼ X 2(q − km). Let wom be a vector with the mth element taking on the value of unity

and other elements zeros andwAIC,m = I(m ∈ O/mo)dm + I(m = mo) /

1 +∑

m∈O/mo dm

,

where I(·) is an indicator function. Under the mth model, the likelihood ratio test statistic isgiven by

λm,n =supθ∈Θm

∏nt=1 f(yt,θ)

supθ∈Θ

∏nt=1 f(yt,θ)

=

∏nt=1 f(yt, θm)∏nt=1 f(yt, θo)

.

Lemma 2.1. Suppose that Conditions 2.1, 2.3 and 2.6 are satisfied and m ∈ O. Then we have

−2 lnλm,n = (Πmξn)T [(ΠmF(θ0)ΠTm)−1 −Hm]Πmξn + o(1)

d−→ κTPmκ ∼ χ2(q − km),

where ξn = 1√nln(θ0), κ ∼ N (0, Iq×q) and Pm = (ΠmF(θ0)ΠT

m)12 [(ΠmF(θ0)ΠT

m)−1 −Hm](ΠmF(θ0)ΠT

m)12 .

Proof: See Subsection 2.3.

Lemma 2.2. For any underfitted model s ∈ U , under Conditions 2.1,2.2 and Conditions2.3(b)-2.5, we have

n−1ln(θs) = E(ln f(yt,θ∗s)) + op(1). (2.6)

Proof: See Subsection 2.3.

Lemma 2.3. Suppose that Conditions 2.1-2.6 are satisfied. Then

wAIC,md−→ wAIC,m, m = 1, 2, · · · ,M and wBIC ≡ (wBIC,1, . . . , wBIC,M)T

p−→ womo . (2.7)

Proof: See Subsection 2.3.

Let ∆m be a diagonal selection matrix such that

∆m =

Iq1

δ1

δ2

. . .δq2

,

19

Page 20: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

where Iq1 is an q1 × q1 identity matrix and

δj =

1, j ∈ Sm,0, j 6∈ Sm,

The asymptotic distributions of the S-AIC and S-BIC estimators are established in the follow-ing theorem.

Theorem 2.1. Suppose that Conditions 2.1-2.7 are satisfied. Then√n(θ(wAIC)− θ0)

d−→∑m∈O

(Gm/G)ΠTmHSmΠm∆mF(θ0)β, (2.8)

θj(wBIC)p−→ 0 for j 6∈ Smo , and θj(wBIC)

d−→ Zj for j ∈ Smo ,

where

Gm = exp

(ΠmF(θ0)β)T [Hm − (ΠmF(θ0)ΠTm)−1]ΠmF(θ0)β/2− km

,

G =∑

m∈OGm, β ∼ N(0,F−1(θ0)), Zj ∼ N (0, σj) and σj is the j th element on the diagonalof matrix Σ = ΠT

moHSmoΠmo∆moF(θ0)(ΠTmoHSmoΠmo∆mo)

T .

Proof: See Subsection 2.3.

Denote θm = (θ1,m, · · · , θq,m)T , where θj,m is the jth component of θm. FollowingBuckland et al. (1997), we also consider the scaled SAIC and scaled SBIC estimators suchthat

∑k wxIC,mk = 1, where xIC is AIC or BIC. This leads to the scaled SAIC and scaled

SBIC estimators of θj , defined as

θj(wAICs) =

Mj∑k=1

wAIC,mk∑Mj

k=1 wAIC,mkθj,k and θj(wBICs) =

Mj∑k=1

wBIC,mk∑Mj

k=1 wBIC,mkθj,k (2.9)

respectively.

Denote θ(wAICs) = (θ1(wAICs), · · · , θq(wAICs)) and θ(wBICs) = (θ1(wBICs), · · · , θq(wBICs))

Next, we establish the asymptotic distributions of θ(wAICs) and θ(wBICs).

Corollary 2.1. Suppose that Conditions 2.1-2.7 are satisfied. Then

√n(θj(wAICs)− θ0,j)

d−→Mj∑k=1

1mk ∈ O(Gmk∑Mj

k=1 1mk ∈ OGmk

)∆mkF−1(θ0)∆mkF(θ0)ηj

and√n(θ(wBICs)− θ0)

d−→ β,

where

Gmk = exp

(ΠmkF(θ0)β)T [Hmk − (ΠmkF(θ0)ΠTmk

)−1]ΠmkF(θ0)β/2− km,

β ∼ N(0,F−1(θ0)), ηj is the jth component of β and

1mk ∈ O =

1, mk ∈ O,0, mk 6∈ O.

20

Page 21: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

Proof: The corollary is a direct consequence of Theorem 2.1.

Furthermore, let the parameter of interest be µ = µ(θ), a smooth real-valued function.Denote the sub-model estimator as µm = µ(θm)). Then the model averaging estimators of µbased on SAIC weight and SBIC weight are

µ(wAIC) =M∑m=1

wAIC,mµm and µ(wBIC) =M∑m=1

wBIC,mµm,

respectively. Then by Theorems 2.1 and Theorem 7 in Ferguson (1996), we obtain the follow-ing corollary.

Corollary 2.2. Under the assumptions of Theorem 2.1 and assuming that µ(θ) = ∂µ∂θ

is con-tinuous in a neighborhood of θ0, we have

√n(µ(wAIC)− µ(θ0))

d−→∑m∈O

(Gm/G)µ(θ0)TΠTmHSmΠm∆mF(θ0)β (2.10)

and√n(µ(wBIC)− µ(θ0))

d−→ N (0, µ(θ0)TΣµ(θ0)). (2.11)

2.3 Proof of Theorem 2.1

Proof of Lemma 2.1. When m ∈ O, the last q − km components of the true value θ0,m are 0.Given Conditions 2.1, 2.3 and 2.6, and assuming that Theorems 18 and 22 of Ferguson (1996)are satisfied, then we have −2 lnλm,n = 2[ln(θ0) − ln(θm)]. Now, expanding ln(θm) aboutθ0,m yields:

ln(θm) = ln(θ0,m) + lm,n(θ0,m)(θ′m − θ0,m)− n(θ′m − θ0,m)T In(θ′m)(θ′m − θ0,m),

where In(θ′m) = − 1n

∫ 1

0

∫ 1

0vlm,n(θ0,m+uv(θ′m− θ0,m))dudv

a.s.−→ 12Fm(θ0,m), as in the proof

of Theorem 18 of Ferguson (1996). Let o(1) represent a random variable matrix with eachelement converging almost surely to 0 as n→∞. By lm,n(θ0,m) = 0, we have

−2 lnλm,n = −2(ln(θm)− ln(θ0))

= 2n(θm − θ0)T In(θm)(θm − θ0)

= 2n(Πm(θm − θ0))TΠmIn(θm)ΠTmΠm(θm − θ0)

= 2n(θ′m − θ0,m)T In(θ′m)(θ′m − θ0,m)

= n(θ′m − θ0,m)T (Fm(θ0,m) + o(1))(θ′m − θ0,m).

To ascertain the asymptotic distribution of√n(θ′m − θ0,m), consider the following expansion

of ln(θ′m) about θ0,m:

1√nlm,n(θ′m) =

1√nlm,n(θ0,m) +

1

n

∫ 1

0

lm,n(θ0,m + v(θ′m − θ0,m))dv√n(θ′m − θ0,m)

21

Page 22: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

= (−Fm(θ0,m) + o(1))√n(θ′m − θ0,m).

Hence√n(θ′m − θ0,m) = −(Fm(θ0,m)−1 + o(1)) 1√

nlm,n(θ′m) and

−2 lnλm,n =1√nlm,n(θ′m)T (Fm(θ0,m)−1 + o(1))

1√nlm,n(θ′m). (2.1)

To seek the asymptotic distribution of lm,n(θ′m), consider the following expansion about θ0,m:

1√nlm,n(θ′m) =

1√nlm,n(θ0,m) +

1

n

∫ 1

0

lm,n(θ0,m + v(θ′m − θ0,m))dv√n(θ′m − θ0,m). (2.2)

For the mth sub-model, write Fm(θ0,m) as

Fm(θ0,m) =

km × km km × (q − km)Gm,1 Gm,2

(q − km)× km (q − km)× (q − km)Gm,3 Gm,4

and let

Hm =

(G−1m,1 00 0

).

Note that the first km components of lm,n(θ′m) are zero, yielding Hmlm,n(θ′m) = 0 and

Hm1√nlm,n(θ0,m) = Hm(Fm(θ0,m) + o(1))

√n(θ′m − θ0,m) =

√n(θ′m − θ0,m) + o(1)

as the last q− km components of θ′m and θ0,m are equal. Substituting this result into (2.2), weobtain

1√nlm,n(θ′m) = [I −Fm(θ0,m)Hm]

1√nlm,n(θ0,m) + o(1)

= [I − ΠmF(θ0)ΠTmHm]

1√n

Πmln(θ0) + o(1). (2.3)

From the Central Limit Theorem,

1√nln(θ0) =

√n

(1

nln(θ0)

)= ξn

d−→ ξ,

where ξ ∼ N (0,F(θ0)). Hence,

1√nlm,n(θ′m)

d−→ [I − ΠmF(θ0)ΠTmHm]Πmξ,

so that, from (2.1) and (2.3),

−2 lnλm,n =1√nlm,n(θ0,m)T [I −Fm(θ0,m)Hm]Fm(θ0,m)−1[I −Fm(θ0,m)Hm] (2.4)

22

Page 23: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

× 1√nlm,n(θ0,m) + o(1)

=1√nlm,n(θ0,m)T [Fm(θ0,m)−1 −Hm]

1√nlm,n(θ0,m) + o(1) (by HmFm(θ0,m)Hm = Hm)

=1√n

(Πmln(θ0))T [(ΠmF(θ0)ΠTm)−1 −Hm]

1√n

Πmln(θ0) + o(1)

= (Πmξn)T [(ΠmF(θ0)ΠTm)−1 −Hm]Πmξn + o(1)

d−→ (Πmξ)T [(ΠmF(θ0)ΠTm)−1 −Hm]Πmξ

= κT (ΠmF(θ0)ΠTm)

12 [(ΠmF(θ0)ΠT

m)−1 −Hm](ΠmF(θ0)ΠTm)

12κ (2.5)

whereκ = F(θ0)−12ξ ∼ N (0, Iq×q). Let Pm = (ΠmF(θ0)ΠT

m)12 [(ΠmF(θ0)ΠT

m)−1−Hm](ΠmF(θ0)ΠTm)

12

be a projection and rank(Pm) = trace(Pm) = q − km. Hence,

−2 lnλm,nd−→ κTPmκ ∼ χ2(q − km). (2.6)

Proof of Lemma 2.2. Consider an underfitted model s ∈ U . Let Conditions 2.1-2.2, Condi-tions 2.3(b)-2.5, and Theorems 2.2 and 3.2 of White (1982) hold. Then we have

θSsa.s.−→ θ∗Ss (2.7)

and√n(θSs − θ∗Ss)

d−→ N(0, Cs(θ∗Ss)). (2.8)

As well, it can be shown that Cs,n(θSs)a.s.−→ Cs(θ

∗Ss

). Then by applying the Taylor’s Theoremargument of Roy (1957), we have

ln(θs) =n∑t=1

ln f(yt, θs)

= ln(θ∗s) +n∑t=1

Ψs(yt,θ∗Ss)(θSs − θ

∗Ss) +

n

2(θSs − θ∗Ss)

TAs,n(θSs + α(θ∗Ss − θSs))(θSs − θ∗Ss),

where α ∈ (0, 1). Given Conditions 2.1-2.2 and 2.3(b)-2.5, and the proof of Theorem 3.2 ofWhite (1982), we have E(Ψs(yt,θ

∗Ss

)) = 0. In addition, by the Laws of Large Numbers, wehave

n−1

n∑t=1

Ψs(yt,θ∗Ss)

p−→ E(Ψs(yt,θ∗Ss)), (2.9)

and

n−1ln(θ∗s) = n−1

n∑t=1

ln f(yt,θ∗s)

p−→ E(ln f(yt,θ∗s)). (2.10)

By Theorem 2.2 of White (1982), it can be shown that

As,n(θSs + α(θ∗Ss − θSs))a.s.−→ As(θ

∗Ss) (2.11)

23

Page 24: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

Then, by (2.7)-(2.11), we obtain

n−1ln(θs) = E(ln f(yt,θ∗s)) + op(1). (2.12)

Proof of Lemma 2.3. From Lemma 2.1, when m ∈ O, we have

wAIC,mw−1AIC,mo = expAICmo/2− AICm/2

= exp

n∑t=1

ln f(yt, θmo) +n∑t=1

ln f(yt, θm) + kmo − km

= exp

− ln

(n∏t=1

f(yt, θmo)/ n∏t=1

f(yt, θo)

)+ ln

(n∏t=1

f(yt, θm)/ n∏t=1

f(yt, θo)

)+ kmo − km

= exp (−2 lnλmo,n + 2 lnλm,n) /2 + kmo − km= exp

((Πmoξn)T [(ΠmoF(θ0)ΠT

mo)−1 −Hmo ]Πmξn − (Πmξn)T [(ΠmF(θ0)ΠT

m)−1 −Hm]Πmξn

+o(1))/2 + kmo − km

d−→ exp

(κTPmoκ− κTPmκ

)/2 + kmo − km

, (2.13)

where κTPmoκ ∼ X 2(q − kmo) and κTPmoκ ∼ X 2(q − km).

On the other hand, when s ∈ U , from Lemma 2.2, we have

n−1ln(θs) = E(ln f(yt,θ∗s)) + op(1). (2.14)

Similarly, from the proof of Lemma 2.2, it can be proven that

n−1ln(θmo) = E(ln f(yt,θ0)) + op(1). (2.15)

By the definition of KLIC and Theorem of Bowden (1973), we have

E(ln f(yt,θ0))− E(ln f(yt,θ∗s)) = δ > 0. (2.16)

Hence, when s ∈ U , by (2.14)-(2.16), we have

wAIC,sw−1AIC,mo = exp(AICmo/2− AICs/2)

= exp

(−

n∑t=1

ln f(yt, θmo) +n∑t=1

ln f(yt, θs) + kmo − ks

)

= exp

(−n

(n−1

n∑t=1

ln f(yt, θmo)− n−1

n∑t=1

ln f(yt, θs)

)+ kmo − ks

)= exp

(−n(n−1ln(θmo)− n−1ln(θs)

)+ kmo − ks

)= [exp(−n)](n

−1ln(θmo )−n−1ln(θs)+(kmo−ks)/n)

= Op(exp(−n))p−→ 0 as n→∞. (2.17)

As wAIC ∈ W , (2.17) implies that wAIC,m = Op(exp(−n)). Using (2.13), (2.17) and

wAIC,m = wAIC,mw−1AIC,mo/

M∑m=1

wAIC,mw−1AIC,mo ,

24

Page 25: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

we have wAICd−→ wAIC.

Similarly, we can show that for any overfitted model m and m 6= mo,

wBIC,mw−1BIC,mo = Op(1) · n(kmo−km)/2 p−→ 0,

and that for any underfitted model s ∈ U ,

wBIC,sw−1BIC,mo = [exp(−n)]n

−1ln(θmo )−n−1ln(θs)n(kmo−ks)/2 p−→ 0.

From above two formulae, we have wBIC

p−→ womo .

Proof of Theorem 2.1. When m ∈ O, from the proof of the Lemmas 2.1 and 2.3, we have

wAIC,m = wAIC,mw−1AIC,mo/

M∑m=1

wAIC,mw−1AIC,mo

= exp (−2 lnλmo,n + 2 lnλm,n) /2 + kmo − km /∑m∈O

exp

(−2 lnλmo,n + 2 lnλm,n) /2

+ kmo − km

+ op(1)

= exp

(Πmξn)T [Hm − (ΠmF(θ0)ΠTm)−1]Πmξn/2− km + o(1)

/∑m∈O

exp

(Πmξn)T

[(Hm − ΠmF(θ0)ΠTm)−1]Πmξn/2− km + o(1)

+ op(1)

d−→ exp

(Πmξ)T [Hm − (ΠmF(θ0)ΠTm)−1]Πmξ/2− km

/∑m∈O

exp

(Πmξ)T

[(Hm − ΠmF(θ0)ΠTm)−1]Πmξ/2− km

= exp

(ΠmF(θ0)β)T [Hm − (ΠmF(θ0)ΠT

m)−1]ΠmF(θ0)β/2− km

/∑m∈O

exp

(ΠmF(θ0)β)T [Hm − (ΠmF(θ0)ΠTm)−1]ΠmF(θ0)β/2− km

= Gm/

∑m∈O

Gm = Gm/G, (2.18)

where

Gm = exp

(ΠmF(θ0)β)T [Hm − (ΠmF(θ0)ΠTm)−1]ΠmF(θ0)β/2− km

β ∼ N (0,F−1(θ0)), and G =

∑m∈OGm.

On the other hand, when m ∈ U , from the proof of the Lemma 2.3, we obtain

wAIC,m = Op(exp(−n)) and wBIC,m = Op(exp(−n)n(kmo−km)/2). (2.19)

As Θ is a compact subset of Rq, by Theorems 2.2 and 3.2 of White (1982), we can conclude,for any sub-model m, that

θ∗m = O(1), θ0 = O(1), θma.s.−→ θ∗m, (2.20)

25

Page 26: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

and

√n(θSm − θ∗Sm) = −A−1

m (θ∗Sm)1√n

n∑t=1

∂ ln f(yt,θ∗m)

∂θSm+ op(1). (2.21)

When m ∈ O, by the definition of KLIC and Theorem of Bowden (1973), we have θ∗Sm =

θ0,Sm and θ∗m = θ0. Then by Theorem 3.3 of White (1982), we have −A−1m (θ∗Sm) = G−1

Sm,1,

and hence

√n(θSm − θ∗Sm) = G−1

Sm,1

√n(

1

n

n∑t=1

∂ ln f(yt,θ∗m)

∂θSm) + op(1). (2.22)

From the proof of Lemma 2.1, we have√n( 1

n

∑nt=1

∂ ln f(yt,θ∗m)∂θSm

) = ξm,nd−→ ξm, where

ξm ∼ N (0, GSm,1). Hence

√n(θ(wAIC)− θ0) =

M∑m=1

wAIC,m

√n(θm − θ0)

=∑m∈U

wAIC,m

√n(θm − θ∗m + θ∗m − θ0) +

∑m∈O

wAIC,m

√n(θm − θ0)

=∑m∈U

Op(exp(−n))√nOp(1) +

∑m∈O

wAIC,m

√n(θm − θ0)

=∑m∈O

wAIC,m

√n(θm − θ0) + op(1)

=∑m∈O

wAIC,m

√n(θm − θ∗m) + op(1)

=∑m∈O

wAIC,m

√nΠ−1

m (θ′m − θ′m) + op(1)

=∑m∈O

wAIC,m

√nΠT

m((θTSm ,0T )T − (θ∗TSm ,0

T )T ) + op(1)

=∑m∈O

wAIC,m

√nΠT

m((θSm − θ∗Sm)T ,0Tq−km)T + op(1)

=∑m∈O

wAIC,mΠTm

(G−1Sm,1

00 0

)(1√n

∑nt=1

∂ ln f(yt,θ∗m)∂θSm

0

)+ op(1)

=∑m∈O

wAIC,mΠTmHSmΠm∆mξn

d−→∑m∈O

(Gm/G)ΠTmHSmΠm∆mF(θ0)β

(2.23)

26

Page 27: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

Similarly, we can prove that

√n(θ(wBIC)− θ0) =

M∑m=1

wBIC,m

√n(θm − θ0)

=∑m∈U

wBIC,m

√n(θm − θ0) +

∑m∈O

wBIC,m

√n(θm − θ0)

=∑m∈U

Op(exp(−n)n(kmo−km)/2)√nOp(1) +

∑m∈O

wBIC,m

√n(θm − θ0).

d−→ΠTmoHSmoΠmo∆moF(θ0)β,

where ΠTmoHSmoΠmo∆moF(θ0)β ∼ N (0,Σ), and Σ = ΠT

moHSmoΠmo∆moF(θ0)(ΠTmoHSmoΠmo∆mo)

T .It is known that

∆mo =

Iq1

δ1

δ2

. . .δq2

,

where Iq1 is an q1 × q1 identity matrix and

δj =

1, j ∈ Smo ,0, j 6∈ Smo .

As the moth candidate model is the true model, when j 6∈ Smo , θj,0 = 0. If σj is the j elementon the diagonal of matrix Σ, then σj = 0. This leads to θj(wBIC)

p−→ θj,0 = 0. For j ∈ Smo ,√n(θj(wBIC)− θj,0)

d−→ Zj,

where Zj ∼ N (0, σj).

3 Part 3: Frequentist model averaging under inequality con-straints

3.1 Model framework and notations

This subproject expands the application of frequentist model averaging to models that containinequality constraints on parameters. Let yi be generated by the following data generatingprocess:

yi = µi + εi = xTi0θ0 + εi, i = 1, 2, ..., n,

where µi = xTi0θ0 is the conditional expectation of yi, xi0 = (xi1, xi2, ..., xi∞)T is an infinitedimensional vector of regressors, θ0 = (θ1, θ2, ..., θ∞)T is the vector of the unknown coeffi-cients, and εi’s are independently distributed errors with E(εi) = 0 and E(ε2i ) = σ2

i for all

27

Page 28: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

i. In practice, it is likely that some regressors in xi0 will be omitted, resulting in the workingmodel

yi = xTi θ + ei, i = 1, 2, ..., n, (3.1)

where xi = (xi1, xi2, ..., xip)T , θ = (θ1, θ2, ..., θp)

T and p is fixed. The misspecification errorassociated with (3.1) is thus µi − xTi θ. We can write (3.1) as

Y = Xθ + e, (3.2)

where Y = (y1, ..., yn)T , X = (x1, ...,xn)T , and e = (e1, ..., en)T . In addition, we letµ = (µ1, ..., µn)T and Λ = diag(σ2

1, σ22, ..., σ

2n).

Suppose that the investigator’s interests center around M distinct sets of constraints allembedded within

Rθ ≥ 0, (3.3)

that can accommodate virtually all linear equality and inequality constraints encountered inpractice, where θ is a parameter vector andR is a known vector. Now, let M be fixed andRm

be an rm×p known matrix. WriteR = Rm for themth set of constraints on θ,m = 1, 2, ...,M .Thus, there are M candidate models each subject to a distinct, competing set of constraintson θ given by Rmθ ≥ 0, m = 1, 2, ...,M . We refer to the model subject to the mth set ofinequality constraintsRmθ ≥ 0 as the mth candidate model.

Let θ(m) be the least squares (LS) estimator of θ subject to Rmθ ≥ 0. This estimatoris obtained by quadratic programming and many software routines such as the solve.QP in theR package quadprog can be used to solve this problem. Write µ(m) = Xθ(m). The modelaverage estimator of µ is thus

µ(w) =∑M

m=1wmµ(m) =

∑M

m=1wmXθ(m), (3.4)

where w = (w1, w2, ..., wM)T is a weight vector in the following unit simplex ofRM :

H =w ∈ [0, 1]M :

∑M

m=1wm = 1

.

The model average estimator µ(w) smoothes across the estimators µ(m)’s, m = 1 · · · ,M ,each corresponds to distinct set of inequality constraints on θ. This is unlike Kuiper et al.’s(2011) GORIC-based selection approach that uses one of µ(m)’s. A description of the latterapproach is in order. The GORIC-based model selection estimator chooses the model with thesmallest

GORICm = −2logf(Y |Xθ(m)) + 2PTm,

where θ(m) is the order-restricted maximum likelihood estimator of θ, PTm = 1+∑p

l=1wl(p,W , Cm)lis the infimum of a bias term, p is the dimension of θ, W = XTΛ−1X , Cm = θ ∈Rp|Rmθ ≥ 0 is the set of θ that satisfies themth set of inequality constraints, andwl(p,W , Cm)is the level probability for the hypothesis of θ(m) ∈ Cm when θ ∈H = y = (y1, ..., yk)|y1 =... = yk. While Kuiper et al. (2011) considered a t-variate normal linear model with finite di-mensional vector of regressors, we consider a univarite linear model with infinite dimensionalvector of regressors without assuming normality of errors.

28

Page 29: Online Supplementary Material for GRF Proposal 2018: “New ...personal.cb.cityu.edu.hk › msawan › grf2018n.pdfUnlike the weights in Lu & Su (2015), our weights are not subject

The biggest challenge confronting the model averaging approach is choosing modelweights. Here, we consider a J-fold cross-validation approach, which extends the Jackknifemodel averaging (JMA) approach proposed by Hansen & Racine (2012). The next subsec-tion introduces the method and develops an asymptotic theory for the resultant J-fold cross-validation model average (JCVMA) estimator.

3.2 A cross-validation weight choice method

The idea of cross-validation is to minimise the squared prediction error of the final estimator by comparing the out-of-sample fit of parameter estimates obtained from training data sets against holdout samples. Cross-validation is a technique routinely applied when over-fitting may be an issue. To implement the method, we divide the data into $J$ groups such that each group contains $n_0 = n/J$ observations. Let $\tilde\theta^{(-j)}_{(m)}$ be the estimator of $\theta$ under $R_m\theta \ge 0$ with the $j$th group removed from the sample. Write $\tilde\mu^{(-j)}_{(m)} = X^{(j)}\tilde\theta^{(-j)}_{(m)}$, where $X^{(j)}$ is an $n_0 \times p$ matrix containing the observations in rows $1 + (j-1)n_0, \ldots, jn_0$ of $X$. That is, we delete the $j$th group of observations from the sample and use the remaining $n - n_0$ observations to estimate the coefficients $\theta$ subject to $R_m\theta \ge 0$. Then, based on the estimated coefficients, we predict the $n_0$ observations that are excluded. We repeat this process $J$ times until each observation in the sample has been held out once. This leads to the vector of estimators

$$\tilde\mu_{(m)} = \begin{pmatrix}\tilde\mu^{(-1)}_{(m)}\\ \tilde\mu^{(-2)}_{(m)}\\ \vdots\\ \tilde\mu^{(-J)}_{(m)}\end{pmatrix}
= \begin{pmatrix}X^{(1)}\tilde\theta^{(-1)}_{(m)}\\ X^{(2)}\tilde\theta^{(-2)}_{(m)}\\ \vdots\\ X^{(J)}\tilde\theta^{(-J)}_{(m)}\end{pmatrix}
= \begin{pmatrix}X^{(1)} & & &\\ & X^{(2)} & &\\ & & \ddots &\\ & & & X^{(J)}\end{pmatrix}
\begin{pmatrix}\tilde\theta^{(-1)}_{(m)}\\ \tilde\theta^{(-2)}_{(m)}\\ \vdots\\ \tilde\theta^{(-J)}_{(m)}\end{pmatrix}
= A\tilde\theta_{(m)},$$
where $A$ is a block diagonal matrix containing the observations of $X$ and $\tilde\theta_{(m)} = (\tilde\theta^{(-1)T}_{(m)}, \ldots, \tilde\theta^{(-J)T}_{(m)})^T$ is a $Jp \times 1$ vector. Substituting $\tilde\theta_{(m)}$ for $\hat\theta_{(m)}$ and $\tilde\mu_{(m)}$ for $\hat\mu_{(m)}$ in (3.4) results in the following jackknife model average estimator of $\mu$:

$$\tilde\mu(w) = \sum_{m=1}^M w_m\tilde\mu_{(m)} = \sum_{m=1}^M w_m A\tilde\theta_{(m)}.$$
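The construction of $\tilde\mu_{(m)}$ just described can be sketched in R as follows, reusing the illustrative constrained_ls function from the earlier sketch; the code assumes that the observations are ordered so that fold $j$ occupies rows $1 + (j-1)n_0, \ldots, jn_0$ and that $n$ is a multiple of $J$.

## Minimal sketch of the J-fold construction of mu_tilde_(m) for one candidate model.
leave_group_out_fit <- function(X, Y, Rm, J) {
  n  <- nrow(X)
  n0 <- n / J                          # the text assumes n is a multiple of J
  mu_tilde <- numeric(n)
  for (j in seq_len(J)) {
    idx           <- (1 + (j - 1) * n0):(j * n0)
    theta_mj      <- constrained_ls(X[-idx, , drop = FALSE], Y[-idx], Rm)
    mu_tilde[idx] <- X[idx, , drop = FALSE] %*% theta_mj   # predict the held-out fold
  }
  mu_tilde                             # the n x 1 vector A %*% theta_tilde_(m)
}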

Now, consider the following measure based on squared cross-validation errors:
$$CV_J(w) = \|\tilde\mu(w) - Y\|^2,$$
where $\|a\|^2 = a^Ta$. The optimal choice of $w$ resulting from cross-validation is
$$\hat w = \arg\min_{w\in\mathcal{H}}CV_J(w). \qquad (3.5)$$
Substituting $\hat w$ for $w$ in (3.4) results in $\hat\mu(\hat w)$, the JCVMA estimator of $\mu$.
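Since $CV_J(w) = \|\tilde\mu(w) - Y\|^2$ is a quadratic function of $w$, the minimisation in (3.5) is itself a small quadratic program over the unit simplex. A minimal sketch, assuming the leave-group-out fits have been collected column by column into a matrix mu_tilde (one column per candidate model), with a tiny ridge term added purely for numerical positive definiteness:

library(quadprog)

jcv_weights <- function(mu_tilde, Y, eps = 1e-10) {
  M    <- ncol(mu_tilde)
  Dmat <- crossprod(mu_tilde) + diag(eps, M)  # quadratic part of ||mu_tilde %*% w - Y||^2
  dvec <- drop(crossprod(mu_tilde, Y))
  Amat <- cbind(rep(1, M), diag(M))           # first column: equality sum(w) = 1; rest: w >= 0
  bvec <- c(1, rep(0, M))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}

## e.g., with a list R_list of candidate constraint matrices:
## mu_tilde <- sapply(R_list, function(Rm) leave_group_out_fit(X, Y, Rm, J))
## w_hat    <- jcv_weights(mu_tilde, Y)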

The efficiency of the JCVMA estimator $\hat\mu(\hat w)$ is evaluated in terms of the following squared error loss function:
$$L(w) = \|\hat\mu(w) - \mu\|^2.$$
To prove the asymptotic optimality of $\hat\mu(\hat w)$, we need the following conditions:


Condition 3.1. For any $\hat\theta_{(m)}$, there exists a limit $\theta^*_{(m)}$ such that
$$\hat\theta_{(m)} - \theta^*_{(m)} = O_p(n^{-1/2}).$$

Condition 3.2. $\xi_n^{-1}n^{1/2} = o(1)$ and $\xi_n^{-2}\sigma^2\sum_{m=1}^M\|\mu^*_{(m)} - \mu\|^2 = o(1)$, where $\mu^*_{(m)} = X\theta^*_{(m)}$, $\xi_n = \inf_{w\in\mathcal{H}}L^*(w)$, $\mu^*(w) = \sum_{m=1}^M w_m\mu^*_{(m)}$, $L^*(w) = \|\mu^*(w) - \mu\|^2$, and $\sigma^2 = \max_{1\le i\le n}\sigma_i^2$.

Condition 3.3. $\max_{1\le j\le J}\bar\lambda\big(\frac{1}{n_0}X^{(j)T}X^{(j)}\big) = O_p(1)$, where $\bar\lambda(\cdot)$ is the maximum singular value of a matrix.

Remark 3.1. Condition 3.1 places a condition on the rate of convergence of $\hat\theta_{(m)}$ to its limit $\theta^*_{(m)}$. This condition is derived from results by White (1982), who established the consistency and asymptotic normality of maximum likelihood estimators of parameters restricted to a compact set in misspecified models. The first part of Condition 3.2 implies that $\xi_n\to\infty$, which in turn implies that all candidate models are misspecified; it also requires $\xi_n$ to diverge at a faster rate than $n^{1/2}$. The second part of Condition 3.2 means that $\sigma^2\sum_{m=1}^M\|\mu^*_{(m)} - \mu\|^2$ diverges at a slower rate than $\xi_n^2$. This condition is commonly used in other model averaging studies; see Wan et al. (2010), Liu & Okui (2013) and Ando & Li (2014). Condition 3.3 places a restriction on the maximum singular value of the $j$th block of $X$.

The following theorem gives the asymptotic optimality of the JCVMA estimator:

Theorem 3.1. Suppose Conditions 3.1-3.3 hold. Then
$$\frac{L(\hat w)}{\inf_{w\in\mathcal{H}}L(w)} \stackrel{p}{\longrightarrow} 1. \qquad (3.6)$$

Theorem 3.1 suggests that model averaging using $\hat w$ as the weight vector is asymptotically optimal in the sense that the resulting squared error loss is asymptotically identical to that of the infeasible best possible model average estimator. The next subsection gives the proof of Theorem 3.1.

3.3 Proof of Theorem 3.1

Note that
$$\begin{aligned}
CV_J(w) &= \|\tilde\mu(w) - Y\|^2\\
&= \|\hat\mu(w) - \mu + \tilde\mu(w) - \mu^*(w) - \{\hat\mu(w) - \mu^*(w)\} + \mu - Y\|^2\\
&\le \|\hat\mu(w) - \mu\|^2 + \|\tilde\mu(w) - \mu^*(w)\|^2 + \|\hat\mu(w) - \mu^*(w)\|^2 + \|\mu - Y\|^2\\
&\quad + 2\|\hat\mu(w) - \mu\|\|\tilde\mu(w) - \mu^*(w)\| + 2\|\hat\mu(w) - \mu\|\|\hat\mu(w) - \mu^*(w)\| + 2|(\hat\mu(w) - \mu)^T\varepsilon|\\
&\quad + 2\|\tilde\mu(w) - \mu^*(w)\|\|\hat\mu(w) - \mu^*(w)\| + 2\|\tilde\mu(w) - \mu^*(w)\|\|\varepsilon\| + 2\|\hat\mu(w) - \mu^*(w)\|\|\varepsilon\|\\
&\le \|\hat\mu(w) - \mu\|^2 + \|\tilde\mu(w) - \mu^*(w)\|^2 + \|\hat\mu(w) - \mu^*(w)\|^2\\
&\quad + 2\|\hat\mu(w) - \mu^*(w)\|\|\tilde\mu(w) - \mu^*(w)\| + 2\|\mu^*(w) - \mu\|\|\tilde\mu(w) - \mu^*(w)\|\\
&\quad + 2\|\hat\mu(w) - \mu^*(w)\|^2 + 2\|\mu^*(w) - \mu\|\|\hat\mu(w) - \mu^*(w)\|\\
&\quad + 2\|\hat\mu(w) - \mu^*(w)\|\|\varepsilon\| + 2|(\mu^*(w) - \mu)^T\varepsilon| + 2\|\tilde\mu(w) - \mu^*(w)\|\|\hat\mu(w) - \mu^*(w)\|\\
&\quad + 2\|\tilde\mu(w) - \mu^*(w)\|\|\varepsilon\| + 2\|\hat\mu(w) - \mu^*(w)\|\|\varepsilon\| + \|\varepsilon\|^2\\
&= L(w) + \Pi_n(w) + \|\varepsilon\|^2,
\end{aligned}$$
where $\varepsilon = Y - \mu$ and $\Pi_n(w)$ collects all terms in the preceding bound other than $L(w)$ and $\|\varepsilon\|^2$,

and
$$\begin{aligned}
L(w) &= \|\hat\mu(w) - \mu\|^2\\
&= \|\hat\mu(w) - \mu^*(w) + \mu^*(w) - \mu\|^2\\
&= \|\mu^*(w) - \mu\|^2 + \|\hat\mu(w) - \mu^*(w)\|^2 + 2(\hat\mu(w) - \mu^*(w))^T(\mu^*(w) - \mu)\\
&= L^*(w) + \Xi_n(w).
\end{aligned}$$
Hence, to prove (3.6), it suffices to prove
$$\sup_{w\in\mathcal{H}}|\Pi_n(w)/L^*(w)| = o_p(1) \qquad (3.7)$$
and
$$\sup_{w\in\mathcal{H}}|\Xi_n(w)/L^*(w)| = o_p(1). \qquad (3.8)$$
The proofs of (3.7) and (3.8) entail proving
$$\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|^2 = O_p(1), \qquad (3.9)$$
$$\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|^2 = O_p(1) \qquad (3.10)$$
and
$$\sup_{w\in\mathcal{H}}\big|(\mu^*(w) - \mu)^T\varepsilon\big|\big/L^*(w) = o_p(1). \qquad (3.11)$$

Now, by Conditions 3.1 and 3.3 and the assumption that both $J$ and $M$ are fixed, we have
$$\begin{aligned}
\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|^2
&\le \max_{1\le m\le M}\|\hat\mu_{(m)} - \mu^*_{(m)}\|^2\\
&= \max_{1\le m\le M}\|X(\hat\theta_{(m)} - \theta^*_{(m)})\|^2\\
&\le \max_{1\le m\le M}\bar\lambda(n^{-1}X^TX)\,n\,\|\hat\theta_{(m)} - \theta^*_{(m)}\|^2\\
&= \max_{1\le m\le M}\bar\lambda\Big(n^{-1}\sum_{j=1}^J X^{(j)T}X^{(j)}\Big)\,n\,\|\hat\theta_{(m)} - \theta^*_{(m)}\|^2\\
&\le \max_{1\le m\le M}\sum_{j=1}^J\bar\lambda\big(n^{-1}X^{(j)T}X^{(j)}\big)\,n\,\|\hat\theta_{(m)} - \theta^*_{(m)}\|^2\\
&\le \max_{1\le m\le M}J\max_{1\le j\le J}\bar\lambda\big(J^{-1}n_0^{-1}X^{(j)T}X^{(j)}\big)\,n\,\|\hat\theta_{(m)} - \theta^*_{(m)}\|^2\\
&= \max_{1\le m\le M}\max_{1\le j\le J}\bar\lambda\big(n_0^{-1}X^{(j)T}X^{(j)}\big)\,n\,\|\hat\theta_{(m)} - \theta^*_{(m)}\|^2\\
&= O_p(1)\,n\,O_p(n^{-1})\\
&= O_p(1),
\end{aligned}$$

which is (3.9). Next, we prove (3.10). By Conditions 3.1 and 3.3 and the assumption that both $J$ and $M$ are fixed, we have
$$\begin{aligned}
\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|^2
&\le \max_{1\le m\le M}\|\tilde\mu_{(m)} - \mu^*_{(m)}\|^2\\
&= \max_{1\le m\le M}\|A\tilde\theta_{(m)} - X\theta^*_{(m)}\|^2\\
&= \max_{1\le m\le M}\Bigg\|A\begin{pmatrix}\tilde\theta^{(-1)}_{(m)}\\ \tilde\theta^{(-2)}_{(m)}\\ \vdots\\ \tilde\theta^{(-J)}_{(m)}\end{pmatrix} - A\Bigg\{\begin{pmatrix}1\\1\\\vdots\\1\end{pmatrix}\otimes\theta^*_{(m)}\Bigg\}\Bigg\|^2\\
&= \max_{1\le m\le M}\mathrm{trace}\Bigg\{\begin{pmatrix}\tilde\theta^{(-1)}_{(m)} - \theta^*_{(m)}\\ \vdots\\ \tilde\theta^{(-J)}_{(m)} - \theta^*_{(m)}\end{pmatrix}^T A^TA\begin{pmatrix}\tilde\theta^{(-1)}_{(m)} - \theta^*_{(m)}\\ \vdots\\ \tilde\theta^{(-J)}_{(m)} - \theta^*_{(m)}\end{pmatrix}\Bigg\}\\
&\le \max_{1\le m\le M}\bar\lambda\big(n_0^{-1}A^TA\big)\,n_0\sum_{j=1}^J\|\tilde\theta^{(-j)}_{(m)} - \theta^*_{(m)}\|^2\\
&= O_p(1)\,n_0\,J\,O_p(n^{-1})\\
&= O_p(1),
\end{aligned}$$
where $\|\tilde\theta^{(-j)}_{(m)} - \theta^*_{(m)}\|^2$ has the same convergence rate as $\|\hat\theta_{(m)} - \theta^*_{(m)}\|^2$, because the subsamples are independent and the sample sizes associated with $\tilde\theta^{(-j)}_{(m)}$ and $\hat\theta_{(m)}$ have the same order.

Hence (3.10) holds. In addition, for any $\delta > 0$, by Conditions 3.1 and 3.2, we obtain
$$\begin{aligned}
\mathrm{pr}\Big\{\sup_{w\in\mathcal{H}}\xi_n^{-1}\big|(\mu^*(w) - \mu)^T\varepsilon\big| > \delta\Big\}
&\le \mathrm{pr}\Big\{\sup_{w\in\mathcal{H}}\xi_n^{-1}\sum_{m=1}^M w_m\big|(\mu^*_{(m)} - \mu)^T\varepsilon\big| > \delta\Big\}\\
&= \mathrm{pr}\Big\{\max_{1\le m\le M}\big|(\mu^*_{(m)} - \mu)^T\varepsilon\big| > \xi_n\delta\Big\}\\
&\le \sum_{m=1}^M\mathrm{pr}\big\{\big|(\mu^*_{(m)} - \mu)^T\varepsilon\big| > \xi_n\delta\big\}\\
&\le \xi_n^{-2}\delta^{-2}\sum_{m=1}^M E\big\{(\mu^*_{(m)} - \mu)^T\varepsilon\big\}^2\\
&\le \xi_n^{-2}\delta^{-2}\sigma^2\sum_{m=1}^M\|\mu^*_{(m)} - \mu\|^2\\
&= o(1),
\end{aligned}$$
which implies (3.11).

It is straightforward to show that
$$\|\varepsilon\|^2 = O_p(n). \qquad (3.12)$$


By (3.9), (3.10), (3.12) and Condition 3.2, it can be shown that
$$\begin{aligned}
\sup_{w\in\mathcal{H}}\|\mu^*(w) - \mu\|\|\hat\mu(w) - \mu^*(w)\|/L^*(w)
&= \sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|/L^{*1/2}(w)\\
&\le \xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\| = o_p(1), \qquad (3.13)
\end{aligned}$$
$$\begin{aligned}
\sup_{w\in\mathcal{H}}\|\mu^*(w) - \mu\|\|\tilde\mu(w) - \mu^*(w)\|/L^*(w)
&= \sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|/L^{*1/2}(w)\\
&\le \xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\| = o_p(1), \qquad (3.14)
\end{aligned}$$
$$\begin{aligned}
\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|\|\tilde\mu(w) - \mu^*(w)\|/L^*(w)
&\le \Big\{\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|/L^{*1/2}(w)\Big\}\Big\{\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|/L^{*1/2}(w)\Big\}\\
&\le \xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|\,\xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\| = o_p(1), \qquad (3.15)
\end{aligned}$$
$$\begin{aligned}
\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|\|\varepsilon\|/L^*(w)
&\le \Big\{\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|/L^{*1/2}(w)\Big\}\Big\{\sup_{w\in\mathcal{H}}\|\varepsilon\|/L^{*1/2}(w)\Big\}\\
&\le \xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|\,\xi_n^{-1/2}\|\varepsilon\| = o_p(1) \qquad (3.16)
\end{aligned}$$
and
$$\begin{aligned}
\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|\|\varepsilon\|/L^*(w)
&\le \Big\{\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|/L^{*1/2}(w)\Big\}\Big\{\sup_{w\in\mathcal{H}}\|\varepsilon\|/L^{*1/2}(w)\Big\}\\
&\le \xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|\,\xi_n^{-1/2}\|\varepsilon\| = o_p(1). \qquad (3.17)
\end{aligned}$$

Combining (3.11)-(3.17), we obtain
$$\begin{aligned}
&\sup_{w\in\mathcal{H}}|\Pi_n(w)/L^*(w)|\\
&= \sup_{w\in\mathcal{H}}\Big|\|\tilde\mu(w) - \mu^*(w)\|^2 + \|\hat\mu(w) - \mu^*(w)\|^2
+ 2\|\hat\mu(w) - \mu^*(w)\|\|\tilde\mu(w) - \mu^*(w)\|\\
&\qquad + 2\|\mu^*(w) - \mu\|\|\tilde\mu(w) - \mu^*(w)\| + 2\|\hat\mu(w) - \mu^*(w)\|^2
+ 2\|\mu^*(w) - \mu\|\|\hat\mu(w) - \mu^*(w)\|\\
&\qquad + 2\|\hat\mu(w) - \mu^*(w)\|\|\varepsilon\| + 2|(\mu^*(w) - \mu)^T\varepsilon|
+ 2\|\tilde\mu(w) - \mu^*(w)\|\|\hat\mu(w) - \mu^*(w)\|\\
&\qquad + 2\|\tilde\mu(w) - \mu^*(w)\|\|\varepsilon\| + 2\|\hat\mu(w) - \mu^*(w)\|\|\varepsilon\|\Big|\Big/L^*(w)\\
&\le \xi_n^{-1}\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|^2 + 3\xi_n^{-1}\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|^2\\
&\quad + 4\xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|\,\xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|\\
&\quad + 2\xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\| + 2\xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|\\
&\quad + 4\xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|\,\xi_n^{-1/2}\|\varepsilon\|
+ 2\xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\tilde\mu(w) - \mu^*(w)\|\,\xi_n^{-1/2}\|\varepsilon\|\\
&\quad + 2\sup_{w\in\mathcal{H}}|(\mu^*(w) - \mu)^T\varepsilon|/L^*(w)\\
&= o_p(1)
\end{aligned}$$

and
$$\begin{aligned}
\sup_{w\in\mathcal{H}}|\Xi_n(w)/L^*(w)|
&= \sup_{w\in\mathcal{H}}\big|\|\hat\mu(w) - \mu^*(w)\|^2 + 2(\hat\mu(w) - \mu^*(w))^T(\mu^*(w) - \mu)\big|\big/L^*(w)\\
&\le \xi_n^{-1}\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\|^2 + 2\xi_n^{-1/2}\sup_{w\in\mathcal{H}}\|\hat\mu(w) - \mu^*(w)\| = o_p(1),
\end{aligned}$$
which are (3.7) and (3.8) respectively. Then we have (3.6). This completes the proof of Theorem 3.1.

4 Part 4: Frequentist model averaging over kernel regression estimators

4.1 Model framework and notations

We argue in the proposal that bandwidth selection in kernel regression is fundamentally a model selection problem stemming from the uncertainty about the smoothness of the regression function. If the goal is to derive a good estimator of the regression function, then deriving it by taking a weighted average of estimators based on different smoothness parameters enlarges the space of possible estimators and may ultimately lead to a more accurate estimator.

Let us consider the following nonparametric regression model:
$$y_i = f(x_i) + e_i, \quad i = 1, 2, \ldots, n, \qquad (4.1)$$
where $y_i$ is the response, $x_i = (x_{i(1)}, x_{i(2)}, \ldots, x_{i(d)})'$ contains $d$ covariates, $f(x_i) = \mu(i)$ is an unknown function of $x_i$, and $e_i$ is an independently distributed error term with $E(e_i) = 0$ and $E(e_i^2) = \sigma_i^2$. Equation (4.1) may be expressed in matrix form as
$$y = \mu + e,$$
where $y = (y_1, y_2, \ldots, y_n)'$, $\mu = (\mu(1), \mu(2), \ldots, \mu(n))'$, $X = (x_1, x_2, \ldots, x_n)'$, and $e = (e_1, e_2, \ldots, e_n)'$. Clearly, $E(e) = 0$ and $E(ee') = \Omega = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)$.

Let $k^{(j)}(\cdot)$ be the kernel function corresponding to the $j$th element of $x_i$. We assume that $k^{(j)}(\cdot)$ is a density function that is unimodal, compact and symmetric about 0 and has finite variance. Let $h_{(j)} \in \mathbb{R}^+$ be the bandwidth or smoothing parameter for $x_{i(j)}$, $j = 1, 2, \ldots, d$, and $h = \{h_{(1)}, h_{(2)}, \ldots, h_{(d)}\}$. Write $k_{h(j)}(\cdot) = k^{(j)}(\cdot/h_{(j)})/h_{(j)}$. When the dimension $d$ is larger than one, the product kernel function
$$k_h^P(z) = \prod_{j=1}^d k_{h(j)}(z_{(j)})$$
may be used, where $z = (z_{(1)}, z_{(2)}, \ldots, z_{(d)})'$ is a $d$-dimensional vector. Alternatively, one may use the following spherical kernel function with a common bandwidth across the different dimensions of the kernel function:
$$k_h^S(z) = \frac{k_h(\sqrt{z'z})}{\int k_h(\sqrt{z'z})\,dz},$$
where $k_h(u) = \frac{1}{h^d}k\big(\frac{u}{h}\big)$. A popular choice for $k$ is the standard $d$-variate normal density
$$k(u) = (2\pi)^{-d/2}\exp(-u^2/2).$$

The most popular estimator of $\mu$ is the Nadaraya-Watson (Nadaraya, 1964; Watson, 1964) or local constant kernel estimator, defined as
$$\hat\mu = Ky = \begin{pmatrix}
\frac{k_h(x_1 - x_1)}{\sum_{j=1}^n k_h(x_j - x_1)} & \frac{k_h(x_2 - x_1)}{\sum_{j=1}^n k_h(x_j - x_1)} & \cdots & \frac{k_h(x_n - x_1)}{\sum_{j=1}^n k_h(x_j - x_1)}\\
\frac{k_h(x_1 - x_2)}{\sum_{j=1}^n k_h(x_j - x_2)} & \frac{k_h(x_2 - x_2)}{\sum_{j=1}^n k_h(x_j - x_2)} & \cdots & \frac{k_h(x_n - x_2)}{\sum_{j=1}^n k_h(x_j - x_2)}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{k_h(x_1 - x_n)}{\sum_{j=1}^n k_h(x_j - x_n)} & \frac{k_h(x_2 - x_n)}{\sum_{j=1}^n k_h(x_j - x_n)} & \cdots & \frac{k_h(x_n - x_n)}{\sum_{j=1}^n k_h(x_j - x_n)}
\end{pmatrix} y,$$
where $K$ is an $n\times n$ weight matrix comprising the kernel weights, with $ij$th element given by $K_{ij} = k_h(x_j - x_i)\big/\sum_{l=1}^n k_h(x_l - x_i)$.
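For illustration, a minimal R sketch of this local constant fit with a Gaussian product kernel is given below; the helper names nw_weight_matrix and nw_fit are ours, and the choice of kernel is only one of the possibilities described above.

## Minimal sketch of the local constant (Nadaraya-Watson) fit mu_hat = K %*% y.
nw_weight_matrix <- function(X, h) {
  # X: n x d covariate matrix; h: vector of d bandwidths
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in seq_len(n)) {
    # product kernel: prod_j k_{h(j)}((x_j - x_i)/h(j)) / prod(h)
    scaled <- sweep(sweep(X, 2, X[i, ], "-"), 2, h, "/")
    kvals  <- apply(dnorm(scaled), 1, prod) / prod(h)
    K[i, ] <- kvals / sum(kvals)   # K_{ij} = k_h(x_j - x_i) / sum_l k_h(x_l - x_i)
  }
  K
}

nw_fit <- function(X, y, h) nw_weight_matrix(X, h) %*% y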

An appropriate choice of $h$ is essential to achieving a good estimator of $\mu$. Bandwidth selection has always been an intense subject of debate despite an extensive body of literature on the subject. The goal here is to consider not one but a combination of bandwidths and develop an estimator of $\mu$ on that basis. Let $h_1, h_2, \ldots, h_{M_n}$ each be a $d$-dimensional vector of bandwidths, with $h_m = \{h_{m(1)}, h_{m(2)}, \ldots, h_{m(d)}\}$ for $m = 1, 2, \ldots, M_n$. We allow $M_n$, the size of the bandwidth candidate set, to increase to infinity as $n\to\infty$. It is well known that if $h_{(1)}, h_{(2)}, \ldots, h_{(d)}$ all have the same order of magnitude, then the optimal choice of $h_{(1)}, h_{(2)}, \ldots, h_{(d)}$ for minimising the mean squared error is $h_{(j)} \sim n^{-1/(d+4)}$ for $j = 1, 2, \ldots, d$; see Li & Racine (2007).

Now, based on the $m$th candidate bandwidth, the kernel estimator of $\mu$ is
$$\hat\mu_m = K_m y,$$
where $K_m$ is the $n\times n$ weight matrix with $ij$th element $K_{m,ij} = k_{h_m}(x_j - x_i)\big/\sum_{l=1}^n k_{h_m}(x_l - x_i)$, so that its diagonal elements are $k_{h_m}(0)\big/\sum_{l=1}^n k_{h_m}(x_l - x_i)$, $i = 1, \ldots, n$.

An alternative to using one estimator of $\mu$ based on one of $h_1, h_2, \ldots, h_{M_n}$ as the bandwidth is to form a weighted average of the $\hat\mu_m$'s, $m = 1, \ldots, M_n$, each based on a different bandwidth. Now, let $w = (w_1, w_2, \ldots, w_{M_n})'$ be a weight vector in the following unit simplex of $\mathbb{R}^{M_n}$:
$$\mathcal{H}_n = \Big\{w \in [0,1]^{M_n} : \sum_{m=1}^{M_n}w_m = 1\Big\}.$$
We are interested in the following average or combined estimator of $\mu$:
$$\hat\mu(w) = \sum_{m=1}^{M_n}w_m\hat\mu_m = \sum_{m=1}^{M_n}w_m K_m y \equiv K(w)y.$$


By enlarging the space of possible estimators, $\hat\mu(w)$ allows us to gain information from all estimators obtained from the bandwidth candidate set. The major challenge confronting the implementation of $\hat\mu(w)$ is choosing appropriate weights for the individual estimators. The weights should reflect how much the data support the different estimates based on different bandwidths. Clearly, when $w$ contains an element of one in one entry and zero everywhere else, $\hat\mu(w)$ reduces to the kernel estimator obtained using just one single bandwidth. We propose a leave-one-out cross-validation method to choose the weights in $\hat\mu(w)$. As will be shown, this weight choice results in an estimator of $\mu$ with an optimal asymptotic property.

Now, let
$$\tilde\mu_m = \tilde K_m y,$$
where $\tilde K_m$ is the $n\times n$ matrix with $ij$th element
$$\tilde K_{m,ij} = \begin{cases}\dfrac{k_{h_m}(x_j - x_i)}{\sum_{l\ne i}k_{h_m}(x_l - x_i)}, & j\ne i,\\[2mm] 0, & j = i,\end{cases}$$
be the leave-one-out kernel estimator of $\mu$ based on a given bandwidth $h_m$. Correspondingly, the leave-one-out model averaging estimator of $\mu$ is
$$\tilde\mu(w) = \sum_{m=1}^{M_n}w_m\tilde\mu_m = \sum_{m=1}^{M_n}w_m\tilde K_m y \equiv \tilde K(w)y.$$
Write $\tilde e_m = y - \tilde\mu_m$ and $\tilde e = (\tilde e_1, \ldots, \tilde e_{M_n})$. Our choice of $w$ is based on a minimisation of the quadratic form
$$CV_n(w) = \|y - \tilde\mu(w)\|^2 = w'\tilde e'\tilde e w.$$
This yields $\hat w = \arg\min_{w\in\mathcal{H}_n}CV_n(w)$. Substituting $\hat w$ in $\hat\mu(w)$ leads to the estimator $\hat\mu(\hat w)$, which we shall refer to as the kernel averaging estimator (KAE). As $CV_n(w)$ is a quadratic function in $w$, the computation of $\hat w$ is simple and straightforward.
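To illustrate how simple the computation of $\hat w$ is, the following R sketch forms the leave-one-out matrices $\tilde K_m$ for a user-supplied list of candidate bandwidths and solves the resulting quadratic program with solve.QP; it reuses the illustrative nw_weight_matrix function above, and the small ridge term is added only to keep the quadratic form numerically positive definite.

library(quadprog)

kae_weights <- function(X, y, H, eps = 1e-10) {
  # H: list of candidate bandwidth vectors h_1, ..., h_Mn
  Mn <- length(H)
  mu_loo <- sapply(H, function(h) {
    K <- nw_weight_matrix(X, h)
    diag(K) <- 0                       # delete the own observation ...
    K <- K / rowSums(K)                # ... and renormalise each row
    as.numeric(K %*% y)                # leave-one-out fit  K_tilde_m %*% y
  })
  Dmat <- crossprod(mu_loo) + diag(eps, Mn)   # CV_n(w) = ||y - mu_loo %*% w||^2
  dvec <- drop(crossprod(mu_loo, y))
  Amat <- cbind(rep(1, Mn), diag(Mn))         # sum(w) = 1 (equality), w >= 0
  bvec <- c(1, rep(0, Mn))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}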

4.2 Preliminary results on asymptotic optimality

We have developed some preliminary results on the asymptotic properties of $\hat\mu(\hat w)$. Our development is based on the squared error loss function
$$L_n(w) = \|\hat\mu(w) - \mu\|^2.$$
We will show that $L_n(\hat w)$, the squared error loss of $\hat\mu(\hat w)$, is asymptotically identical to $\inf_{w\in\mathcal{H}_n}L_n(w)$, the infimum of the squared error loss defined above, as $n\to\infty$.

To develop our theory, let $A(w) = I_n - K(w)$ and define the conditional risk function
$$R_n(w) = E\{L_n(w)|X\} = \|A(w)\mu\|^2 + \mathrm{tr}\{K'(w)K(w)\Omega\}. \qquad (4.2)$$

Conformable to the definitions of $L_n(w)$ and $R_n(w)$, the leave-one-out squared error loss and risk functions may be written as
$$\tilde L_n(w) = \|\tilde\mu(w) - \mu\|^2$$


and
$$\tilde R_n(w) = E\{\tilde L_n(w)|X\} = \big\|\tilde A(w)\mu\big\|^2 + \mathrm{tr}\{\tilde K'(w)\tilde K(w)\Omega\} \qquad (4.3)$$
respectively, where $\tilde A(w) = I_n - \tilde K(w)$. Let $K_{m,ij}$ be the $ij$th element of the matrix $K_m$, $\xi_n = \inf_{w\in\mathcal{H}_n}R_n(w)$, $w_m^0$ be an $M_n\times 1$ vector with an element of one in the $m$th entry and zero everywhere else, and $\bar\lambda(\cdot)$ be the maximum singular value of a matrix.

The KAE has the following asymptotic property:

Theorem 4.1. As $n\to\infty$, if
$$E(e_i^{4G}) \le \kappa < \infty \text{ for all } i = 1, 2, \ldots, n, \qquad (4.4)$$
$$M_n\xi_n^{-2G}\sum_{m=1}^{M_n}\{R_n(w_m^0)\}^G \stackrel{a.s.}{\longrightarrow} 0 \qquad (4.5)$$
for some constant $\kappa$ and fixed integer $G$ $(1\le G<\infty)$,
$$\max_{1\le m\le M_n}\max_{1\le i\le n}\sum_{j=1}^n K_{m,ij} = O(1), \quad \max_{1\le m\le M_n}\max_{1\le j\le n}\sum_{i=1}^n K_{m,ij} = O(1), \ a.s., \qquad (4.6)$$
$$\frac{\mu'\mu}{n} = O(1), \ a.s., \qquad (4.7)$$
and
$$\max_{1\le m\le M_n}\max_{1\le i,j\le n}\frac{k_{h_m}(x_j - x_i)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)} = O(b_n), \ a.s., \qquad (4.8)$$
where $b_n$ is required to satisfy
$$\limsup_{n\to\infty}\,nb_n^2\log^4(n) < \infty, \ a.s., \qquad (4.9)$$
and
$$\xi_n^{-1}nb_n^2 = o(1), \ a.s., \qquad (4.10)$$
then
$$\frac{L_n(\hat w)}{\inf_{w\in\mathcal{H}_n}L_n(w)} \stackrel{p}{\longrightarrow} 1. \qquad (4.11)$$

By Theorem 4.1, $\hat\mu(\hat w)$, the KAE, is asymptotically optimal in the sense that its squared error loss is asymptotically identical to that of the infeasible best possible averaging estimator. The proof of Theorem 4.1 is given in Subsection 4.3 below. The following remarks may be made about the theorem and the conditions that form the basis for its validity.


Remark 4.1. Equation (4.4) imposes a restriction on the moments of the error term. Equation (4.5) has been widely used in studies of model averaging in linear regression, including papers by Wan et al. (2010), Liu & Okui (2013) and Ando & Li (2014). As this condition has never been applied to nonparametric regression, we discuss its implications for our analysis in detail in Subsection 4.3. Equation (4.6) places a limitation on the 1- and maximum-norms of the kernel function matrix $K_m$ and is almost the same as Assumption (i) of Speckman (1988). Equation (4.7) imposes a restriction on the average value of $\mu_i^2$, and is similar to Condition (11) of Wan et al. (2010). Given that the kernel function cannot be negative, Equation (4.8) implies that all elements of the kernel function matrix go to zero. Equations (4.8) and (4.9) are the same as Assumption 1.3.3 (ii) of Hardle et al. (2000). If we assume $\xi_n \sim n^{d/(d+4)}$, it is straightforward to obtain (4.10) from (4.9), implying that (4.10) is reasonable (note that $\xi_n^{-1}nb_n^2 = \xi_n^{-1}\log^{-4}(n)\times nb_n^2\log^4(n) \sim n^{-d/(d+4)}\log^{-4}(n)\times O(1) = o(1)\times O(1) = o(1)$, the $O(1)$ term following from (4.9)). The rationale for assuming $\xi_n \sim n^{d/(d+4)}$ is discussed in Subsection 4.3.

Remark 4.2. Suppose that we assume
$$\sup_{w\in\mathcal{H}_n}\left|\frac{\tilde R_n(w)}{R_n(w)} - 1\right| \stackrel{a.s.}{\longrightarrow} 0 \qquad (4.12)$$
and
$$\lim_{n\to\infty}\max_{1\le m\le M_n}\bar\lambda(\tilde K_m) \le \kappa_1 < \infty, \ a.s., \qquad (4.13)$$
for some constant $\kappa_1$ as $n\to\infty$. It can be seen from the proof of Theorem 4.1 in Subsection 4.3 that (4.12) and (4.13) can be used in lieu of (4.7)-(4.10) as the conditions for the asymptotic optimality (4.11). Equation (4.12) requires the difference between the leave-one-out risk function and the general risk function to decrease as $n$ increases, and is almost identical to Condition (A.5) of Hansen & Racine (2012) and Condition (10) of Zhang et al. (2013). Equation (4.13) bears a strong similarity to Condition (A.4) of Hansen & Racine (2012) and Condition (9) of Zhang et al. (2013). This condition places an upper bound on the maximum singular value of the leave-one-out kernel function matrix $\tilde K_m$. Even though (4.12) and (4.13) are more interpretable, we prefer (4.7)-(4.10) and consider them the formal conditions for Theorem 4.1 because they are less stringent than (4.12) and (4.13). That said, Equations (4.12) and (4.13) still carry significance for two reasons. First, as discussed above, these conditions are analogous to the conditions used in several other model averaging studies. Second, and more important, Theorem 4.1 is actually a direct result of (4.12) and (4.13); as shown in Subsection 4.3, we use Equations (4.7)-(4.10) to prove (4.12) and (4.13), which in turn lead to Theorem 4.1.

Thus, we have the following theorem as a supplement to Theorem 4.1:

Theorem 4.2. Other things being equal, the asymptotic optimality of $\hat\mu(\hat w)$ described in (4.11) still holds if Equations (4.7)-(4.10) are replaced by the stronger but more interpretable Equations (4.12) and (4.13).

The proof of Theorem 4.2 is contained in Subsection 4.3.


4.3 Further technical details and proofs of Theorems 4.1 and 4.2

4.3.1 An Elaboration of Equation (4.5)

In this subsection, we explain the rationale of Equation (4.5) under a nonparametric setup. For simplicity, consider the situation where the bandwidths $h_{m(j)}$, $j = 1, 2, \ldots, d$, all have the same order of magnitude. From Li & Racine (2007), the mean squared error of the kernel estimator of the regression function is
$$\mathrm{MSE}_m(x) = E\big\{\hat f_m(x) - f(x)\big\}^2 \asymp \Big(\sum_{j=1}^d h_{m(j)}^2\Big)^2 + \frac{1}{nh_{m(1)}h_{m(2)}\cdots h_{m(d)}}, \qquad (4.14)$$
where $\hat f_m(x) = \sum_{i=1}^n\frac{k_{h_m}(x_i - x)}{\sum_{j=1}^n k_{h_m}(x_j - x)}y_i$ is the kernel estimator with a $d$-dimensional bandwidth $h_m = \{h_{m(1)}, h_{m(2)}, \ldots, h_{m(d)}\}$. As $n\to\infty$, $h_{m(j)}\to 0$, and the optimal choice of $h_{m(j)}$ that minimises $\mathrm{MSE}_m(x)$ is $h_{m(j),opt} \sim n^{-1/(d+4)}$. Under the optimal choice of $h_{m(j)}$, we have $\mathrm{MSE}_m(x) \sim \big(\sum_{j=1}^d h_{m(j),opt}^2\big)^2 \sim n^{-4/(d+4)}$, and thus $R_n(w_m^0) \sim n^{d/(d+4)} \to\infty$. As $\xi_n = \inf_{w\in\mathcal{H}_n}R_n(w)$, we have
$$\xi_n = O(n^{d/(d+4)}). \qquad (4.15)$$

Thus, it is reasonable to assume that $\xi_n \sim n^{d/(d+4)}$. If we apply the optimal choice of bandwidth, i.e., $h_{m(j)} \sim n^{-1/(d+4)}$, to all candidate models, then by (4.14) and $\xi_n \sim n^{d/(d+4)}$, we have
$$\begin{aligned}
M_n\xi_n^{-2G}\sum_{m=1}^{M_n}\{R_n(w_m^0)\}^G
&\sim M_n^2 n^{-\frac{2dG}{d+4}}\Bigg\{n^G\Big(\sum_{j=1}^d h_{m(j)}^2\Big)^{2G} + \Big(\frac{1}{h_{m(1)}h_{m(2)}\cdots h_{m(d)}}\Big)^G\Bigg\}\\
&\sim \max\Bigg\{M_n^2 n^{-\frac{2dG}{d+4}}n^G\Big(\sum_{j=1}^d h_{m(j)}^2\Big)^{2G},\ M_n^2\frac{n^{-\frac{2dG}{d+4}}}{(h_{m(1)}h_{m(2)}\cdots h_{m(d)})^G}\Bigg\}. \qquad (4.16)
\end{aligned}$$
As $h_{m(j)} \sim n^{-1/(d+4)}$, the two quantities inside the bracket in (4.16) have the same order. Thus, if
$$M_n = o\big(n^{\frac{dG}{2(d+4)}}\big),$$
then
$$M_n\xi_n^{-2G}\sum_{m=1}^{M_n}\{R_n(w_m^0)\}^G \sim M_n^2\Big(n^{-\frac{2dG}{d+4}}\frac{1}{h_{m(j)}^{dG}}\Big) \sim M_n^2\Big(n^{-\frac{2dG}{d+4}}n^{\frac{dG}{d+4}}\Big) \stackrel{a.s.}{\longrightarrow} 0,$$
which is Equation (4.5).
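For concreteness, specialising the argument above to $d = 1$ (a special case rather than an additional assumption): the optimal bandwidth is $h_{m(1)} \sim n^{-1/5}$, so $\mathrm{MSE}_m(x) \sim n^{-4/5}$, $R_n(w_m^0) \sim n^{1/5}$ and $\xi_n \sim n^{1/5}$, whence
$$M_n\xi_n^{-2G}\sum_{m=1}^{M_n}\{R_n(w_m^0)\}^G \sim M_n^2\,n^{-2G/5}\,n^{G/5} = M_n^2\,n^{-G/5} \stackrel{a.s.}{\longrightarrow} 0$$
whenever $M_n = o(n^{G/10})$, which agrees with the general requirement $M_n = o(n^{dG/\{2(d+4)\}})$.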

4.3.2 Useful Preliminary Results

In this subsection, we provide a lemma that is useful for proving Theorem 4.1. Let
$$F_m = \mathrm{diag}\Bigg(\frac{k_{h_m}(0)}{\sum_{j\ne 1}k_{h_m}(x_j - x_1)}, \frac{k_{h_m}(0)}{\sum_{j\ne 2}k_{h_m}(x_j - x_2)}, \ldots, \frac{k_{h_m}(0)}{\sum_{j\ne n}k_{h_m}(x_j - x_n)}\Bigg)$$
and
$$B_m = I_n + F_m = \mathrm{diag}\Bigg(\frac{\sum_{j=1}^n k_{h_m}(x_j - x_1)}{\sum_{j\ne 1}k_{h_m}(x_j - x_1)}, \ldots, \frac{\sum_{j=1}^n k_{h_m}(x_j - x_n)}{\sum_{j\ne n}k_{h_m}(x_j - x_n)}\Bigg).$$
Then $\tilde K_m$ and $K_m$ satisfy the following relationship:
$$\tilde K_m - I_n = B_m(K_m - I_n) = (I_n + F_m)(K_m - I_n). \qquad (4.17)$$

Lemma 4.1. If Equations (4.4), (4.6) and (4.8)-(4.10) hold, then
$$\max_{1\le m\le M_n}\bar\lambda(K_m) = O(1), \ a.s., \qquad (4.18)$$
$$\bar\lambda(\Omega) = O(1), \ a.s., \qquad (4.19)$$
$$\max_{1\le m\le M_n}\bar\lambda(B_m) = 1 + O(b_n), \ a.s., \qquad (4.20)$$
$$\max_{1\le m\le M_n}\bar\lambda(F_m) = O(b_n), \ a.s., \qquad (4.21)$$
$$\max_{1\le m\le M_n}\mathrm{tr}(F_m) = O(nb_n), \ a.s., \qquad (4.22)$$
$$\max_{1\le m\le M_n}\mathrm{tr}(K_m) = O(nb_n), \ a.s., \qquad (4.23)$$
and
$$\max_{1\le m,t\le M_n}\mathrm{tr}(K_m'K_t) = O(nb_n), \ a.s., \qquad (4.24)$$
where $\bar\lambda(\cdot)$ denotes the maximum singular value of a matrix.


Proof. By Riesz's inequality (Hardy et al., 1952; Speckman, 1988) and Equation (4.6), it can be shown that
$$\max_{1\le m\le M_n}\bar\lambda(K_m) \le \max_{1\le m\le M_n}\Bigg(\max_{1\le i\le n}\sum_{j=1}^n|K_{m,ij}|\max_{1\le j\le n}\sum_{i=1}^n|K_{m,ij}|\Bigg)^{\frac12} = O(1), \ a.s.,$$
which is Equation (4.18). Equation (4.4) guarantees that the second and higher-order moments of the random error terms are uniformly bounded, which implies that (4.19) is true. By Equation (4.8), we have
$$\begin{aligned}
\max_{1\le m\le M_n}\bar\lambda(B_m)
&= \max_{1\le m\le M_n}\max_{1\le i\le n}\frac{\sum_{l=1}^n k_{h_m}(x_l - x_i)}{\sum_{l\ne i}k_{h_m}(x_l - x_i)}\\
&= \max_{1\le m\le M_n}\max_{1\le i\le n}\frac{1}{1 - \frac{k_{h_m}(0)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}}\\
&\le \frac{1}{1 - \max_{1\le m\le M_n}\max_{1\le i,j\le n}\frac{k_{h_m}(x_j - x_i)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}}\\
&= 1 + O(b_n), \ a.s.,
\end{aligned}$$

which is (4.20). Similarly, by Equations (4.6) and (4.8), we have
$$\begin{aligned}
\max_{1\le m\le M_n}\bar\lambda(F_m)
&= \max_{1\le m\le M_n}\max_{1\le i\le n}\Bigg\{\frac{\sum_{l=1}^n k_{h_m}(x_l - x_i)}{\sum_{l\ne i}k_{h_m}(x_l - x_i)} - 1\Bigg\}\\
&= \max_{1\le m\le M_n}\max_{1\le i\le n}\frac{k_{h_m}(0)}{\sum_{l\ne i}k_{h_m}(x_l - x_i)}\\
&= \max_{1\le m\le M_n}\max_{1\le i\le n}\frac{k_{h_m}(0)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}\cdot\frac{\sum_{l=1}^n k_{h_m}(x_l - x_i)}{\sum_{l\ne i}k_{h_m}(x_l - x_i)}\\
&= \max_{1\le m\le M_n}\max_{1\le i\le n}\frac{k_{h_m}(0)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}\cdot\frac{1}{1 - \frac{k_{h_m}(0)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}}\\
&\le \max_{1\le m\le M_n}\max_{1\le i\le n}\frac{k_{h_m}(0)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}\cdot\frac{1}{1 - \max_{1\le i\le n}\frac{k_{h_m}(0)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}}\\
&\le \max_{1\le m\le M_n}\max_{1\le i,j\le n}\frac{k_{h_m}(x_j - x_i)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}\cdot\frac{1}{1 - \max_{1\le m\le M_n}\max_{1\le i,j\le n}\frac{k_{h_m}(x_j - x_i)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}}\\
&= O(b_n)\{1 + O(b_n)\}\\
&= O(b_n), \ a.s.,
\end{aligned}$$

$$\begin{aligned}
\max_{1\le m\le M_n}\mathrm{tr}(F_m)
&= \max_{1\le m\le M_n}\sum_{i=1}^n\frac{k_{h_m}(0)}{\sum_{l\ne i}k_{h_m}(x_l - x_i)}\\
&\le n\max_{1\le m\le M_n}\max_{1\le i\le n}\frac{k_{h_m}(0)}{\sum_{l\ne i}k_{h_m}(x_l - x_i)}\\
&= O(nb_n), \ a.s.,
\end{aligned}$$
$$\begin{aligned}
\max_{1\le m\le M_n}\mathrm{tr}(K_m)
&= \max_{1\le m\le M_n}\sum_{i=1}^n\frac{k_{h_m}(0)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}\\
&\le n\max_{1\le m\le M_n}\max_{1\le i\le n}\frac{k_{h_m}(0)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}\\
&= O(nb_n), \ a.s.,
\end{aligned}$$

and
$$\begin{aligned}
\max_{1\le m,t\le M_n}\mathrm{tr}(K_m'K_t)
&= \max_{1\le m,t\le M_n}\sum_{i=1}^n\sum_{j=1}^n\frac{k_{h_m}(x_j - x_i)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}\cdot\frac{k_{h_t}(x_j - x_i)}{\sum_{l=1}^n k_{h_t}(x_l - x_i)}\\
&\le \max_{1\le m,t\le M_n}\sum_{i=1}^n\Bigg(\max_{1\le i\le n}\sum_{j=1}^n\frac{k_{h_m}(x_j - x_i)}{\sum_{l=1}^n k_{h_m}(x_l - x_i)}\Bigg)\Bigg(\max_{1\le i,j\le n}\frac{k_{h_t}(x_j - x_i)}{\sum_{l=1}^n k_{h_t}(x_l - x_i)}\Bigg)\\
&= n\Bigg(\max_{1\le m\le M_n}\max_{1\le i\le n}\sum_{j=1}^n K_{m,ij}\Bigg)\Bigg(\max_{1\le t\le M_n}\max_{1\le i,j\le n}\frac{k_{h_t}(x_j - x_i)}{\sum_{l=1}^n k_{h_t}(x_l - x_i)}\Bigg)\\
&= n\,O(1)\,O(b_n)\\
&= O(nb_n), \ a.s.,
\end{aligned}$$
which are Equations (4.21)-(4.24) respectively.

4.3.3 Proofs of Theorems 4.1 and 4.2

We begin our proof by assuming that Equations (4.12) and (4.13) hold, i.e.,
$$\sup_{w\in\mathcal{H}_n}\left|\frac{\tilde R_n(w)}{R_n(w)} - 1\right| \stackrel{a.s.}{\longrightarrow} 0 \qquad (4.25)$$
and
$$\lim_{n\to\infty}\max_{1\le m\le M_n}\bar\lambda(\tilde K_m) \le \kappa_1 < \infty, \ a.s. \qquad (4.26)$$
Part 1 provides the proof of Theorem 4.2, which states that $\hat\mu(\hat w)$ is asymptotically optimal in the sense of (4.11) when (4.12) and (4.13) are fulfilled. In Part 2, we show that Equations (4.12) and (4.13) are implied by Equations (4.7)-(4.10).

Part 1.


Let $\tilde\xi_n = \inf_{w\in\mathcal{H}_n}\tilde R_n(w)$. Using Equations (4.12) and (4.5), together with equation (A.1) in Zhang et al. (2013), we obtain
$$M_n\tilde\xi_n^{-2G}\sum_{m=1}^{M_n}\{\tilde R_n(w_m^0)\}^G \stackrel{a.s.}{\longrightarrow} 0. \qquad (4.27)$$
Our next task is to verify that
$$\frac{\tilde L_n(\hat w)}{\inf_{w\in\mathcal{H}_n}\tilde L_n(w)} \stackrel{p}{\longrightarrow} 1. \qquad (4.28)$$
Now, observe that
$$CV_n(w) = \tilde L_n(w) + \|e\|^2 + 2\mu'\tilde A'(w)e - 2e'\tilde K(w)e.$$
As $\|e\|^2$ is independent of $w$, to prove (4.28), it suffices to show that
$$\sup_{w\in\mathcal{H}_n}\big|\mu'\tilde A'(w)e\big|\big/\tilde R_n(w) \stackrel{p}{\longrightarrow} 0, \qquad (4.29)$$
$$\sup_{w\in\mathcal{H}_n}\big|e'\tilde K(w)e\big|\big/\tilde R_n(w) \stackrel{p}{\longrightarrow} 0, \qquad (4.30)$$
and
$$\sup_{w\in\mathcal{H}_n}\big|\tilde L_n(w)/\tilde R_n(w) - 1\big| \stackrel{p}{\longrightarrow} 0. \qquad (4.31)$$

For purposes of convenience, we assume that $X$ is non-random. This will not invalidate our proof, because all our technical conditions except (4.4) hold almost surely. Based on the proof of (A.5) in Zhang et al. (2013), we can obtain (4.29) directly. Using (4.27), Chebyshev's inequality and Theorem 2 of Whittle (1960), it is observed that for any $\delta > 0$,
$$\begin{aligned}
\mathrm{Pr}\Big\{\sup_{w\in\mathcal{H}_n}\big|e'\tilde K(w)e\big|\big/\tilde R_n(w) > \delta\Big\}
&\le \sum_{m=1}^{M_n}\mathrm{Pr}\big\{\big|e'\tilde K(w_m^0)e\big| > \delta\tilde\xi_n\big\}\\
&\le \sum_{m=1}^{M_n}\delta^{-2G}\tilde\xi_n^{-2G}E\big\{(\Omega^{-\frac12}e)'\Omega^{\frac12}\tilde K(w_m^0)\Omega^{\frac12}(\Omega^{-\frac12}e)\big\}^{2G}\\
&\le C_1\delta^{-2G}\tilde\xi_n^{-2G}\sum_{m=1}^{M_n}\Big[\mathrm{trace}\big\{\Omega^{\frac12}\tilde K(w_m^0)\Omega\tilde K'(w_m^0)\Omega^{\frac12}\big\}\Big]^G\\
&\le C_1\delta^{-2G}\tilde\xi_n^{-2G}\sum_{m=1}^{M_n}\{\bar\lambda(\Omega)\}^G\Big[\mathrm{trace}\big\{\tilde K(w_m^0)\tilde K'(w_m^0)\Omega\big\}\Big]^G\\
&\le C_2\delta^{-2G}\tilde\xi_n^{-2G}\sum_{m=1}^{M_n}\{\tilde R_n(w_m^0)\}^G \to 0, \qquad (4.32)
\end{aligned}$$
where $C_1$ and $C_2$ are constants. Expression (4.30) is implied by (4.32). Note that
$$\big|\tilde L_n(w) - \tilde R_n(w)\big| = \Big|\|\tilde K(w)e\|^2 - \mathrm{trace}\big\{\tilde K'(w)\tilde K(w)\Omega\big\} - 2\mu'\tilde A'(w)\tilde K(w)e\Big|.$$


Hence, to prove (4.31), it suffices to show that
$$\sup_{w\in\mathcal{H}_n}\big|\mu'\tilde A'(w)\tilde K(w)e\big|\big/\tilde R_n(w) \stackrel{p}{\longrightarrow} 0 \qquad (4.33)$$
and
$$\sup_{w\in\mathcal{H}_n}\Big|\|\tilde K(w)e\|^2 - \mathrm{trace}\big\{\tilde K'(w)\tilde K(w)\Omega\big\}\Big|\big/\tilde R_n(w) \stackrel{p}{\longrightarrow} 0. \qquad (4.34)$$
By (4.19), (4.27), (4.4) and (4.13), (4.33) and (4.34) can be proved along the lines of the proofs of (A.4) and (A.5) in Wan et al. (2010). This completes the proof of (4.28). Likewise, using (4.18) and the technique for deriving (4.31), we have
$$\sup_{w\in\mathcal{H}_n}\big|L_n(w)/R_n(w) - 1\big| \stackrel{p}{\longrightarrow} 0. \qquad (4.35)$$
Now, by (4.28), (4.31), (4.35) and (4.12), and following the proof of (A.16) in Zhang et al. (2013), we can obtain (4.11).

Part 2.

Given the proof of Theorem 4.2 shown in Part 1, to prove Theorem 4.1, we only need to prove that Equations (4.12) and (4.13) hold when Equations (4.7)-(4.10) are satisfied.

limn→∞

max1≤m≤Mn

λ(Km) = limn→∞

max1≤m≤Mn

λBm(Km − In) + In

≤ limn→∞

max1≤m≤Mn

[λ(Bm)λ(Km) + λ(In)+ λ(In)]

≤ κ1 <∞. a.s.

This leads to Equation (4.13). By (4.17), we obtain

In − K(w) =Mn∑m=1

wm(In − Km)

=Mn∑m=1

wm(In + Fm)(In −Km)

= In −K(w) +Mn∑m=1

wmFm(In −Km), (4.36)

and

Km = (In + Fm)(Km − In) + In = Km + FmKm − Fm. (4.37)

Also, by (4.18), (4.21), (4.7) and (4.10), we have
$$\begin{aligned}
&\xi_n^{-1}\sup_{w\in\mathcal{H}_n}\Bigg\|\Bigg\{\sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\Bigg\}\mu\Bigg\|^2\\
&= \xi_n^{-1}\sup_{w\in\mathcal{H}_n}\Bigg|\sum_{m=1}^{M_n}\sum_{t=1}^{M_n}w_mw_t\,\mu'(I_n - K_m)'F_m'F_t(I_n - K_t)\mu\Bigg|\\
&\le \xi_n^{-1}\max_{1\le m,t\le M_n}\big|\mu'(I_n - K_m)'F_m'F_t(I_n - K_t)\mu\big|\\
&= \xi_n^{-1}\max_{1\le m,t\le M_n}\Bigg|\mu'\,\frac{(I_n - K_m)'F_m'F_t(I_n - K_t) + (I_n - K_t)'F_t'F_m(I_n - K_m)}{2}\,\mu\Bigg|\\
&\le \xi_n^{-1}\max_{1\le m,t\le M_n}\bar\lambda\big\{(I_n - K_m)'F_m'F_t(I_n - K_t)\big\}|\mu'\mu|\\
&\le \xi_n^{-1}\max_{1\le m\le M_n}\bar\lambda^2(I_n - K_m)\max_{1\le m\le M_n}\bar\lambda^2(F_m)|\mu'\mu|\\
&\le \xi_n^{-1}\max_{1\le m\le M_n}\{\bar\lambda(I_n) + \bar\lambda(K_m)\}^2\max_{1\le m\le M_n}\bar\lambda^2(F_m)|\mu'\mu|\\
&= \xi_n^{-1}\{1 + O(1)\}^2O(b_n^2)O(n)\\
&= \xi_n^{-1}O(nb_n^2)\\
&= o(1), \ a.s. \qquad (4.38)
\end{aligned}$$

Furthermore, by (4.19), (4.21) and (4.24), and recognising that $F_t$ and $\Omega$ are diagonal matrices and the diagonal elements of $K_m'F_tK_t$ are non-negative, we obtain
$$\begin{aligned}
\max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_tK_t\Omega)\big|
&= \max_{1\le m,t\le M_n}\big|\mathrm{tr}\{\mathrm{diag}(K_m'F_tK_t)\Omega\}\big|\\
&\le \bar\lambda(\Omega)\max_{1\le m,t\le M_n}\mathrm{tr}(K_m'F_tK_t)\\
&= \bar\lambda(\Omega)\max_{1\le m,t\le M_n}\mathrm{tr}(K_tK_m'F_t)\\
&\le \bar\lambda(\Omega)\max_{1\le t\le M_n}\bar\lambda(F_t)\max_{1\le m,t\le M_n}\mathrm{tr}(K_tK_m')\\
&= O(1)O(b_n)O(nb_n)\\
&= O(nb_n^2), \ a.s. \qquad (4.39)
\end{aligned}$$

Similarly, by (4.19) and (4.21)-(4.24), we obtain
$$\max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_t\Omega)\big|
\le \bar\lambda(\Omega)\max_{1\le t\le M_n}\bar\lambda(F_t)\max_{1\le m\le M_n}\mathrm{tr}(K_m)
= O(1)O(b_n)O(nb_n) = O(nb_n^2), \ a.s., \qquad (4.40)$$
$$\begin{aligned}
\max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_mK_t\Omega)\big|
&= \max_{1\le m,t\le M_n}\big|\mathrm{tr}\{\mathrm{diag}(K_m'F_mK_t)\Omega\}\big|\\
&\le \bar\lambda(\Omega)\max_{1\le m,t\le M_n}\mathrm{tr}(K_m'F_mK_t)
= \bar\lambda(\Omega)\max_{1\le m,t\le M_n}\mathrm{tr}(K_tK_m'F_m)\\
&\le \bar\lambda(\Omega)\max_{1\le m\le M_n}\bar\lambda(F_m)\max_{1\le m,t\le M_n}\mathrm{tr}(K_tK_m')
= O(1)O(b_n)O(nb_n) = O(nb_n^2), \ a.s., \qquad (4.41)
\end{aligned}$$
$$\begin{aligned}
\max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_mF_tK_t\Omega)\big|
&= \max_{1\le m,t\le M_n}\big|\mathrm{tr}\{\mathrm{diag}(K_m'F_mF_tK_t)\Omega\}\big|\\
&\le \bar\lambda(\Omega)\max_{1\le m,t\le M_n}\mathrm{tr}(K_m'F_mF_tK_t)
= \bar\lambda(\Omega)\max_{1\le m,t\le M_n}\mathrm{tr}(K_tK_m'F_mF_t)\\
&\le \bar\lambda(\Omega)\max_{1\le m\le M_n}\bar\lambda^2(F_m)\max_{1\le m,t\le M_n}\mathrm{tr}(K_tK_m')
= O(1)O(b_n^2)O(nb_n) = O(nb_n^3), \ a.s., \qquad (4.42)
\end{aligned}$$
$$\max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_mF_t\Omega)\big|
\le \bar\lambda(\Omega)\max_{1\le m\le M_n}\bar\lambda^2(F_m)\max_{1\le m\le M_n}\mathrm{tr}(K_m)
= O(1)O(b_n^2)O(nb_n) = O(nb_n^3), \ a.s., \qquad (4.43)$$
$$\max_{1\le m,t\le M_n}\big|\mathrm{tr}(F_mK_t\Omega)\big|
\le \bar\lambda(\Omega)\max_{1\le m\le M_n}\bar\lambda(F_m)\max_{1\le t\le M_n}\mathrm{tr}(K_t)
= O(1)O(b_n)O(nb_n) = O(nb_n^2), \ a.s., \qquad (4.44)$$
$$\max_{1\le m,t\le M_n}\big|\mathrm{tr}(F_mF_tK_t\Omega)\big|
\le \bar\lambda(\Omega)\max_{1\le m\le M_n}\bar\lambda^2(F_m)\max_{1\le t\le M_n}\mathrm{tr}(K_t)
= O(1)O(b_n^2)O(nb_n) = O(nb_n^3), \ a.s., \qquad (4.45)$$
and
$$\max_{1\le m,t\le M_n}\big|\mathrm{tr}(F_mF_t\Omega)\big|
\le \bar\lambda(\Omega)\max_{1\le m\le M_n}\bar\lambda(F_m)\max_{1\le t\le M_n}\mathrm{tr}(F_t)
= O(1)O(b_n)O(nb_n) = O(nb_n^2), \ a.s. \qquad (4.46)$$

By (4.37) and (4.39)-(4.46), we have
$$\begin{aligned}
&\sup_{w\in\mathcal{H}_n}\Big|\mathrm{tr}\big\{\tilde K'(w)\tilde K(w)\Omega\big\} - \mathrm{tr}\big\{K'(w)K(w)\Omega\big\}\Big|\\
&= \sup_{w\in\mathcal{H}_n}\Bigg|\sum_{m=1}^{M_n}\sum_{t=1}^{M_n}w_mw_t\big[\mathrm{tr}(\tilde K_m'\tilde K_t\Omega) - \mathrm{tr}(K_m'K_t\Omega)\big]\Bigg|\\
&\le \sup_{w\in\mathcal{H}_n}\sum_{m=1}^{M_n}\sum_{t=1}^{M_n}w_mw_t\big|\mathrm{tr}(\tilde K_m'\tilde K_t\Omega) - \mathrm{tr}(K_m'K_t\Omega)\big|\\
&\le \max_{1\le m,t\le M_n}\big|\mathrm{tr}(\tilde K_m'\tilde K_t\Omega) - \mathrm{tr}(K_m'K_t\Omega)\big|\\
&= \max_{1\le m,t\le M_n}\big|\mathrm{tr}\big\{(\tilde K_m'\tilde K_t - K_m'K_t)\Omega\big\}\big|\\
&= \max_{1\le m,t\le M_n}\big|\mathrm{tr}\big[\big\{(K_m' + K_m'F_m - F_m)(K_t + F_tK_t - F_t) - K_m'K_t\big\}\Omega\big]\big|\\
&= \max_{1\le m,t\le M_n}\big|\mathrm{tr}\big(K_m'F_tK_t\Omega - K_m'F_t\Omega + K_m'F_mK_t\Omega + K_m'F_mF_tK_t\Omega\\
&\qquad\qquad - K_m'F_mF_t\Omega - F_mK_t\Omega - F_mF_tK_t\Omega + F_mF_t\Omega\big)\big|\\
&\le \max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_tK_t\Omega)\big| + \max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_t\Omega)\big| + \max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_mK_t\Omega)\big|\\
&\quad + \max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_mF_tK_t\Omega)\big| + \max_{1\le m,t\le M_n}\big|\mathrm{tr}(K_m'F_mF_t\Omega)\big| + \max_{1\le m,t\le M_n}\big|\mathrm{tr}(F_mK_t\Omega)\big|\\
&\quad + \max_{1\le m,t\le M_n}\big|\mathrm{tr}(F_mF_tK_t\Omega)\big| + \max_{1\le m,t\le M_n}\big|\mathrm{tr}(F_mF_t\Omega)\big|\\
&= O(nb_n^2) + O(nb_n^2) + O(nb_n^2) + O(nb_n^3) + O(nb_n^3) + O(nb_n^2) + O(nb_n^3) + O(nb_n^2)\\
&= O(nb_n^2), \ a.s. \qquad (4.47)
\end{aligned}$$

Finally, by (4.2), (4.36), (4.38), (4.47) and Equation (4.10), we obtain
$$\begin{aligned}
\sup_{w\in\mathcal{H}_n}\left|\frac{\tilde R_n(w)}{R_n(w)} - 1\right|
&= \sup_{w\in\mathcal{H}_n}\left|\frac{\tilde R_n(w) - R_n(w)}{R_n(w)}\right|\\
&\le \sup_{w\in\mathcal{H}_n}\left|\frac{\|(I_n - \tilde K(w))\mu\|^2 - \|(I_n - K(w))\mu\|^2}{R_n(w)}\right|\\
&\quad + \xi_n^{-1}\sup_{w\in\mathcal{H}_n}\Big|\mathrm{tr}\big\{\tilde K'(w)\tilde K(w)\Omega\big\} - \mathrm{tr}\big\{K'(w)K(w)\Omega\big\}\Big|\\
&= \sup_{w\in\mathcal{H}_n}\left|\frac{\big\|\big[I_n - K(w) + \sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\big]\mu\big\|^2 - \|(I_n - K(w))\mu\|^2}{R_n(w)}\right|\\
&\quad + \xi_n^{-1}\sup_{w\in\mathcal{H}_n}\Big|\mathrm{tr}\big\{\tilde K'(w)\tilde K(w)\Omega\big\} - \mathrm{tr}\big\{K'(w)K(w)\Omega\big\}\Big|\\
&\le \sup_{w\in\mathcal{H}_n}\frac{\big\|\sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\mu\big\|^2}{R_n(w)}
+ 2\sup_{w\in\mathcal{H}_n}\left|\frac{\mu'(I_n - K(w))'\sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\mu}{R_n(w)}\right|\\
&\quad + \xi_n^{-1}\sup_{w\in\mathcal{H}_n}\Big|\mathrm{tr}\big\{\tilde K'(w)\tilde K(w)\Omega\big\} - \mathrm{tr}\big\{K'(w)K(w)\Omega\big\}\Big|\\
&\le \sup_{w\in\mathcal{H}_n}\frac{\big\|\sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\mu\big\|^2}{R_n(w)}
+ 2\sup_{w\in\mathcal{H}_n}\frac{\|(I_n - K(w))\mu\|\,\big\|\sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\mu\big\|}{R_n(w)}\\
&\quad + \xi_n^{-1}\sup_{w\in\mathcal{H}_n}\Big|\mathrm{tr}\big\{\tilde K'(w)\tilde K(w)\Omega\big\} - \mathrm{tr}\big\{K'(w)K(w)\Omega\big\}\Big|\\
&\le \sup_{w\in\mathcal{H}_n}\frac{\big\|\sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\mu\big\|^2}{R_n(w)}
+ 2\sup_{w\in\mathcal{H}_n}\frac{\big\|\sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\mu\big\|}{R_n^{1/2}(w)}\\
&\quad + \xi_n^{-1}\sup_{w\in\mathcal{H}_n}\Big|\mathrm{tr}\big\{\tilde K'(w)\tilde K(w)\Omega\big\} - \mathrm{tr}\big\{K'(w)K(w)\Omega\big\}\Big|\\
&\le \xi_n^{-1}\sup_{w\in\mathcal{H}_n}\Bigg\|\Bigg\{\sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\Bigg\}\mu\Bigg\|^2
+ 2\xi_n^{-1/2}\sup_{w\in\mathcal{H}_n}\Bigg\|\Bigg\{\sum_{m=1}^{M_n}w_mF_m(I_n - K_m)\Bigg\}\mu\Bigg\|\\
&\quad + \xi_n^{-1}\sup_{w\in\mathcal{H}_n}\Big|\mathrm{tr}\big\{\tilde K'(w)\tilde K(w)\Omega\big\} - \mathrm{tr}\big\{K'(w)K(w)\Omega\big\}\Big|\\
&= \xi_n^{-1}O(nb_n^2) + 2\{\xi_n^{-1}O(nb_n^2)\}^{1/2} + \xi_n^{-1}O(nb_n^2)\\
&= o(1), \ a.s.
\end{aligned}$$
This completes the proof of Theorem 4.1.

References

ANDO, T. & LI, K. C. (2014). A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association 109, 254-265.

BOWDEN, R. (1973). The theory of parametric identification. Econometrica 41, 1069-1074.

BUCKLAND, S. T., BURNHAM, K. P. & AUGUSTIN, N. H. (1997). Model selection: an integral part of inference. Biometrics 53, 603-618.

FAN, J. & LV, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society (Series B) 70, 849-911.

FERGUSON, T. S. (1996). A Course in Large Sample Theory, vol. 49. Chapman & Hall: London.

HANSEN, B. E. & RACINE, J. S. (2012). Jackknife model averaging. Journal of Econometrics 167, 38-46.

HARDLE, W., LIANG, H. & GAO, J. (2000). Partially Linear Models. Springer: Berlin.

HARDY, G. H., LITTLEWOOD, J. E. & POLYA, G. (1952). Inequalities. Cambridge University Press: New York.

HE, X., WANG, L. & HONG, H. G. (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. The Annals of Statistics 41, 342-369.

HJORT, N. L. & CLAESKENS, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association 98, 879-899.

KUIPER, R. M., HOIJTINK, H. & SILVAPULLE, M. (2011). An Akaike-type information criterion for model selection under inequality constraints. Biometrika 98, 495-501.

LI, Q. & RACINE, J. S. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press: Princeton.

LIU, Q. & OKUI, R. (2013). Heteroskedasticity-robust Cp model averaging. Econometrics Journal, 463-472.

LU, X. & SU, L. (2015). Jackknife model averaging for quantile regressions. Journal of Econometrics 188, 40-58.

NADARAYA, E. A. (1964). On estimating regression. Theory of Probability and Its Applications 9, 141-142.

ROY, K. P. (1957). A note on the asymptotic distribution of likelihood ratio. Calcutta Statistical Association Bulletin 7, 73-77.

SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics, vol. 162. John Wiley & Sons.

SPECKMAN, P. (1988). Kernel smoothing in partially linear models. Journal of the Royal Statistical Society (Series B) 50, 413-436.

VAN DER VAART, A. W. & WELLNER, J. A. (1996). Weak Convergence and Empirical Processes with Application to Statistics. Springer: Berlin.

WAN, A. T. K., ZHANG, X. & ZOU, G. (2010). Least squares model averaging by Mallows criterion. Journal of Econometrics 156, 277-283.

WATSON, G. S. (1964). Smooth regression analysis. Sankhya: The Indian Journal of Statistics (Series A) 26, 359-372.

WHITE, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50, 1-25.

WHITTLE, P. (1960). Bounds for the moments of linear and quadratic forms in independent variables. Theory of Probability and Its Applications 5, 302-305.

ZHANG, X., WAN, A. T. K. & ZOU, G. (2013). Model averaging by jackknife criterion in models with dependent data. Journal of Econometrics 174, 82-94.
