Rates of Convergence in M-Estimation
With an Example from Current Status Data
Lu Mao
Department of Biostatistics
The University of North Carolina at Chapel Hill
Email: [email protected]
Lu Mao (UNC CH) Dec 4, 2013 1 / 23
1 M-Estimation
    Motivation
    Z-Estimation
    Consistency
    Distribution
    Rate of convergence
2 Example: Current Status Data
    Description of problem
    Characterization of MLE
    Consistency and distributional results
MOTIVATION
Examples:
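Parametric model: $\{p_\theta(x) : \theta \in \Theta\}$
$$\theta_0 = \arg\max_{\theta \in \Theta} P \log p_\theta$$
Regression: $Y = g_0(X) + \varepsilon$, $E(\varepsilon \mid X) = 0$, $g_0 \in \mathcal{G}$
$$g_0 = \arg\min_{g \in \mathcal{G}} P (Y - g(X))^2$$
Natural estimators:
MLE: $\hat\theta = \arg\max_{\theta \in \Theta} P_n \log p_\theta$
LSE: $\hat g = \arg\min_{g \in \mathcal{G}} P_n (Y - g(X))^2$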
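As a minimal sketch of the MLE as a maximizer of the empirical criterion $P_n \log p_\theta$, the snippet below uses an exponential model with rate $\theta$. The model, the sample size, the search interval, and the helper names `golden_max` and `exp_loglik` are illustrative assumptions, not part of the talk. The golden-section search assumes the criterion is unimodal on the search interval, which holds here because the exponential log-likelihood is concave in $\theta$.

```python
import math
import random

def golden_max(f, lo, hi, tol=1e-8):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d          # the maximizer lies in [a, d]
        else:
            a = c          # the maximizer lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2

def exp_loglik(theta, xs):
    """Empirical criterion P_n log p_theta for the density theta * exp(-theta * x)."""
    return sum(math.log(theta) - theta * x for x in xs) / len(xs)

random.seed(0)
# Assumption for illustration: true rate theta_0 = 2, sample size 500.
xs = [random.expovariate(2.0) for _ in range(500)]

theta_hat = golden_max(lambda t: exp_loglik(t, xs), 0.01, 20.0)
closed_form = len(xs) / sum(xs)   # known closed-form MLE of the exponential rate
print(theta_hat, closed_form)
```

The numerical maximizer should agree with the closed-form value $1/\bar X_n$ up to the search tolerance, which is a quick sanity check that the estimator really is the argmax of $P_n \log p_\theta$.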
INTRODUCTION
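M-estimator:
$$\hat\theta_n = \arg\max_{\theta \in \Theta} M_n(\theta)$$
$M_n$: data-dependent criterion function
$M_n \to M$, and $\theta_0 = \arg\max_{\theta \in \Theta} M(\theta)$
Typically $M_n(\theta) = P_n m_\theta$
Analysis steps:
Consistency: $\hat\theta_n \to_p \theta_0$
Rate of convergence: $r_n(\hat\theta_n - \theta_0) = O_p(1)$
Asymptotic distribution: $r_n(\hat\theta_n - \theta_0) \rightsquigarrow Z$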
Z-ESTIMATION
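Special case: $m_\theta$ smooth in the parameter ($P m_{\theta_0} = 0$)
(approx.) solve $P_n m_{\hat\theta_n} = o_p(n^{-1/2})$
Provided that
$$G_n m_{\hat\theta_n} = G_n m_{\theta_0} + o_p(1) \quad \text{(consistency + Donsker property)}$$
we have, using $G_n = \sqrt{n}(P_n - P)$ and $P m_{\theta_0} = 0$,
$$\begin{aligned}
-\sqrt{n}\,(P m_{\hat\theta_n} - P m_{\theta_0}) &= G_n m_{\theta_0} + o_p(1) \\
-V_{\theta_0} \sqrt{n}\,(\hat\theta_n - \theta_0) &= G_n m_{\theta_0} + o_p\big(1 + \|\sqrt{n}(\hat\theta_n - \theta_0)\|\big) \\
\sqrt{n}\,(\hat\theta_n - \theta_0) &= -V_{\theta_0}^{-1} G_n m_{\theta_0} + o_p(1)
\end{aligned}$$
where
$$V_{\theta_0} = \frac{\partial}{\partial\theta} P m_\theta \Big|_{\theta = \theta_0}$$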
Z-ESTIMATION
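Example (Sample median)
$$\hat\theta_n = \arg\min_\theta P_n |X - \theta|$$
or equivalently
$$-1/n \le P_n\,\mathrm{sign}(X - \hat\theta_n) \le 1/n$$
Since $P\,\mathrm{sign}(X - \theta) = 1 - 2F(\theta)$, we get $V_{\theta_0} = -2f(\theta_0)$, and hence
$$\sqrt{n}\,(\hat\theta_n - \theta_0) = (2f(\theta_0))^{-1} G_n\,\mathrm{sign}(X - \theta_0) + o_p(1) \rightsquigarrow N\big(0, (2f(\theta_0))^{-2}\big)$$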
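The limiting variance of the sample median, $(2f(\theta_0))^{-2} = 1/(4 f(\theta_0)^2)$, can be checked by simulation. The sketch below assumes standard normal data (so $\theta_0 = 0$ and $f(\theta_0) = 1/\sqrt{2\pi}$); the sample size, the number of replications, and the function name `scaled_median_errors` are illustrative choices, not taken from the talk.

```python
import math
import random
import statistics

def scaled_median_errors(n, reps, rng):
    """Return sqrt(n) * (sample median - theta_0) over `reps` N(0, 1) samples of size n."""
    theta0 = 0.0  # true median of N(0, 1)
    out = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        out.append(math.sqrt(n) * (statistics.median(sample) - theta0))
    return out

rng = random.Random(1)
errors = scaled_median_errors(n=401, reps=2000, rng=rng)

f0 = 1.0 / math.sqrt(2.0 * math.pi)      # N(0, 1) density at theta_0 = 0 (assumed model)
asymptotic_var = 1.0 / (4.0 * f0 ** 2)   # (2 f(theta_0))^{-2}, which equals pi / 2 here
empirical_var = statistics.variance(errors)
print(empirical_var, asymptotic_var)
```

The empirical variance of the rescaled errors should land near $\pi/2$, up to Monte Carlo noise and a small finite-sample bias.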
M-ESTIMATION
M-Estimation:
Non-smoothness in parameter
Constraint in parameter
Resulting estimator not root-n consistent, i.e.
$$r_n(\hat\theta_n - \theta_0) = O_p(1), \quad \text{where } r_n \ne \sqrt{n}$$
CONSISTENCY
Theorem 1.1 (VW Corollary 3.2.3)
Let $M_n(\theta)$ be a stochastic process indexed by a metric space $\Theta$, and let $M : \Theta \to \mathbb{R}$ be a deterministic function.
(a) Suppose $\|M_n - M\|_\Theta \to_p 0$ and the “true parameter” $\theta_0$ satisfies
$$M(\theta_0) > \sup_{\theta \notin G} M(\theta)$$
for every open set $G$ containing $\theta_0$. Then if $M_n(\hat\theta_n) \ge \sup_\theta M_n(\theta) - o_p(1)$, we have $\hat\theta_n \to_p \theta_0$.
(b) Suppose that $\|M_n - M\|_K \to_p 0$ for every compact $K \subset \Theta$ and that the map $\theta \mapsto M(\theta)$ is upper-semicontinuous with a unique maximum at $\theta_0$. Then the same conclusion is true provided that $\hat\theta_n = O_p(1)$.
CONSISTENCY
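Example (Sample median)
$$\hat\theta_n = \arg\min_\theta P_n |X - \theta|$$
Use Theorem 1.1 (b), applied to $M_n(\theta) = -P_n|X - \theta|$:
Boundedness: $|\hat\theta_n| \le P_n|X - \hat\theta_n| + P_n|X| \le 2P_n|X| = O_p(1)$
Uniform convergence on compacts: $\{|\cdot - \theta| : \theta \in K\}$ is Glivenko-Cantelli
Uniqueness of $\theta_0$ as minimizer of $P|X - \theta|$:
$$\frac{\partial}{\partial\theta} P|X - \theta| = 2F(\theta) - 1, \qquad \frac{\partial^2}{\partial\theta^2} P|X - \theta| = 2f(\theta),$$
so $\theta \mapsto P|X - \theta|$ is strictly convex on the support of $X$
Conclusion: $\hat\theta_n \to_p \theta_0$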
CONSISTENCY
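Theorem 1.2 (Wald)
Let $\theta \mapsto m_\theta(x)$ be upper-semicontinuous for every $x$, and for every sufficiently small ball $U \subset \Theta$
$$P \sup_{\theta \in U} m_\theta < \infty.$$
Then if $\theta_0$ is the unique maximizer of $P m_\theta$, $\hat\theta_n = O_p(1)$ and $P_n m_{\hat\theta_n} \ge P_n m_{\theta_0} - o_p(1)$, we have
$$\hat\theta_n \to_p \theta_0.$$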
DISTRIBUTION
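Suppose $\hat h_n := r_n(\hat\theta_n - \theta_0) = O_p(1)$, then
$$\hat h_n = \arg\max_h\, r_n^2 \big( M_n(\theta_0 + r_n^{-1} h) - M_n(\theta_0) \big) =: \arg\max_h H_n(h)$$
Theorem 1.3 (Argmax)
Suppose that $H_n \rightsquigarrow H$ in $\ell^\infty(K)$ for every compact $K \subset \mathbb{R}$, for a limit process $H$ with continuous sample paths that have unique points of maxima $\hat h$. If $\hat h_n = O_p(1)$ and $H_n(\hat h_n) \ge \sup_h H_n(h) - o_p(1)$, then
$$\hat h_n \rightsquigarrow \hat h.$$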
DISTRIBUTION
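Example (Parametric MLE), with $r_n = \sqrt{n}$:
$$\begin{aligned}
H_n(h) &= r_n^2 P_n\big(\log p_{\theta_0 + r_n^{-1} h} - \log p_{\theta_0}\big) \\
&= \log \prod_{i=1}^n \frac{p_{\theta_0 + h/\sqrt{n}}}{p_{\theta_0}}(X_i) \\
&= G_n h' \dot\ell_{\theta_0} - \frac{1}{2} h' I_{\theta_0} h + o_p(1) \qquad \text{(LAN)} \\
&\rightsquigarrow h'Z - \frac{1}{2} h' I_{\theta_0} h, \qquad Z \sim N(0, I_{\theta_0}) \\
&=: H(h)
\end{aligned}$$
Therefore
$$\sqrt{n}\,(\hat\theta_n - \theta_0) = \hat h_n \rightsquigarrow \arg\max_h H(h) = I_{\theta_0}^{-1} Z \sim N(0, I_{\theta_0}^{-1})$$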
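The last step only uses the fact that the limit process $H$ is a concave quadratic in $h$. A short worked version, assuming $I_{\theta_0}$ is symmetric and positive definite (standard for a Fisher information matrix, though not stated on the slide):

```latex
\begin{align*}
H(h) &= h'Z - \tfrac{1}{2}\, h' I_{\theta_0} h, \\
\nabla_h H(h) &= Z - I_{\theta_0} h = 0
  \quad\Longrightarrow\quad \hat h = I_{\theta_0}^{-1} Z, \\
\operatorname{Var}(\hat h) &= I_{\theta_0}^{-1}\, \operatorname{Var}(Z)\, I_{\theta_0}^{-1}
  = I_{\theta_0}^{-1}\, I_{\theta_0}\, I_{\theta_0}^{-1} = I_{\theta_0}^{-1}.
\end{align*}
```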
DISTRIBUTION
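In general
$$\begin{aligned}
H_n(h) &= r_n^2 P_n\big(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}\big) \\
&= \frac{r_n^2}{\sqrt{n}}\, G_n\big(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}\big) + r_n^2 P\big(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}\big) \\
&= \frac{r_n^2}{\sqrt{n}}\, G_n\big(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}\big) + \frac{1}{2} h' V_{\theta_0} h + o_p(1) \qquad \Big(V_\theta = \frac{\partial^2}{\partial\theta^2} P m_\theta\Big) \\
&\rightsquigarrow G(h) + \frac{1}{2} h' V_{\theta_0} h \qquad (1)
\end{aligned}$$
for some zero-mean Gaussian process $G$.
Note that convergence of the $G_n$ term concerns empirical processes indexed by
$$\mathcal{F}_n := \frac{r_n^2}{\sqrt{n}}\, \mathcal{M}_{K/r_n}, \quad \text{where } \mathcal{M}_\delta = \{m_\theta - m_{\theta_0} : d(\theta, \theta_0) < \delta\}$$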
DISTRIBUTION
Empirical processes on index sets changing with n: VW Section 2.11
If (1) does hold, the variance function of $G$ is given by
$$E\big(G(h) - G(g)\big)^2 = \lim_{n \to \infty} \frac{r_n^4}{n}\, P\big(m_{\theta_0 + h/r_n} - m_{\theta_0 + g/r_n}\big)^2$$
The remaining (key) issue: finding $r_n$
RATE OF CONVERGENCE
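Theorem 1.4 (Rate of Convergence, VW Theorem 3.2.5)
Let $M_n$ be stochastic processes indexed by a semimetric space $\Theta$ and let $M : \Theta \to \mathbb{R}$ be a deterministic function, such that for every $\theta$ in a neighborhood of $\theta_0$,
$$M(\theta) - M(\theta_0) \lesssim -d^2(\theta, \theta_0).$$
Suppose that for sufficiently small $\delta$,
$$E \sup_{d(\theta, \theta_0) < \delta} \big|(M_n - M)(\theta) - (M_n - M)(\theta_0)\big| \lesssim \frac{\phi_n(\delta)}{\sqrt{n}},$$
for functions $\phi_n$ such that $\delta \mapsto \phi_n(\delta)/\delta^\alpha$ is decreasing for some $\alpha < 2$. Let
$$r_n^2\, \phi_n(1/r_n) \le \sqrt{n}, \quad \text{for every } n.$$
If the sequence $\hat\theta_n$ satisfies $M_n(\hat\theta_n) \ge M_n(\theta_0) - O_p(r_n^{-2})$ and converges in probability to $\theta_0$, then $r_n\, d(\hat\theta_n, \theta_0) = O_p(1)$.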
RATE OF CONVERGENCE
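Remark: For an empirical-type criterion function, the modulus condition reads
$$E\|G_n\|_{\mathcal{M}_\delta} \lesssim \phi_n(\delta)$$
Use the maximal inequality
$$E\|G_n\|_{\mathcal{M}_\delta} \lesssim J_{[\,]}(1, \mathcal{M}_\delta, L_2(P))\, (P M_\delta^2)^{1/2},$$
where $M_\delta$ is the envelope function of the class $\mathcal{M}_\delta$, and if
$$J_{[\,]}(1, \mathcal{M}_\delta, L_2(P)) = \int_0^1 \sqrt{1 + \log N_{[\,]}\big(\varepsilon \|M_\delta\|_{P,2}, \mathcal{M}_\delta, L_2(P)\big)}\, d\varepsilon < \infty,$$
uniformly in $\delta$, then take
$$\phi_n(\delta) = (P M_\delta^2)^{1/2}$$
If $\phi_n(\delta) = \delta^\alpha$ for some $\alpha < 2$, then
$$r_n = n^{\frac{1}{2(2 - \alpha)}}$$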
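The rate formula follows from solving $r_n^2\, \phi_n(1/r_n) = \sqrt{n}$ with $\phi_n(\delta) = \delta^\alpha$, which gives $r_n^{2 - \alpha} = n^{1/2}$. The sketch below checks this numerically; the function names `rate_exponent` and `check_balance` are hypothetical helpers introduced only for illustration. The value $\alpha = 1$ reproduces the $\sqrt{n}$ rate, and $\alpha = 1/2$ reproduces the cube-root rate quoted later for current status data.

```python
import math

def rate_exponent(alpha):
    """Exponent k such that r_n = n**k solves r_n**2 * phi(1/r_n) = sqrt(n), phi(d) = d**alpha."""
    if not alpha < 2:
        raise ValueError("the rate theorem requires alpha < 2")
    return 1.0 / (2.0 * (2.0 - alpha))

def check_balance(alpha, n):
    """Return r_n**2 * phi(1/r_n) / sqrt(n); this ratio equals 1 at the solution."""
    r_n = n ** rate_exponent(alpha)
    return r_n ** 2 * (1.0 / r_n) ** alpha / math.sqrt(n)

# alpha = 1 and alpha = 1/2 are the two cases discussed in the talk.
for alpha in (1.0, 0.5):
    print(alpha, rate_exponent(alpha), check_balance(alpha, n=10_000))
```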
RATE OF CONVERGENCE
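Example (Lipschitz in parameter):
Suppose that for every $\theta_1, \theta_2$ in a neighborhood of $\theta_0$,
$$|m_{\theta_1}(x) - m_{\theta_2}(x)| \le \dot m(x)\, \|\theta_1 - \theta_2\|,$$
with $P \dot m^2 < \infty$. Then the envelope can be taken as $M_\delta = \delta\, \dot m$, so
$$\phi_n(\delta) = (P M_\delta^2)^{1/2} \lesssim \delta.$$
This gives
$$r_n = \sqrt{n}$$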
CURRENT STATUS
Interval censoring Case 1 (current status): time-to-event data, examined only once
Example: A cross-sectional antibody test of people of various ages against Hepatitis A virus (Keiding, 1991)
Statistical problem: observe i.i.d. $(U, \delta)$,
$U \sim G$ on $\mathbb{R}_+$
$\delta = I(T \le U)$, $T \sim F$ on $\mathbb{R}_+$, $T \perp U$
Aim: estimate $F$
Method: nonparametric MLE (NPMLE)
CHARACTERIZATION OF MLE
Regularity conditions: $F_0$ and $G$ admit Lebesgue densities $f$ and $g$ respectively
Likelihood:
$$l_n(F) = P_n\big(\delta \log F(U) + (1 - \delta) \log(1 - F(U))\big)$$
NPMLE: denoted by $\hat F_n$
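To make the data structure and the criterion concrete, the sketch below simulates current status data and evaluates $l_n(F)$ at two candidate distribution functions. The exponential choices for the distributions of $T$ and $U$, the clipping constant `eps`, and the names `simulate_current_status` and `log_likelihood` are assumptions made for illustration only.

```python
import math
import random

def simulate_current_status(n, rng, event_rate=1.0, exam_rate=1.0):
    """Return n pairs (U, delta): T ~ Exp(event_rate), U ~ Exp(exam_rate), T independent of U."""
    data = []
    for _ in range(n):
        t = rng.expovariate(event_rate)   # event time T, never observed directly
        u = rng.expovariate(exam_rate)    # examination time U, observed
        data.append((u, 1 if t <= u else 0))
    return data

def log_likelihood(F, data, eps=1e-12):
    """l_n(F) = P_n{delta log F(U) + (1 - delta) log(1 - F(U))}, with F(U) clipped to [eps, 1 - eps]."""
    total = 0.0
    for u, delta in data:
        p = min(max(F(u), eps), 1.0 - eps)
        total += delta * math.log(p) + (1 - delta) * math.log(1.0 - p)
    return total / len(data)

rng = random.Random(2)
data = simulate_current_status(2000, rng)

true_F = lambda u: 1.0 - math.exp(-u)          # the F_0 used in the simulation above
wrong_F = lambda u: 1.0 - math.exp(-5.0 * u)   # a deliberately misspecified candidate
print(log_likelihood(true_F, data), log_likelihood(wrong_F, data))
```

With a sample of this size the criterion at the true $F_0$ should exceed the criterion at the misspecified candidate, which is the population-level fact that the NPMLE exploits.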
CHARACTERIZATION OF MLE
Theorem 2.1 (GW Proposition 1.2)
Re-order the observation times in ascending order such that $U_1 \le \cdots \le U_n$. Let $H_n$ be the greatest convex minorant (GCM) of the points $(i, \sum_{j=1}^{i} \delta_j)$. Then $\hat F_n(U_i)$ is the left derivative of $H_n$ at $i$. Algebraically,
$$\hat F_n(U_i) = \max_{1 \le j \le i}\, \min_{i \le k \le n} \frac{\sum_{m=j}^{k} \delta_m}{k - j + 1}$$
Corollary 2.2
Let $D_n$ denote the right-continuous step function defined by the points $(i/n,\ n^{-1} \sum_{j=1}^{i} \delta_j)$. Then
$$\hat F_n(U_i) \le a \quad \text{iff} \quad \arg\min_s \{D_n(s) - as\} \ge i/n.$$
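The max-min formula can be evaluated directly, and the same values are produced by the pool-adjacent-violators algorithm (PAVA), which computes the left derivatives of the greatest convex minorant of the cumulative sum diagram. The sketch below implements both and compares them on a small hand-made indicator sequence. It assumes the indicators are already sorted by examination time and that there are no ties; the function names and the example data are illustrative, not from the talk.

```python
def npmle_maxmin(deltas):
    """Max-min formula: fit[i] = max_{j<=i} min_{k>=i} mean(deltas[j..k]) (1-based indices)."""
    n = len(deltas)
    prefix = [0]
    for d in deltas:
        prefix.append(prefix[-1] + d)   # prefix[k] = delta_1 + ... + delta_k
    fit = []
    for i in range(1, n + 1):
        best = max(
            min((prefix[k] - prefix[j - 1]) / (k - j + 1) for k in range(i, n + 1))
            for j in range(1, i + 1)
        )
        fit.append(best)
    return fit

def npmle_pava(deltas):
    """Pool-adjacent-violators: nondecreasing least-squares fit to the indicator sequence."""
    blocks = []  # each block is [sum of deltas, number of observations]
    for d in deltas:
        blocks.append([d, 1])
        # Merge while the previous block mean exceeds the last block mean
        # (compared by cross-multiplication to stay in exact integer arithmetic).
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

# Hypothetical indicators delta_1, ..., delta_8, ordered by examination time.
deltas = [0, 1, 0, 0, 1, 1, 0, 1]
fit_maxmin = npmle_maxmin(deltas)
fit_pava = npmle_pava(deltas)
print(fit_maxmin)
print(fit_pava)
```

The direct formula costs on the order of $n^3$ operations as written, while PAVA runs in linear time, so the latter is the practical choice for larger samples.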
CONSISTENCY OF MLE
Theorem 2.3 (Consistency of $\hat F_n(t)$)
Fix $t$ and assume that $f(t), g(t) > 0$. Then
$$\hat F_n(t) \to_p F_0(t)$$
Proof. See Example 5.17 ([V], p. 49) for a proof based on Wald's consistency theorem (Theorem 1.2), with $(\Theta, d)$ = the space of distribution functions equipped with the weak topology.
DISTRIBUTION OF MLE
Theorem 2.4 (Groeneboom, 1987)
Fix $t$ and assume that $f(t), g(t) > 0$. Then
$$n^{1/3}\{\hat F_n(t) - F_0(t)\} \rightsquigarrow \left( \frac{4 F_0(t)\,(1 - F_0(t))\, f(t)}{g(t)} \right)^{1/3} \arg\min_h \{Z(h) + h^2\},$$
where $Z$ is a two-sided Brownian motion process originating from zero.
Proof. We first use Theorem 1.4 and the subsequent Remark to establish that $r_n = n^{1/3}$, and then use the Argmax Theorem (Theorem 1.3) to find the asymptotic distribution. See Example 3.2.15 ([VW], p. 298) for details.
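The rate step of the proof can be sketched as follows, assuming the modulus of continuity takes the form $\phi_n(\delta) = \delta^{1/2}$ for this problem. This form is an assumption stated here for illustration; the cited example should be consulted for the actual derivation.

```latex
\begin{align*}
r_n^2\, \phi_n(1/r_n) = r_n^{2}\, r_n^{-1/2} = r_n^{3/2} &\le \sqrt{n}
  \quad\Longrightarrow\quad r_n = n^{1/3}, \\
\text{equivalently}\quad r_n = n^{\frac{1}{2(2 - \alpha)}} \Big|_{\alpha = 1/2} &= n^{1/3}.
\end{align*}
```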
References
Groeneboom, P. (1987). Asymptotics for interval censored observations. Report, 87, 18.
[GW] Groeneboom, P., & Wellner, J. A. (1992). Information bounds and nonparametric maximum likelihood estimation (Vol. 19). Springer.
[V] Van der Vaart, A. W. (2000). Asymptotic statistics (Vol. 3). Cambridge University Press.
[VW] Van der Vaart, A. W., & Wellner, J. A. (1996). Weak Convergence and Empirical Processes.