
Rates of Convergence in M-Estimation

With an Example from Current Status Data

Lu Mao

Department of Biostatistics
The University of North Carolina at Chapel Hill

Email: [email protected]

Lu Mao (UNC CH) Dec 4, 2013 1 / 23

1 M-Estimation
  Motivation
  Z-Estimation
  Consistency
  Distribution
  Rate of convergence

2 Example: Current Status Data
  Description of problem
  Characterization of MLE
  Consistency and distributional results


MOTIVATION

Examples:

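Parametric model: $\{p_\theta(x) : \theta \in \Theta\}$

$$\theta_0 = \arg\max_{\theta \in \Theta} P \log p_\theta$$

Regression: $Y = g_0(X) + \varepsilon$, $E(\varepsilon \mid X) = 0$, $g_0 \in \mathcal{G}$

$$g_0 = \arg\min_{g \in \mathcal{G}} P (Y - g(X))^2$$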

Natural estimators:

MLE: $\hat\theta = \arg\max_{\theta \in \Theta} P_n \log p_\theta$

LSE: $\hat g = \arg\min_{g \in \mathcal{G}} P_n (Y - g(X))^2$
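As a minimal sketch of the MLE viewed as an M-estimator, the snippet below maximizes the empirical criterion $P_n \log p_\theta$ numerically and compares the result with the closed-form maximizer. The exponential model $p_\theta(x) = \theta e^{-\theta x}$, the sample size, the seed, and the helper names `empirical_criterion` and `argmax_golden` are illustrative assumptions, not part of the slides.

```python
import math
import random

def empirical_criterion(theta, xs):
    """P_n log p_theta for the (assumed) exponential density theta * exp(-theta * x)."""
    return sum(math.log(theta) - theta * x for x in xs) / len(xs)

def argmax_golden(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d  # the maximizer lies in [a, d]
        else:
            a = c  # the maximizer lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2

rng = random.Random(0)
theta0 = 2.0  # hypothetical true rate
xs = [rng.expovariate(theta0) for _ in range(2000)]

# M-estimator: maximize the empirical criterion over Theta = [0.01, 20].
theta_hat = argmax_golden(lambda t: empirical_criterion(t, xs), 0.01, 20.0)
# For this model the maximizer is also available in closed form: 1 / sample mean.
closed_form = 1.0 / (sum(xs) / len(xs))
print(theta_hat, closed_form)
```

The search relies on the criterion being concave in $\theta$, which holds for this particular model; other models may need a different optimizer.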


INTRODUCTION

M-estimator: $\hat\theta_n = \arg\max_{\theta \in \Theta} M_n(\theta)$

$M_n$: data-dependent criterion function
$M_n \to M$, and $\theta_0 = \arg\max_{\theta \in \Theta} M(\theta)$
Typically $M_n(\theta) = P_n m_\theta$

Analysis steps:

Consistency: $\hat\theta_n \to_p \theta_0$

Rate of convergence: $r_n(\hat\theta_n - \theta_0) = O_p(1)$

Asymptotic distribution: $r_n(\hat\theta_n - \theta_0) \rightsquigarrow Z$


Z-ESTIMATION

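Special case: $m_\theta$ smooth in the parameter ($P m_{\theta_0} = 0$)

(Approximately) solve $P_n m_{\hat\theta_n} = o_p(n^{-1/2})$

Provided that

$$\mathbb{G}_n m_{\hat\theta_n} = \mathbb{G}_n m_{\theta_0} + o_p(1) \quad \text{(consistency + Donskerness)}$$

we have

$$\begin{aligned}
-\sqrt{n}\,(P m_{\hat\theta_n} - P m_{\theta_0}) &= \mathbb{G}_n m_{\theta_0} + o_p(1) \\
-V_{\theta_0} \sqrt{n}(\hat\theta_n - \theta_0) &= \mathbb{G}_n m_{\theta_0} + o_p\big(1 + \|\sqrt{n}(\hat\theta_n - \theta_0)\|\big) \\
\sqrt{n}(\hat\theta_n - \theta_0) &= -V_{\theta_0}^{-1} \mathbb{G}_n m_{\theta_0} + o_p(1)
\end{aligned}$$

where

$$V_{\theta_0} = \frac{\partial}{\partial\theta} P m_\theta \Big|_{\theta = \theta_0}$$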


Z-ESTIMATION

Example (Sample median)

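$$\hat\theta_n = \arg\min_\theta P_n |X - \theta|$$

or equivalently $-1/n \le P_n \operatorname{sign}(X - \hat\theta_n) \le 1/n$

Since $P \operatorname{sign}(X - \theta) = 1 - 2F(\theta)$, we have $V_{\theta_0} = -2f(\theta_0)$, so that

$$\sqrt{n}(\hat\theta_n - \theta_0) = (2f(\theta_0))^{-1} \mathbb{G}_n \operatorname{sign}(X - \theta_0) + o_p(1) \rightsquigarrow N\big(0, (2f(\theta_0))^{-2}\big)$$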
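Because $\operatorname{sign}(X - \theta_0)$ has variance 1, the linear representation above implies an asymptotic variance of $1/(4 f(\theta_0)^2)$ for $\sqrt{n}(\hat\theta_n - \theta_0)$. The following is a small Monte Carlo sketch of that claim. The standard normal model (so $\theta_0 = 0$ and the target variance is $\pi/2$), the sample size, the number of replications, the seed, and the function name `scaled_median_errors` are all assumptions made for illustration.

```python
import math
import random
import statistics

def scaled_median_errors(n, reps, rng):
    """Return sqrt(n) * (sample median - theta0) for `reps` samples of size n from N(0, 1)."""
    theta0 = 0.0  # population median of the assumed N(0, 1) model
    errors = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        errors.append(math.sqrt(n) * (statistics.median(sample) - theta0))
    return errors

rng = random.Random(2013)
errors = scaled_median_errors(n=400, reps=2000, rng=rng)

simulated_var = statistics.pvariance(errors)
f0 = 1.0 / math.sqrt(2.0 * math.pi)      # N(0, 1) density at theta0 = 0
asymptotic_var = 1.0 / (4.0 * f0 ** 2)   # (2 f(theta0))^{-2}, which is pi / 2 here
print(simulated_var, asymptotic_var)
```

With these settings the simulated variance should land close to $\pi/2 \approx 1.57$, up to Monte Carlo and finite-sample error.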


M-ESTIMATION

M-Estimation

Non-smoothness in parameter
Constraint in parameter
Resulting estimator not root-n consistent, i.e.

$$r_n(\hat\theta_n - \theta_0) = O_p(1), \quad \text{where } r_n \ne \sqrt{n}$$


CONSISTENCY

Theorem 1.1 (VW Corollary 3.2.3)

Let $M_n(\theta)$ be a stochastic process indexed by a metric space $\Theta$, and let $M : \Theta \to \mathbb{R}$ be a deterministic function.

(a) Suppose $\|M_n - M\|_\Theta \to_p 0$ and the "true parameter" $\theta_0$ satisfies

$$M(\theta_0) > \sup_{\theta \notin G} M(\theta)$$

for every open set $G$ containing $\theta_0$. Then if $M_n(\hat\theta_n) \ge \sup_\theta M_n(\theta) - o_p(1)$, we have $\hat\theta_n \to_p \theta_0$.

(b) Suppose that $\|M_n - M\|_K \to_p 0$ for every compact $K \subset \Theta$ and that the map $\theta \mapsto M(\theta)$ is upper-semicontinuous with a unique maximum at $\theta_0$. Then the same conclusion is true provided that $\hat\theta_n = O_p(1)$.


CONSISTENCY

Example (Sample median)

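$$\hat\theta_n = \arg\min_\theta P_n |X - \theta|$$

Use Theorem 1.1 (b):

$|\hat\theta_n| \le P_n|X - \hat\theta_n| + P_n|X| \le 2P_n|X| = O_p(1)$
$\{|\cdot - \theta| : \theta \in K\}$ is Glivenko-Cantelli
Uniqueness of $\theta_0$ as minimizer of $P|X - \theta|$:

$$\frac{\partial}{\partial\theta} P|X - \theta| = 2F(\theta) - 1, \qquad \frac{\partial^2}{\partial\theta^2} P|X - \theta| = 2f(\theta),$$

so $\theta \mapsto P|X - \theta|$ is strictly convex on the support of $X$

Conclusion: $\hat\theta_n \to_p \theta_0$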


CONSISTENCY

Theorem 1.2 (Wald)

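Let $\theta \mapsto m_\theta(x)$ be upper-semicontinuous for every $x$, and for every sufficiently small ball $U \subset \Theta$,

$$P \sup_{\theta \in U} m_\theta < \infty.$$

Then if $\theta_0$ is the unique maximizer of $P m_\theta$, $\hat\theta_n = O_p(1)$, and $P_n m_{\hat\theta_n} \ge P_n m_{\theta_0} - o_p(1)$, we have

$$\hat\theta_n \to_p \theta_0.$$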


DISTRIBUTION

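Suppose $\hat h_n := r_n(\hat\theta_n - \theta_0) = O_p(1)$; then

$$\hat h_n = \arg\max_h \, r_n^2 \big(M_n(\theta_0 + r_n^{-1} h) - M_n(\theta_0)\big) =: \arg\max_h H_n(h)$$

Theorem 1.3 (Argmax)

Suppose that $H_n \rightsquigarrow H$ in $\ell^\infty(K)$ for every compact $K \subset \mathbb{R}$, for a limit process $H$ with continuous sample paths that have unique points of maxima $\hat h$. If $\hat h_n = O_p(1)$ and $H_n(\hat h_n) \ge \sup_h H_n(h) - o_p(1)$, then

$$\hat h_n \rightsquigarrow \hat h.$$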


DISTRIBUTION

Example (Parametric MLE):

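$$\begin{aligned}
H_n(h) &= r_n^2 P_n\big(\log p_{\theta_0 + r_n^{-1} h} - \log p_{\theta_0}\big) \\
&= \log \prod_{i=1}^n \frac{p_{\theta_0 + h/\sqrt{n}}}{p_{\theta_0}}(X_i) \\
&= \mathbb{G}_n h' \dot\ell_{\theta_0} - \tfrac{1}{2} h' I_{\theta_0} h + o_p(1) \qquad \text{(LAN)} \\
&\rightsquigarrow h'Z - \tfrac{1}{2} h' I_{\theta_0} h, \qquad Z \sim N(0, I_{\theta_0}) \\
&=: H(h)
\end{aligned}$$

Therefore

$$\sqrt{n}(\hat\theta_n - \theta_0) = \hat h_n \rightsquigarrow \arg\max_h H(h) = I_{\theta_0}^{-1} Z \sim N(0, I_{\theta_0}^{-1})$$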
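The last step can be spelled out as a short worked computation. This is a sketch that assumes $I_{\theta_0}$ is symmetric and positive definite, so that the quadratic limit $H(h) = h'Z - \tfrac12 h' I_{\theta_0} h$ is strictly concave and its maximizer is unique, written $\hat h$ here:

```latex
\nabla_h H(h) = Z - I_{\theta_0} h = 0
\;\Longrightarrow\;
\hat h = I_{\theta_0}^{-1} Z,
\qquad
\operatorname{Var}(\hat h)
  = I_{\theta_0}^{-1}\, \operatorname{Var}(Z)\, I_{\theta_0}^{-1}
  = I_{\theta_0}^{-1}\, I_{\theta_0}\, I_{\theta_0}^{-1}
  = I_{\theta_0}^{-1}.
```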


DISTRIBUTION

In general

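$$\begin{aligned}
H_n(h) &= r_n^2 P_n\big(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}\big) \\
&= \frac{r_n^2}{\sqrt{n}} \mathbb{G}_n\big(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}\big) + r_n^2 P\big(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}\big) \\
&= \frac{r_n^2}{\sqrt{n}} \mathbb{G}_n\big(m_{\theta_0 + r_n^{-1} h} - m_{\theta_0}\big) + \tfrac{1}{2} h' V_{\theta_0} h + o_p(1) \qquad \Big(V_\theta = \frac{\partial^2}{\partial\theta^2} P m_\theta\Big) \\
&\rightsquigarrow G(h) + \tfrac{1}{2} h' V_{\theta_0} h \qquad (1)
\end{aligned}$$

for some zero-mean Gaussian process $G$.

Note that convergence of the $\mathbb{G}_n$ term concerns empirical processes indexed by

$$\mathcal{F}_n := \frac{r_n^2}{\sqrt{n}} \mathcal{M}_{K/r_n}, \quad \text{where } \mathcal{M}_\delta = \{m_\theta - m_{\theta_0} : d(\theta, \theta_0) < \delta\}$$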


DISTRIBUTION

Empirical processes on index sets changing with n: VW Section 2.11

If (1) does hold, the variance function of $G$ is given by

$$E(G(h) - G(g))^2 = \lim_{n \to \infty} \frac{r_n^4}{n} P\big(m_{\theta_0 + h/r_n} - m_{\theta_0 + g/r_n}\big)^2$$

The remaining (key) issue: finding $r_n$


RATE OF CONVERGENCE

Theorem 1.4 (Rate of Convergence, VW Theorem 3.2.5)

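Let $M_n$ be stochastic processes indexed by a semimetric space $\Theta$ and let $M : \Theta \to \mathbb{R}$ be a deterministic function, such that for every $\theta$ in a neighborhood of $\theta_0$,

$$M(\theta) - M(\theta_0) \lesssim -d^2(\theta, \theta_0).$$

Suppose that for sufficiently small $\delta$,

$$E \sup_{d(\theta, \theta_0) < \delta} \big|(M_n - M)(\theta) - (M_n - M)(\theta_0)\big| \lesssim \frac{\phi_n(\delta)}{\sqrt{n}},$$

for functions $\phi_n$ such that $\delta \mapsto \phi_n(\delta)/\delta^\alpha$ is decreasing for some $\alpha < 2$. Let

$$r_n^2 \phi_n(1/r_n) \le \sqrt{n}, \quad \text{for every } n.$$

If the sequence $\hat\theta_n$ satisfies $M_n(\hat\theta_n) \ge M_n(\theta_0) - O_p(r_n^{-2})$ and converges in probability to $\theta_0$, then $r_n d(\hat\theta_n, \theta_0) = O_p(1)$.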


RATE OF CONVERGENCE

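Remark: For an empirical-type criterion function, the modulus condition reads

$$E\|\mathbb{G}_n\|_{\mathcal{M}_\delta} \lesssim \phi_n(\delta)$$

Use the maximal inequality

$$E\|\mathbb{G}_n\|_{\mathcal{M}_\delta} \lesssim J_{[\,]}(1, \mathcal{M}_\delta, L_2(P)) \, (P M_\delta^2)^{1/2},$$

where $M_\delta$ is the envelope function of $\mathcal{M}_\delta$, and if

$$J_{[\,]}(1, \mathcal{M}_\delta, L_2(P)) = \int_0^1 \sqrt{1 + \log N_{[\,]}\big(\varepsilon \|M_\delta\|_{P,2}, \mathcal{M}_\delta, L_2(P)\big)} \, d\varepsilon < \infty,$$

uniformly in $\delta$, then take

$$\phi_n(\delta) = (P M_\delta^2)^{1/2}$$

If $\phi_n(\delta) = \delta^\alpha$ for some $\alpha < 2$, then

$$r_n = n^{\frac{1}{2(2 - \alpha)}}$$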
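The rate formula comes from solving $r_n^2 \phi_n(1/r_n) = \sqrt{n}$. The sketch below solves that equation numerically for a user-supplied modulus and checks it against the closed form. The function name `rate_from_modulus`, the bisection bracket, and the two choices of $\alpha$ are illustrative assumptions; $\alpha = 1$ corresponds to the root-n case, and $\alpha = 1/2$ gives the cube-root rate $n^{1/3}$ quoted for the current status example.

```python
import math

def rate_from_modulus(phi, n, lo=1.0, hi=1e12, iters=200):
    """Solve r^2 * phi(1/r) = sqrt(n) for r by bisection on a log scale.

    Assumes r -> r^2 * phi(1/r) is increasing on [lo, hi], which holds for
    phi(delta) = delta**alpha with alpha < 2.
    """
    target = math.sqrt(n)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        if mid * mid * phi(1.0 / mid) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def closed_form_rate(alpha, n):
    """r_n = n^{1 / (2 (2 - alpha))} for phi(delta) = delta**alpha."""
    return n ** (1.0 / (2.0 * (2.0 - alpha)))

n = 10**6
r_root_n = rate_from_modulus(lambda d: d, n)                # alpha = 1
r_cube_root = rate_from_modulus(lambda d: math.sqrt(d), n)  # alpha = 1/2
print(r_root_n, closed_form_rate(1.0, n))
print(r_cube_root, closed_form_rate(0.5, n))
```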


RATE OF CONVERGENCE

Example (Lipschitz in parameter):

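If for every $\theta_1, \theta_2$ in a neighborhood of $\theta_0$,

$$|m_{\theta_1} - m_{\theta_2}| \le m(x) \|\theta_1 - \theta_2\|,$$

with $P m^2(x) < \infty$, then

$$\phi_n(\delta) = (P M_\delta^2)^{1/2} \lesssim \delta.$$

This gives $r_n = \sqrt{n}$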
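The bound on $\phi_n$ can be traced in one line. This is a sketch that assumes $d$ is the Euclidean distance and that the bracketing integral in the Remark is bounded uniformly in $\delta$; the envelope $M_\delta$ is taken to be the pointwise supremum over the $\delta$-ball:

```latex
M_\delta(x) = \sup_{\|\theta - \theta_0\| < \delta} |m_\theta(x) - m_{\theta_0}(x)| \le m(x)\,\delta
\;\Longrightarrow\;
(P M_\delta^2)^{1/2} \le \delta\,(P m^2)^{1/2} \lesssim \delta,
\qquad\text{so } \alpha = 1 \text{ and } r_n = n^{\frac{1}{2(2-1)}} = \sqrt{n}.
```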


CURRENT STATUS

Interval censoring Case 1 (current status): time-to-event data, examined only once

Example: A cross-sectional antibody test of people of various ages against Hepatitis A virus (Keiding, 1991)

Statistical problem: observe i.i.d. $(U, \delta)$,

$U \sim G$ on $\mathbb{R}_+$
$\delta = I(T \le U)$, $T \sim F$ on $\mathbb{R}_+$, $T \perp U$
Aim: estimate $F$

Method: nonparametric MLE (NPMLE)


CHARACTERIZATION OF MLE

Regularity conditions: $F_0$ and $G$ admit Lebesgue densities $f$ and $g$, respectively

Log-likelihood:

$$l_n(F) = P_n\big(\delta \log F(U) + (1 - \delta) \log(1 - F(U))\big)$$

NPMLE: denoted by $\hat F_n$


CHARACTERIZATION OF MLE

Theorem 2.1 (GW Proposition 1.2)

Re-order the observation times in ascending order such that $U_1 \le \cdots \le U_n$. Let $H_n$ be the greatest convex minorant (GCM) of the points $\big(i, \sum_{j=1}^i \delta_j\big)$. Then $\hat F_n(U_i)$ is the left derivative of $H_n$ at $i$. Algebraically,

$$\hat F_n(U_i) = \max_{1 \le j \le i} \, \min_{i \le k \le n} \frac{\sum_{m=j}^k \delta_m}{k - j + 1}$$
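Both characterizations in Theorem 2.1 can be checked against each other on a toy data set. The sketch below computes $\hat F_n(U_i)$ once by brute force from the max-min formula and once as the slopes of the lower convex hull of the cumulative sum diagram, with the starting point $(0, 0)$ included. The function names and the eight-point indicator vector are made up for illustration, and exact rational arithmetic is used so the two answers can be compared with `==`.

```python
from fractions import Fraction

def npmle_maxmin(delta):
    """F_n(U_i) from the max-min formula; delta is ordered by ascending U."""
    n = len(delta)
    prefix = [0]
    for d in delta:
        prefix.append(prefix[-1] + d)  # prefix[k] = delta_1 + ... + delta_k
    fitted = []
    for i in range(1, n + 1):
        best = None
        for j in range(1, i + 1):
            inner = min(Fraction(prefix[k] - prefix[j - 1], k - j + 1)
                        for k in range(i, n + 1))
            best = inner if best is None else max(best, inner)
        fitted.append(best)
    return fitted

def npmle_gcm(delta):
    """Left derivatives of the greatest convex minorant of the cumulative sum diagram."""
    points = [(0, 0)]
    for i, d in enumerate(delta, start=1):
        points.append((i, points[-1][1] + d))
    hull = []  # vertices of the lower convex hull, left to right
    for p in points:
        # Pop the last vertex while the slope into it is >= the slope out of it.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (p[0] - x2) >= (p[1] - y2) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(p)
    fitted = []
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        fitted.extend([Fraction(y2 - y1, x2 - x1)] * (x2 - x1))
    return fitted

delta = [0, 1, 0, 0, 1, 1, 0, 1]  # hypothetical indicators, already sorted by U
print(npmle_maxmin(delta))
print(npmle_gcm(delta))
```

The brute-force version costs on the order of $n^3$ operations and is only meant as a check; the hull-based version runs in linear time once the data are sorted.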

Corollary 2.2

Denote by $D_n$ the right-continuous step function defined by the points $\big(i/n, \, n^{-1} \sum_{j=1}^i \delta_j\big)$; then

$$\hat F_n(U_i) \le a \iff \arg\min_s \{D_n(s) - as\} \ge i/n.$$


CONSISTENCY OF MLE

Theorem 2.3 (Consistency of $\hat F_n(t)$)

Fix $t$ and assume that $f(t), g(t) > 0$. Then

$$\hat F_n(t) \to_p F_0(t)$$

Proof. See Example 5.17 ([V], p. 49) for a proof based on Wald's consistency theorem (Theorem 1.2), with $(\Theta, d)$ = the space of distribution functions equipped with the weak topology.


DISTRIBUTION OF MLE

Theorem 2.4 (Groeneboom, 1987)

Fix $t$ and assume that $f(t), g(t) > 0$. Then

$$n^{1/3}\{\hat F_n(t) - F_0(t)\} \rightsquigarrow \left(\frac{4 F_0(t)(1 - F_0(t)) f(t)}{g(t)}\right)^{1/3} \arg\min_h \{Z(h) + h^2\},$$

where $Z$ is a two-sided Brownian motion process originating from zero.

Proof. We first use Theorem 1.4 and the subsequent Remark to establish that $r_n = n^{1/3}$, and then use the Argmax Theorem (Theorem 1.3) to find the asymptotic distribution. See Example 3.2.15 ([VW], p. 298) for details.
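To make the scaling in Theorem 2.4 concrete, the sketch below simulates one current status data set, computes the NPMLE at a fixed $t$, and evaluates the constant $(4F_0(1 - F_0)f/g)^{1/3}$ at that point. Everything specific here is an assumption chosen for illustration: $T \sim \text{Exp}(1)$, $U \sim \text{Uniform}(0, 2)$, $t = 1$, $n = 2000$, the seed, and the helper names. The pool-adjacent-violators algorithm is swapped in as a standard way to obtain the GCM slopes; it is not taken from the slides.

```python
import math
import random

def pava(values):
    """Nondecreasing least-squares fit by pool-adjacent-violators."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge while the previous block mean exceeds the last block mean.
        while (len(blocks) >= 2
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def npmle_at(t, n, rng):
    """Simulate current status data and return the NPMLE evaluated at t."""
    data = []
    for _ in range(n):
        u = rng.uniform(0.0, 2.0)     # assumed examination time, g(t) = 1/2 on (0, 2)
        x = rng.expovariate(1.0)      # assumed event time, F_0(t) = 1 - exp(-t)
        data.append((u, 1 if x <= u else 0))
    data.sort()
    fitted = pava([d for _, d in data])
    # Step-function convention: use the fitted value at the largest U_i <= t.
    value = 0.0
    for (u, _), fit in zip(data, fitted):
        if u > t:
            break
        value = fit
    return value

rng = random.Random(1)
t, n = 1.0, 2000
F0, f, g = 1.0 - math.exp(-t), math.exp(-t), 0.5
scale = (4.0 * F0 * (1.0 - F0) * f / g) ** (1.0 / 3.0)  # constant in Theorem 2.4
estimate = npmle_at(t, n, rng)
scaled_error = n ** (1.0 / 3.0) * (estimate - F0)
print(estimate, F0, scaled_error, scale)
```

A single draw only illustrates the order of magnitude: `scaled_error` should be comparable in size to `scale`, rather than shrinking or growing with $n$.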


References

Groeneboom, P. (1987). Asymptotics for interval censored observations. Report, 87, 18

[GW] Groeneboom, P., & Wellner, J. A. (1992). Information bounds and nonparametric maximum likelihood estimation (Vol. 19). Springer.

[V] Van der Vaart, A. W. (2000). Asymptotic statistics (Vol. 3). Cambridge University Press.

[VW] Van der Vaart, A. W., & Wellner, J. A. (1996). Weak Convergence and Empirical Processes.
