Robustness of the half-space median



Elsevier, Journal of Statistical Planning and Inference 46 (1995) 175-181

Robustness of the half-space median

Zhiqiang Chen 1

Department of Mathematics, William Paterson College, Wayne, NJ 07470, USA

Received 15 November 1993; revised 10 August 1994

Abstract

In this note, we extend Donoho and Gasko's (Ann. Statist. 20 (1992) 1803-1827) results on finite sample breakdown point for the half-space median to any given data set and, as a consequence, we obtain bounds for the limiting breakdown point for general (in particular, nonsymmetric) distributions. Lower and upper bounds for the breakdown point of the half-space median with respect to the half-space metric are also established; in two-dimensional space, they yield the exact breakdown point. The exact 'gross error neighborhood' breakdown point for symmetric distributions is also given.

AMS Subject Classifications: Primary 62F35, 62H12.

Key words: Half-space median; Sample half-space median; Breakdown point; Robustness

1. Introduction

There are several quantitative measures of the robustness of a location estimator.

Perhaps the most popular are the finite sample breakdown points, obtained by enlarging the sample with some contamination points (or, respectively, replacing some sample points with 'bad' points) and then considering the smallest proportion of contamination points (resp. replacements) needed to upset the estimator (e.g. see Donoho and Gasko, 1992 and, resp., Lopuhaä and Rousseeuw, 1991). Davies (1992) argues that, although the finite sample breakdown point is easy for non-statisticians to obtain and understand, a breakdown point defined via some equivariant metric on all probability measures is preferable, and he suggests that the distance should be chosen according to the problem.

1 Research partially supported by NSF Grant No. DMS-9300725 and by The University of Connecticut Research Foundation Grant No. 441092.

0378-3758/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0378-3758(94)00105-7


Let us recall the definition of the half-space median. Let $\mathcal{P}$ be the set of all probability measures on $\mathbb{R}^d$. For any $P \in \mathcal{P}$, define

$$\pi_P(x) = \inf_{u \in S^{d-1}} PH(u, x),$$

where $H(u,x) := \{y : u^T y \le u^T x\}$ and $S^{d-1} := \{u \in \mathbb{R}^d : |u| = 1\}$. Let $M_P := \{\arg\max \pi_P(x)\}$. Small (1987) showed that $M_P$ is a non-void convex compact set. The half-space median will be any choice of an element $T(P)$ out of the set $M_P$. Then the empirical half-space median $T(P_n)$ is a natural estimator of the 'true' median. ($T(P)$ can be any selection from $\{\arg\max \pi_P(x)\}$ in this note but, in order to let the median keep the affine equivariant property, one can take an 'average' in $\{\arg\max \pi_P(x)\}$, as Donoho and Gasko proposed.)

The formal definition of the finite sample (enlargement) breakdown point of a location estimator is as follows. Let $X^{(n)}$ be a given data set of size $n$. Let $T$ be the estimator of interest. Consider adjoining to $X^{(n)}$ another data set $Y^{(m)}$ of size $m$. The breakdown point $\varepsilon_{f,e}(T, X^{(n)})$ is

$$\varepsilon_{f,e}(T, X^{(n)}) := \min\left\{ \frac{m}{n+m} : \sup_{Y^{(m)}} \left| T(X^{(n)} \cup Y^{(m)}) - T(X^{(n)}) \right| = \infty \right\}.$$

Donoho and Gasko (1992) proved the following result: If the probability measure $P$ is absolutely continuous (with respect to Lebesgue measure) and centrosymmetric, then

$$\lim_{n \to \infty} \varepsilon_{f,e}(T, X^{(n)}) = \frac{1}{3} \quad \text{a.s.}$$

In this note, we extend Donoho and Gasko's breakdown result to the general case and establish sharp upper and lower bounds for the breakdown point of the half-space median in Davies' sense, which originated in Huber (1981). We also get the exact breakdown points for the half-space median in the symmetric distribution case using either contamination neighborhoods or gross error neighborhoods.
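To make the definition of the depth function $\pi_P$ concrete, the following is a minimal Python sketch for an empirical measure in the plane ($d = 2$). The function names `halfspace_depth_2d` and `deepest_data_point` are illustrative and do not come from the paper. The sketch relies on the observation that the mass of the closed half-plane $H(u, x)$ changes only when $u$ is perpendicular to some $X_i - x$, so it suffices to test one direction between each pair of consecutive critical angles. Restricting the search for a maximizer to the data points is a simplification, since the half-space median need not be a data point.

```python
import math

def halfspace_depth_2d(x, points):
    """Empirical half-space depth of x in the plane.

    Returns the minimum, over unit vectors u, of the fraction of points p
    with u . (p - x) <= 0, i.e. the empirical mass of H(u, x).
    """
    n = len(points)
    diffs = [(px - x[0], py - x[1]) for px, py in points]
    # Directions perpendicular to some p - x, as angles in [0, 2*pi).
    critical = sorted({
        (math.atan2(dy, dx) + s * math.pi / 2) % (2 * math.pi)
        for dx, dy in diffs if (dx or dy)
        for s in (1, -1)
    })
    if not critical:  # every point coincides with x
        return 1.0
    best = n
    # The closed half-plane count can only jump up at a critical angle,
    # so the minimum is attained between two consecutive critical angles.
    for a, b in zip(critical, critical[1:] + [critical[0] + 2 * math.pi]):
        t = (a + b) / 2
        ux, uy = math.cos(t), math.sin(t)
        count = sum(1 for dx, dy in diffs if ux * dx + uy * dy <= 0)
        best = min(best, count)
    return best / n

def deepest_data_point(points):
    """Data point of largest depth (a crude stand-in for the median)."""
    return max(points, key=lambda p: halfspace_depth_2d(p, points))

# Illustrative data: the four corners of a square plus its centre.
square = [(1, 1), (1, -1), (-1, 1), (-1, -1), (0, 0)]
centre_depth = halfspace_depth_2d((0, 0), square)
corner_depth = halfspace_depth_2d((1, 1), square)
```

In this example every closed half-plane through the centre contains the centre and at least two corners, while a corner can be cut off on its own, which is why the centre is the deepest of the five points.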

2. Results

Let us first give a lemma.

Lemma 1. $V = \{v \in S^{d-1} : P_n H(v, 0) = \inf_u P_n H(u, 0)\}$ is a non-void relatively open set.

Proof. Since $P_n H(u, 0)$ has finitely many different values, $\inf_u P_n H(u, 0)$ is attained at some $v \in S^{d-1}$. Let $X_v^{(n)} := \{X_i \in X^{(n)} : X_i \notin H(v, 0)\}$, for $v \in V$. Then $\min_{X_i \in X_v^{(n)}} v^T X_i =: \alpha > 0$. By the continuity of the inner product and the fact that there are only finitely many points in $X_v^{(n)}$, there exists a relatively open neighborhood $U_v$ of $v$ such that

$$\inf_{w \in U_v} \min_{X_i \in X_v^{(n)}} w^T X_i > \frac{\alpha}{2}.$$


Therefore, $H(w, 0) \cap X_v^{(n)} = \emptyset$ for any $w \in U_v$; hence $P_n H(w, 0) \le P_n H(v, 0)$, which implies that $w \in V$. □

Proposition 1. Let $X^{(n)}$ be any data set of size $n$ on $\mathbb{R}^d$ with $d \ge 2$ and let $P_n$ be the empirical measure based on $X^{(n)}$. Then

$$\frac{\pi_{P_n}(x_0)}{1 + \pi_{P_n}(x_0)} \le \varepsilon_{f,e}(T, X^{(n)}) \le \frac{\max_{u \in S^{d-1}} P_n H(u, x_0)}{1 + \max_{u \in S^{d-1}} P_n H(u, x_0)},$$

where $x_0 \in \{\arg\max \pi_{P_n}(x)\}$.

Proof. For the lower bound, the proof is identical to that of Donoho and Gasko's (1992) Proposition 3.3, so we will only prove the upper bound here. Since the half-space median is affine equivariant, w.l.o.g. let $x_0 = 0$. Place $m$ contamination points on the same site $y$, where $y$ is outside the convex hull of $X^{(n)}$ and $y \in \partial H(v, 0)$ for some $v \in V$ as in the lemma. So we have $Y^{(m)} = \{y\}$. Let $P_{n+m}$ be the probability measure that assigns mass $1/(n+m)$ to each point in the contaminated data set $X^{(n)} \cup Y^{(m)}$. Then, for any $x \ne y$, we observe that

$$\pi_{P_{n+m}}(x) \le \frac{n}{n+m} \max_u P_n H(u, 0). \tag{1}$$

This is because, for any $x \ne ty$, or $x = ty$ with $t \le 0$, there is a half-space $H(u, 0)$ containing $x$ but not $y$; therefore the above inequality holds since $\pi_{P_{n+m}}(x) \le P_{n+m} H(u, x) \le P_{n+m} H(u, 0) \le \frac{n}{n+m} \max_u P_n H(u, 0)$. For $x = ty$ with $t > 1$, clearly $\pi_{P_{n+m}}(x) = 0$ because we can choose a half-space passing through $y$ and separating $X^{(n)} \cup Y^{(m)}$ and $x$. Finally, for $x = ty$ with $t \in (0, 1)$, since $y \in \partial H(v, 0)$ for some $v \in V$, by the proof of Lemma 1 we can choose a $w \in U_v \subset V$ which satisfies $\alpha/2 > w^T y > 0$. Then $w^T (X_i - y) > 0$ for all $X_i \in X_v^{(n)}$, because $\inf_{w \in U_v} \min_{X_i \in X_v^{(n)}} w^T X_i > \alpha/2$. Hence $H(w, 0) \subset H(w, x) \subset H(w, y)$ and $X_v^{(n)} \cap H(w, y) = \emptyset$, so a data point $X_i \in H(w, x)$ implies $X_i \notin X_v^{(n)}$; that is, by the definition of $X_v^{(n)}$, $P_n H(w, x) \le P_n H(v, 0)$. Therefore, $\pi_{P_{n+m}}(x) \le P_{n+m} H(w, x) \le \frac{n}{n+m} P_n H(v, 0)$. So in any case, the above observation (1) is true.

Combining (1) with the fact that $\pi_{P_{n+m}}(y) \ge m/(n+m)$, we get that whenever $m > n \max_u P_n H(u, 0)$, $\pi_{P_{n+m}}(y) > \pi_{P_{n+m}}(x)$ for $x \ne y$, that is, $T(P_{n+m}) = y$. Thus, by letting $|y| \to \infty$, $T(P_{n+m})$ breaks down $T(P_n)$. Therefore,

$$\varepsilon_{f,e}(T, P_n) \le \frac{\max_{u \in S^{d-1}} P_n H(u, 0)}{1 + \max_{u \in S^{d-1}} P_n H(u, 0)}. \qquad \Box$$
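The two bounds of Proposition 1 are simple functions of the depth at the median and of the largest half-space mass through it. The sketch below evaluates them with exact rational arithmetic and lists the contamination sizes $m$ whose ratio $m/(n+m)$ falls between the bounds. The function names and the input values ($n = 5$, depth $3/5$, largest half-space mass $4/5$) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def proposition1_bounds(depth_at_median, max_halfspace_mass):
    """Bounds pi/(1 + pi) and M/(1 + M) from Proposition 1."""
    lower = depth_at_median / (1 + depth_at_median)
    upper = max_halfspace_mass / (1 + max_halfspace_mass)
    return lower, upper

def admissible_contamination_sizes(n, lower, upper):
    """All m >= 1 whose ratio m/(n + m) lies in [lower, upper].

    Assumes upper < 1 (true for M/(1 + M) with M <= 1), so the loop ends.
    """
    sizes = []
    m = 1
    while Fraction(m, n + m) <= upper:  # the ratio increases with m
        if Fraction(m, n + m) >= lower:
            sizes.append(m)
        m += 1
    return sizes

# Hypothetical inputs: n = 5, depth 3/5 at the median,
# largest half-space mass through the median 4/5.
lower, upper = proposition1_bounds(Fraction(3, 5), Fraction(4, 5))
sizes = admissible_contamination_sizes(5, lower, upper)
```

Because the finite sample breakdown point is always of the form $m/(n+m)$, the two bounds confine it to a short list of candidate ratios; when the depth and the largest half-space mass coincide, the bounds coincide as well.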

Corollary 1. On $\mathbb{R}^d$ with $d \ge 2$, almost surely the following bounds hold:

$$\frac{\pi_P(x_0)}{1 + \pi_P(x_0)} \le \liminf \varepsilon_{f,e}(T, P_n) \le \limsup \varepsilon_{f,e}(T, P_n) \le \frac{\sup_{u \in S^{d-1}} PH(u, x_0)}{1 + \sup_{u \in S^{d-1}} PH(u, x_0)},$$

where $x_0 \in \{\arg\max \pi_P(x)\}$.


Proof. The collection of all half-spaces is a measurable VC class, hence $\sup_{u,x} |(P_n - P)H(u, x)| \to 0$ a.s. (Vapnik and Červonenkis, 1971). Thus, $|\inf_u P_n H(u, x) - \inf_u PH(u, x)| \le \sup_{u,x} |(P_n - P)H(u, x)| \to 0$ a.s. and $|\sup_u P_n H(u, x) - \sup_u PH(u, x)| \le \sup_{u,x} |(P_n - P)H(u, x)| \to 0$ a.s. So, the corollary follows from Proposition 1. □

Liu (1990) introduced a weak version of symmetry called angular symmetry. A r.v. $X$ is angularly symmetric about a point $a$ if $-(X - a)/|X - a|$ and $(X - a)/|X - a|$ have the same law. Notice that when $P$ is angularly symmetric about $x_0$, $PH(u, x_0) = \frac{1}{2}$ for all $u$ in $S^{d-1}$. Hence we have the following slight generalization of the result of Donoho and Gasko.

Corollary 2. If the probability measure $P$ is absolutely continuous and angularly symmetric, then

$$\lim_{n \to \infty} \varepsilon_{f,e}(T, P_n) = \frac{1}{3} \quad \text{a.s.}$$
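The limiting value can be checked directly from Corollary 1. Assuming, as noted above, that $PH(u, x_0) = \frac{1}{2}$ for every $u \in S^{d-1}$, we have $\pi_P(x_0) = \sup_u PH(u, x_0) = \frac{1}{2}$, so the lower and upper bounds coincide:

$$\frac{\pi_P(x_0)}{1 + \pi_P(x_0)} = \frac{\sup_u PH(u, x_0)}{1 + \sup_u PH(u, x_0)} = \frac{1/2}{1 + 1/2} = \frac{1}{3}.$$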

The definition of a breakdown point in Davies' (1992) sense is as follows. Suppose $(\mathcal{P}, \rho)$ is a metric space for some metric $\rho$. Then the breakdown point of a location estimator $T$ at $P$ with respect to the distance $\rho$ is

$$\varepsilon^*(T, P, \rho) = \inf\left\{\varepsilon : \sup_{Q \in B(P, \varepsilon)} |T(Q) - T(P)| = \infty \right\},$$

where $B(P, \varepsilon)$ is the $\varepsilon$-ball with center $P$. For any two probability measures $P$ and $Q$, the half-space distance between $P$ and $Q$ is defined as $\rho_H(Q, P) = \sup_{u,x} |QH(u, x) - PH(u, x)|$.

We will establish some bounds for the breakdown point of the half-space median under the half-space metric. In fact, we will obtain results in a slightly more general setting.

Proposition 2. For each probability measure $P$ on $\mathbb{R}^d$, suppose that $G_P(x)$ is a non-negative, non-constant, upper semicontinuous function on $\mathbb{R}^d$ satisfying the following properties:

(1) $G_P$ is a bounded function and $G_P(x) \to 0$ as $|x| \to \infty$;
(2) $\sup_x |G_Q(x) - G_P(x)| \le \rho(Q, P)$, for some metric $\rho$ on $\mathcal{P}$.

Let $T : \mathcal{P} \to \mathbb{R}^d$ satisfy $T(P) \in \{\arg\max G_P(x)\}$ for $P \in \mathcal{P}$. Then we have the following conclusions:

(a) The set $\{\arg\max G_P(x)\}$ is a non-void compact set.
(b) $\max_{y \in \{\arg\max G_Q(x)\}} d(y, \{\arg\max G_P(x)\})$ is continuous at $P$, where, for a point $y$ and a set $A$, $d(y, A) = \min_{a \in A} |y - a|$. Hence $T$ is continuous at $P$ if $\{\arg\max G_P(x)\}$ consists of a single point. In particular, if $\rho(P_n, P) \to 0$ a.s., then $T(P_n) \to T(P)$ a.s.
(c) $\varepsilon^*(T, P, \rho) \ge \max G_P(x)/2$.
(d) $\varepsilon^*(T, P, \rho)$ is Lipschitz continuous at all $P$.


Proof. (a) For any $\varepsilon > 0$, $L_\varepsilon := \{y : G_P(y) \ge \sup_x G_P(x) - \varepsilon\}$ is a closed set because of the upper semicontinuity of $G_P$. If $\varepsilon$ is small enough, $L_\varepsilon$ is bounded because $G_P(x) \to 0$ as $|x| \to \infty$. Hence $\{L_\varepsilon : \varepsilon < \sup G_P(x)\}$ is a collection of compact nested sets. Therefore $\{\arg\max G_P(x)\} = \bigcap_{\varepsilon > 0} L_\varepsilon \ne \emptyset$ is a compact set.

(b) For any $Q \in \mathcal{P}$, denote $S_Q := \{\arg\max G_Q(x)\}$ and let $a_Q \in S_Q$. For any fixed $P \in \mathcal{P}$ and any $\varepsilon > 0$, let $\delta_\varepsilon := G_P(a_P) - \sup_{\{x : d(x, S_P) \ge \varepsilon\}} G_P(x)$. Then $\delta_\varepsilon > 0$ by upper semicontinuity. If $d(a_Q, S_P) \ge \varepsilon$, we have

$$\delta_\varepsilon \le G_P(a_P) - G_P(a_Q) \le 2 \sup_x |G_Q(x) - G_P(x)| \le 2\rho(Q, P).$$

Therefore, for any $\varepsilon > 0$, $d(a_Q, S_P) < \varepsilon$ as long as $\rho(Q, P) < \delta_\varepsilon/2$, which implies the conclusions of (b).

(c) For any $\varepsilon > 0$, there is $M > 0$ such that $G_P(x) < \varepsilon$ when $|x| > M$. If $\{Q_n\} \subset \mathcal{P}$ breaks $P$ down, i.e. $|T(Q_n)| \to \infty$, then there is $N$ such that $|T(Q_n)| > M$ whenever $n > N$. So, for $n > N$, $\max_x G_P(x) - \varepsilon \le G_P(a_P) - G_P(T(Q_n)) \le 2\rho(Q_n, P)$. Hence, the breakdown point is greater than or equal to $\max G_P(x)/2 - \varepsilon/2$. Since $\varepsilon$ is arbitrary, we get $\varepsilon^*(T, P, \rho) \ge \max G_P(x)/2$. We omit the proof for (d) since it is not needed in what follows. □
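The factor 2 in parts (b) and (c) comes from a three-term decomposition that the proof leaves implicit. One way to see it, using only that $a_Q$ maximizes $G_Q$:

$$G_P(a_P) - G_P(a_Q) = \big[G_P(a_P) - G_Q(a_P)\big] + \big[G_Q(a_P) - G_Q(a_Q)\big] + \big[G_Q(a_Q) - G_P(a_Q)\big] \le 2 \sup_x |G_Q(x) - G_P(x)|.$$

The middle bracket is non-positive because $G_Q(a_Q) = \max_x G_Q(x)$, and each of the outer brackets is at most $\sup_x |G_Q(x) - G_P(x)|$. Taking $Q = Q_n$ and $a_Q = T(Q_n)$ gives the corresponding step in (c).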

Corollary 3. For the half-space median, we have

(1) The set $\{\arg\max \pi_P(x)\}$ is a non-void compact set.
(2) If the median is unique at $P$, then $T(P_n)$ is a strongly consistent estimator of the median.
(3) $\varepsilon^*(T, P, \rho_H) \ge \max \pi_P(x)/2$.

Proof. Donoho and Gasko (1992, Lemma 6.1) proved that $\pi_P(x)$ is upper semicontinuous. The remaining conditions in the previous proposition for $G_P(x) = \pi_P(x)$ are easy to verify. □

Note that the above lower bound for the breakdown point is attainable.

The next proposition says that the breakdown point under the half-space distance is not larger than the finite sample breakdown point. The proof is omitted here.

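Proposition 3. $\varepsilon^*(T, P_n, \rho_H) \le \varepsilon_{f,e}(T, P_n)$. Hence,

$$\varepsilon^*(T, P, \rho_H) \le \limsup \varepsilon_{f,e}(T, P_n) \le \frac{\sup_{u \in S^{d-1}} PH(u, x_0)}{1 + \sup_{u \in S^{d-1}} PH(u, x_0)} \quad \text{a.s.}$$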

Next we obtain a different, sometimes better, lower bound, and improve the upper bound for $\varepsilon^*(T, P, \rho_H)$.

Proposition 4. For any distribution $P$ on $\mathbb{R}^d$, $\varepsilon^*(T, P, \rho_H) \ge 1/(d+1)$. If $P$ is an absolutely continuous distribution on $\mathbb{R}^d$, where $d \ge 2$, then $\varepsilon^*(T, P, \rho_H) \le \frac{1}{3}$.


Proof. For any $\varepsilon > 0$, there is $M > 0$ so that $\sup_{\{x : d(x, M_P) > M\}} \pi_P(x) < \varepsilon$. (Recall that $M_P = \{\arg\max \pi_P(x)\}$.) Since $\max \pi_Q(x) \ge 1/(d+1)$ for any distribution $Q$ (Donoho and Gasko 1992, Lemma 6.3), if $T(Q) \in \{x : d(x, M_P) > M\}$ we have

$$\varepsilon \ge \pi_P(T(Q)) \ge \pi_Q(T(Q)) - \big[\pi_Q(T(Q)) - \pi_P(T(Q))\big] \ge \frac{1}{d+1} - \rho_H(Q, P).$$

So, $\rho_H(Q, P) \ge 1/(d+1) - \varepsilon$. This means $\varepsilon^*(T, P, \rho_H) \ge 1/(d+1) - \varepsilon$. Since $\varepsilon$ is arbitrary, the conclusion for the lower bound follows.

For the upper bound, choose $u$ such that $PH(u, x_0) = \frac{1}{2}$, and choose a point $y \in \partial H(u, x_0)$. For arbitrarily small $\varepsilon$, let $Q_y = (\frac{2}{3} - \varepsilon)P + (\frac{1}{3} + \varepsilon)\delta_y$. Then $\rho_H(Q_y, P) \le \frac{1}{3} + \varepsilon$ and $\pi_{Q_y}(y) \ge \frac{1}{3} + \varepsilon$.

For $x \ne y$, we have two cases. If $x \notin \partial H(u, x_0)$, choose $v = u$ or $-u$ so that $y \notin H(v, x)$; hence $\pi_{Q_y}(x) \le (\frac{2}{3} - \varepsilon)/2 \le \frac{1}{3}$. For $x \in \partial H(u, x_0)$, choose a half-space $H(v, x)$ so close to $H(u, x_0)$ or $H(-u, x_0)$ that $PH(v, x) \le \frac{1}{2} + \varepsilon/2$, with the property that it does not contain $y$; hence $\pi_{Q_y}(x) \le (\frac{2}{3} - \varepsilon)(\frac{1}{2} + \varepsilon/2) \le \frac{1}{3}$. So, $\pi_{Q_y}(y) > \pi_{Q_y}(x)$ for $x \ne y$; therefore, $T(P)$ is broken down by sending $y$ to $\infty$, that is, $\varepsilon^*(T, P, \rho_H) \le \frac{1}{3} + \varepsilon$. Since $\varepsilon$ is arbitrary, we get $\varepsilon^*(T, P, \rho_H) \le \frac{1}{3}$. □
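Three estimates in the upper-bound argument are used without computation. A short sketch of each, writing $Q_y = (\frac{2}{3} - \varepsilon)P + (\frac{1}{3} + \varepsilon)\delta_y$: for every half-space $H$,

$$|Q_y H - PH| = \left(\tfrac{1}{3} + \varepsilon\right) |\delta_y(H) - PH| \le \tfrac{1}{3} + \varepsilon,$$

which gives the bound on the half-space distance between $Q_y$ and $P$. Every closed half-space $H(u, y)$ contains $y$, so $Q_y H(u, y) \ge \frac{1}{3} + \varepsilon$ for all $u$, which gives the lower bound on the depth of $y$. Finally,

$$\left(\tfrac{2}{3} - \varepsilon\right)\left(\tfrac{1}{2} + \tfrac{\varepsilon}{2}\right) = \tfrac{1}{3} - \tfrac{\varepsilon}{6} - \tfrac{\varepsilon^2}{2} \le \tfrac{1}{3},$$

so the depth of any $x \ne y$ on the boundary hyperplane stays strictly below that of $y$.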

Remarks. (1) For $d = 2$, Proposition 4 gives the exact breakdown point $\frac{1}{3}$ for any continuous distribution. (2) Combining Propositions 3 and 4, we get $\varepsilon_{f,e}(T, X^{(n)}) \ge \varepsilon^*(T, P_n, \rho_H) \ge 1/(d+1)$. So, we have removed the condition that the data set is 'in general position' from a similar result of Donoho and Gasko (1992, Proposition 3.4).

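For a given probability measure $P$, the set of all probability measures $Q$ of the form $Q = (1 - t)P + tP'$, where $0 \le t < \varepsilon$ and $P'$ is a probability measure, is called an $\varepsilon$-contamination neighborhood of $P$. If we restrict $P'$ in this definition to $P' = \delta_x$, it is called a gross error neighborhood of $P$. If we define

$$\varepsilon_c^*(T, P) = \inf\left\{\varepsilon : \sup_{P'} |T((1 - \varepsilon)P + \varepsilon P') - T(P)| = \infty \right\}$$

and

$$\varepsilon_g^*(T, P) = \inf\left\{\varepsilon : \sup_{x} |T((1 - \varepsilon)P + \varepsilon \delta_x) - T(P)| = \infty \right\}$$

(the subscripts $c$ and $g$ mark the contamination and gross error versions), then, with a proof similar to that of the previous proposition, we get the following proposition.

Proposition 5. $\varepsilon_g^*(T, P) \ge \varepsilon_c^*(T, P) \ge \max \pi_P(x)/(1 + \max \pi_P(x))$. Moreover, if $P$ is an angularly symmetric, absolutely continuous probability measure on $\mathbb{R}^d$ with $d \ge 2$, then $\varepsilon_c^*(T, P) = \frac{1}{3}$ and $\varepsilon_g^*(T, P) = \frac{1}{3}$.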

Acknowledgements

I am grateful to Professor E. Giné for guidance and encouragement.


References

Davies, B.L. (1992). The asymptotics of Rousseeuw's minimum volume ellipsoid estimator. Ann. Statist. 20, 1828-1843.

Donoho, D. (1982). Breakdown properties of multivariate location estimators. Ph.D. Qualifying paper, Dept. of Stat., Harvard Univ.

Donoho, D. and M. Gasko (1992). Breakdown properties of location estimates based on half-space depth and projected outlyingness. Ann. Statist. 20, 1803-1827.

Liu, R.Y. (1990). On a notion of data depth based on random simplices. Ann. Statist. 18, 405-414.

Lopuhaä, H.P. and P.J. Rousseeuw (1991). Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann. Statist. 19, 229-248.

Huber, P.J. (1981). Robust Statistics. Wiley, New York.

Small, C.G. (1987). Measures of centrality for multivariate and directional distributions. Can. J. Statist. 15, 31-39.

Tukey, J.W. (1974). Order statistics. Unpublished lecture notes for statistics.

Vapnik, V.N. and A.Ja. Červonenkis (1971). Necessary and sufficient conditions for the convergence of means to their expectations. Theory Probab. Appl. 26, 532-553.