Journal of Statistical Planning and Inference 46 (1995) 175-181
Robustness of the half-space median
Zhiqiang Chen 1
Department of Mathematics, William Paterson College, Wayne, NJ 07470, USA
Received 15 November 1993; revised 10 August 1994
Abstract
In this note, we extend Donoho and Gasko's (Ann. Statist. 20 (1992) 1803-1827) results on the finite sample breakdown point of the half-space median to any given data set and, as a consequence, we obtain bounds for the limiting breakdown point for general (in particular, nonsymmetric) distributions. Lower and upper bounds for the breakdown point of the half-space median with respect to the half-space metric are also established; in two-dimensional space, they yield the exact breakdown point. The exact 'gross error neighborhood' breakdown point for symmetric distributions is also given.
AMS Subject Classification: Primary 62F35, 62H12.
Key words: Half-space median; Sample half-space median; Breakdown point; Robustness
1. Introduction
There are several quantitative measures of the robustness of a location estimator.
Perhaps the most popular are the finite sample breakdown points, obtained by enlarging the sample with some contamination points (or, respectively, replacing some sample points with 'bad' points) and then considering the smallest proportion of contamination points (resp. replacements) needed to upset the estimator (see Donoho and Gasko, 1992, resp. Lopuhaä and Rousseeuw, 1991). Davies (1992) argues that, although the finite sample breakdown point is easy to obtain and to explain to non-statisticians, a breakdown point defined via some equivariant metric on all probability measures is preferable, and he suggests that the distance should be chosen according to the problem.
1 Research partially supported by NSF Grant No. DMS-9300725 and by The University of Connecticut Research Foundation Grant No. 441092.
Let us recall the definition of the half-space median. Let $\mathscr{P}$ be the set of all probability measures on $R^d$. For any $P \in \mathscr{P}$, define

$$\pi_P(x) = \inf_{u \in S^{d-1}} P H(u, x),$$

where $H(u, x) := \{y : u'y \le u'x\}$ and $S^{d-1} := \{u \in R^d : |u| = 1\}$. Let $M_P := \{\arg\max \pi_P(x)\}$. Small (1987) showed that $M_P$ is a non-void convex compact set. The half-space median is any choice of an element $T(P)$ from the set $M_P$; the empirical half-space median $T(P_n)$ is then a natural estimator of the 'true' median. ($T(P)$ can be any selection from $\{\arg\max \pi_P(x)\}$ in this note but, in order for the median to keep the affine equivariance property, one can take an 'average' over $\{\arg\max \pi_P(x)\}$, as Donoho and Gasko proposed.) The formal definition of the finite sample (enlargement) breakdown point of a location estimator is as follows. Let $X^{(n)}$ be a given data set of size $n$, and let $T$ be the estimator of interest. Consider adjoining to $X^{(n)}$ another data set $Y^{(m)}$ of size $m$. The breakdown point $\varepsilon_{f,e}(T, X^{(n)})$ is
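To make the depth in this definition concrete, here is a small numerical sketch (my own illustration, not part of the paper; the function name and the direction-sampling scheme are assumptions). It approximates $\pi_{P_n}(x)$ for an empirical measure by minimizing $P_n H(u, x)$ over randomly sampled unit directions $u$; sampling directions can only overestimate the exact depth, so this is an approximation rather than an exact algorithm.

```python
import numpy as np

def halfspace_depth(x, X, n_dir=2000, seed=0):
    """Approximate pi_{P_n}(x) = inf_u P_n H(u, x) for the empirical
    measure of the rows of X, minimizing over n_dir random directions.
    Since H(u, x) = {y : u.y <= u.x}, a point y lies in H(u, x)
    exactly when u.(y - x) <= 0."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dir, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    frac = ((X - x) @ U.T <= 0).mean(axis=0)  # P_n H(u, x) for each u
    return frac.min()

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                   # roughly centrosymmetric sample
print(halfspace_depth(np.zeros(2), X))          # deep central point
print(halfspace_depth(np.array([5.0, 0.0]), X)) # outlying point, depth near 0
```

In two dimensions the depth can be computed exactly by sorting the angles of the $X_i - x$; the random-direction version is used here only for brevity.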
$$\varepsilon_{f,e}(T, X^{(n)}) := \min\Big\{ \frac{m}{n+m} : \sup_{Y^{(m)}} \big| T(X^{(n)} \cup Y^{(m)}) - T(X^{(n)}) \big| = \infty \Big\}.$$
Donoho and Gasko (1992) proved the following result: If the probability measure P is absolutely continuous (with respect to Lebesgue measure) and centrosymmetric, then
$$\lim_{n \to \infty} \varepsilon_{f,e}(T, X^{(n)}) = \tfrac{1}{3} \quad \text{a.s.}$$
In this note, we extend Donoho and Gasko's breakdown result to the general case and establish sharp upper and lower bounds for the breakdown point of the half-space median in Davies' sense, which originated in Huber (1981). We also get the exact breakdown points for the half-space median in the symmetric distribution case using either contamination neighborhoods or gross error neighborhoods.
2. Results
Let us first give a lemma.
Lemma 1. $V = \{v \in S^{d-1} : P_n H(v, 0) = \inf_u P_n H(u, 0)\}$ is a non-void relatively open subset of $S^{d-1}$.
Proof. Since $P_n H(u, 0)$ takes only finitely many values, $\inf_u P_n H(u, 0)$ is attained at some $v \in S^{d-1}$, so $V \ne \emptyset$. For $v \in V$, let $X_v^{(n)} := \{X_i \in X^{(n)} : X_i \notin H(v, 0)\}$. Then $\min_{X_i \in X_v^{(n)}} v^{\mathrm T} X_i =: \alpha > 0$. By the continuity of the inner product and the fact that $X_v^{(n)}$ contains only finitely many points, there exists a relatively open neighborhood $U_v$ of $v$ such that

$$\inf_{w \in U_v} \min_{X_i \in X_v^{(n)}} w^{\mathrm T} X_i > \frac{\alpha}{2}.$$

Therefore $H(w, 0) \cap X_v^{(n)} = \emptyset$ for any $w \in U_v$; hence $P_n H(w, 0) \le P_n H(v, 0)$, which implies that $w \in V$. $\Box$
Proposition 1. Let $X^{(n)}$ be any data set of size $n$ on $R^d$ with $d \ge 2$ and let $P_n$ be the empirical measure based on $X^{(n)}$. Then

$$\frac{\pi_{P_n}(x_0)}{1 + \pi_{P_n}(x_0)} \;\le\; \varepsilon_{f,e}(T, X^{(n)}) \;\le\; \frac{\max_{u \in S^{d-1}} P_n H(u, x_0)}{1 + \max_{u \in S^{d-1}} P_n H(u, x_0)},$$

where $x_0 \in \{\arg\max \pi_{P_n}(x)\}$.
Proof. For the lower bound, the proof is identical to that of Donoho and Gasko's (1992) Proposition 3.3, so we only prove the upper bound here. Since the half-space median is affine equivariant, w.l.o.g. let $x_0 = 0$. Place $m$ contamination points at the same site $y$, where $y$ lies outside the convex hull of $X^{(n)}$ and $y \in \partial H(v, 0)$ for some $v \in V$ as in the lemma; so $Y^{(m)} = \{y\}^m$. Let $P_{n+m}$ be the probability measure that assigns mass $1/(n+m)$ to each point of the contaminated data set $X^{(n)} \cup Y^{(m)}$. Then, for any $x \ne y$, we observe that

$$\pi_{P_{n+m}}(x) \le \frac{n}{n+m} \max_u P_n H(u, 0). \qquad (1)$$

This is because, for any $x$ not of the form $ty$, or $x = ty$ with $t \le 0$, there is a half-space $H(u, 0)$ containing $x$ but not $y$; the inequality then holds since $\pi_{P_{n+m}}(x) \le P_{n+m} H(u, x) \le P_{n+m} H(u, 0) \le n/(n+m) \max_u P_n H(u, 0)$. For $x = ty$ with $t > 1$, clearly $\pi_{P_{n+m}}(x) = 0$, because we can choose a half-space passing through $y$ that separates $X^{(n)} \cup Y^{(m)}$ from $x$. Finally, for $x = ty$ with $t \in (0, 1)$: since $y \in \partial H(v, 0)$ for some $v \in V$, by the proof of Lemma 1 we can choose $w \in U_v \subset V$ satisfying $\alpha/2 > w^{\mathrm T} y > 0$; then $w^{\mathrm T}(X_i - y) > 0$ for all $X_i \in X_v^{(n)}$, because $\inf_{w \in U_v} \min_{X_i \in X_v^{(n)}} w^{\mathrm T} X_i > \alpha/2$. Hence $H(w, 0) \subset H(w, x) \subset H(w, y)$ and $X_v^{(n)} \cap H(w, y) = \emptyset$, so a data point $X_i \in H(w, x)$ implies $X_i \notin X_v^{(n)}$; that is, by the definition of $X_v^{(n)}$, $P_n H(w, x) \le P_n H(v, 0)$. Therefore $\pi_{P_{n+m}}(x) \le P_{n+m} H(w, x) \le n/(n+m)\, P_n H(v, 0)$. So in every case observation (1) holds.

Combining (1) with the fact that $\pi_{P_{n+m}}(y) \ge m/(n+m)$, we get that whenever $m > n \max_u P_n H(u, 0)$, $\pi_{P_{n+m}}(y) > \pi_{P_{n+m}}(x)$ for $x \ne y$, that is, $T(P_{n+m}) = y$. Thus, by letting $|y| \to \infty$, $T(P_{n+m})$ breaks down. Therefore,

$$\varepsilon_{f,e}(T, X^{(n)}) \le \frac{\max_{u \in S^{d-1}} P_n H(u, 0)}{1 + \max_{u \in S^{d-1}} P_n H(u, 0)}. \qquad \Box$$
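The mechanism of this proof can be checked numerically. The sketch below is my own illustration, with two simplifications relative to the construction above: the contamination site is an arbitrary distant point rather than a point on $\partial H(v, 0)$, and the median is approximated by the deepest point among the data and the contamination site, with the depth minimized over random directions. It shows the empirical half-space median staying central while $m$ is below roughly $n \max_u P_n H(u, 0) \approx n/2$, and jumping to the contamination site above that threshold.

```python
import numpy as np

def depth(x, X, U):
    # empirical half-space depth of x in sample X, minimized over directions U
    return ((X - x) @ U.T <= 0).mean(axis=0).min()

rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 2))
U /= np.linalg.norm(U, axis=1, keepdims=True)

X = rng.normal(size=(60, 2))      # X^(n), roughly centrosymmetric, n = 60
y = np.array([50.0, 50.0])        # distant contamination site

def median_proxy(m):
    # adjoin m copies of y, then return the deepest candidate point
    Z = np.vstack([X, np.tile(y, (m, 1))])
    depths = [depth(c, Z, U) for c in Z]
    return Z[int(np.argmax(depths))]

print(median_proxy(10))   # m/(n+m) = 1/7: the median stays near the center
print(median_proxy(60))   # m/(n+m) = 1/2: the median equals y (breakdown)
```

For this roughly centrosymmetric sample the upper bound of Proposition 1 is about $(1/2)/(1 + 1/2) = 1/3$, consistent with the jump observed between the two contamination fractions.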
Corollary 1. On $R^d$ with $d \ge 2$, almost surely the following bounds hold:

$$\frac{\pi_P(x_0)}{1 + \pi_P(x_0)} \le \liminf_n \varepsilon_{f,e}(T, P_n) \le \limsup_n \varepsilon_{f,e}(T, P_n) \le \frac{\sup_{u \in S^{d-1}} P H(u, x_0)}{1 + \sup_{u \in S^{d-1}} P H(u, x_0)},$$

where $x_0 \in \{\arg\max \pi_P(x)\}$.
Proof. The collection of all half-spaces is a measurable VC class, hence $\sup_{u,x} |(P_n - P) H(u, x)| \to 0$ a.s. (Vapnik and Červonenkis, 1971). Thus $|\inf_u P_n H(u, x) - \inf_u P H(u, x)| \le \sup_{u,x} |(P_n - P) H(u, x)| \to 0$ a.s. and $|\sup_u P_n H(u, x) - \sup_u P H(u, x)| \le \sup_{u,x} |(P_n - P) H(u, x)| \to 0$ a.s. So the corollary follows from Proposition 1. $\Box$
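The uniform convergence used here can be watched directly in a simulation. For $P = N(0, I_2)$ and a unit vector $u$, $u \cdot Y \sim N(0, 1)$, so $P H(u, x) = \Phi(u \cdot x)$ in closed form. The sketch below is my own illustration; the supremum is approximated over a fixed grid of random $(u, x)$ pairs, so it underestimates the true supremum, but the decay with $n$ is still visible.

```python
import numpy as np
from math import erf, sqrt

Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))  # N(0,1) cdf

rng = np.random.default_rng(0)
U = rng.normal(size=(400, 2))
U /= np.linalg.norm(U, axis=1, keepdims=True)
Xs = rng.normal(size=(400, 2))     # candidate x's, paired with the rows of U

def sup_dev(n):
    # approximate sup_{u,x} |(P_n - P) H(u, x)| for P = N(0, I_2),
    # using P H(u, x) = Phi(u.x) and a sample of size n from P
    Y = rng.normal(size=(n, 2))
    thr = (U * Xs).sum(axis=1)               # u.x for each paired (u, x)
    emp = (Y @ U.T <= thr).mean(axis=0)      # P_n H(u, x)
    true = np.array([Phi(t) for t in thr])   # P H(u, x)
    return np.abs(emp - true).max()

for n in (100, 1000, 10000):
    print(n, sup_dev(n))   # the deviations shrink as n grows
```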
Liu (1990) introduced a weak version of symmetry called angular symmetry. A random vector $X$ is angularly symmetric about a point $a$ if $-(X - a)/|X - a|$ and $(X - a)/|X - a|$ have the same law. Notice that when $P$ is angularly symmetric about $x_0$, $P H(u, x_0) = \tfrac12$ for all $u \in S^{d-1}$. Hence we have the following slight generalization of the result of Donoho and Gasko.
Corollary 2. If the probability measure $P$ is absolutely continuous and angularly symmetric, then

$$\lim_{n \to \infty} \varepsilon_{f,e}(T, P_n) = \tfrac{1}{3} \quad \text{a.s.}$$
The definition of a breakdown point in Davies' (1992) sense is as follows. Suppose $(\mathscr{P}, \rho)$ is a metric space for some metric $\rho$. Then the breakdown point of a location estimator $T$ at $P$ with respect to the distance $\rho$ is

$$\varepsilon^*(T, P, \rho) = \inf\Big\{ \varepsilon : \sup_{Q \in B(P, \varepsilon)} |T(Q) - T(P)| = \infty \Big\},$$

where $B(P, \varepsilon)$ is the $\varepsilon$-ball with center $P$. For any two probability measures $P$ and $Q$, the half-space distance between $P$ and $Q$ is defined as $\rho_H(Q, P) = \sup_{u,x} |Q H(u, x) - P H(u, x)|$. We will establish some bounds for the breakdown point of the half-space median under the half-space metric. In fact, we will obtain results in a slightly more general setting.
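The half-space distance between two empirical measures can be estimated by restricting the supremum to a finite grid of random directions $u$ and data points $x$. The sketch below is my own (the helper name is an assumption, and it yields a lower bound on $\rho_H$, since the supremum runs only over the grid); it illustrates that the distance is small for two samples from the same law and close to 1 for well-separated laws.

```python
import numpy as np

def halfspace_dist(P, Q, n_dir=300, seed=0):
    """Monte Carlo lower bound on rho_H(P, Q) = sup_{u,x} |P H(u,x) - Q H(u,x)|
    for two empirical samples P and Q (rows = observations): the supremum
    is taken over n_dir random directions and the pooled data points x."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dir, P.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    best = 0.0
    for x in np.vstack([P, Q]):
        fp = ((P - x) @ U.T <= 0).mean(axis=0)   # P_n H(u, x)
        fq = ((Q - x) @ U.T <= 0).mean(axis=0)   # Q_m H(u, x)
        best = max(best, float(np.abs(fp - fq).max()))
    return best

rng = np.random.default_rng(2)
A = rng.normal(size=(300, 2))
B = rng.normal(size=(300, 2))        # same law: distance should be small
C = rng.normal(size=(300, 2)) + 3.0  # shifted law: distance should be large
print(halfspace_dist(A, B))
print(halfspace_dist(A, C))
```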
Proposition 2. For each probability measure $P$ on $R^d$, suppose that $G_P(x)$ is a non-negative, non-constant, upper semicontinuous function on $R^d$ satisfying the following properties:
(1) $G_P$ is a bounded function and $G_P(x) \to 0$ as $|x| \to \infty$;
(2) $\sup_x |G_Q(x) - G_P(x)| \le \rho(Q, P)$, for some metric $\rho$ on $\mathscr{P}$.
Let $T : \mathscr{P} \to R^d$ satisfy $T(P) \in \{\arg\max G_P(x)\}$ for $P \in \mathscr{P}$. Then we have the following conclusions:
(a) The set $\{\arg\max G_P(x)\}$ is a non-void compact set.
(b) $\max_{y \in \{\arg\max G_Q(x)\}} d(y, \{\arg\max G_P(x)\})$ is continuous at $P$, and hence $T$ is continuous at $P$ if $\{\arg\max G_P(x)\}$ consists of a single point. In particular, if $\rho(P_n, P) \to 0$ a.s., then $T(P_n) \to T(P)$ a.s. Here, for a point $y$ and a set $A$, $d(y, A) = \min_{a \in A} |y - a|$.
(c) $\varepsilon^*(T, P, \rho) \ge \max G_P(x)/2$.
(d) $\varepsilon^*(T, P, \rho)$ is Lipschitz continuous in $P$.
Proof. (a) For any $\varepsilon > 0$, $L_\varepsilon := \{y : G_P(y) \ge \sup_x G_P(x) - \varepsilon\}$ is a closed set by the upper semicontinuity of $G_P$. If $\varepsilon$ is small enough, $L_\varepsilon$ is bounded, because $G_P(x) \to 0$ as $|x| \to \infty$. Hence $\{L_\varepsilon : \varepsilon < \sup G_P(x)\}$ is a collection of nested compact sets. Therefore $\{\arg\max G_P(x)\} = \bigcap_{\varepsilon > 0} L_\varepsilon \ne \emptyset$ is a compact set.

(b) For any $Q \in \mathscr{P}$, denote $S_Q := \{\arg\max G_Q(x)\}$ and let $a_Q \in S_Q$. Fix $P \in \mathscr{P}$ and, for any $\varepsilon > 0$, let $\delta_\varepsilon := G_P(a_P) - \sup_{\{x : d(x, S_P) \ge \varepsilon\}} G_P(x)$. Then $\delta_\varepsilon > 0$ by upper semicontinuity. If $d(a_Q, S_P) > \varepsilon$, we have

$$\delta_\varepsilon \le G_P(a_P) - G_P(a_Q) \le 2 \sup_x |G_Q(x) - G_P(x)| \le 2\rho(Q, P).$$

Therefore, for any $\varepsilon > 0$, $d(a_Q, S_P) \le \varepsilon$ as long as $\rho(Q, P) < \delta_\varepsilon/2$, which implies the conclusions of (b).

(c) For any $\varepsilon > 0$ there is $M > 0$ such that $G_P(x) < \varepsilon$ when $|x| > M$. If $\{Q_n\} \subset \mathscr{P}$ breaks $P$ down, i.e. $|T(Q_n)| \to \infty$, then there is $N$ such that $|T(Q_n)| > M$ whenever $n > N$. So, for $n > N$, $\max_x G_P(x) - \varepsilon \le G_P(a_P) - G_P(T(Q_n)) \le 2\rho(Q_n, P)$. Hence the breakdown point is greater than or equal to $\max G_P(x)/2 - \varepsilon/2$. Since $\varepsilon$ is arbitrary, we get $\varepsilon^*(T, P, \rho) \ge \max G_P(x)/2$. We omit the proof of (d), since it is not needed in what follows. $\Box$
Corollary 3. For the half-space median, we have:
(1) The set $\{\arg\max \pi_P(x)\}$ is a non-void compact set.
(2) If the median is unique at $P$, then $T(P_n)$ is a strongly consistent estimator of the median.
(3) $\varepsilon^*(T, P, \rho_H) \ge \max \pi_P(x)/2$.

Proof. Donoho and Gasko (1992, Lemma 6.1) proved that $\pi_P(x)$ is upper semicontinuous. The remaining conditions of the previous proposition for $G_P(x) = \pi_P(x)$ are easy to verify. $\Box$
Note that the above lower bound for the breakdown point is attainable. The next proposition says that the breakdown point under the half-space distance is not larger than the finite sample breakdown point; the proof is omitted here.
Proposition 3. $\varepsilon^*(T, P_n, \rho_H) \le \varepsilon_{f,e}(T, P_n)$. Hence,

$$\varepsilon^*(T, P, \rho_H) \le \limsup_n \varepsilon_{f,e}(T, P_n) \le \frac{\sup_{u \in S^{d-1}} P H(u, x_0)}{1 + \sup_{u \in S^{d-1}} P H(u, x_0)} \quad \text{a.s.}$$
Next we obtain a different, sometimes better, lower bound, and improve the upper bound for $\varepsilon^*(T, P, \rho_H)$.
Proposition 4. For any distribution $P$ on $R^d$, $\varepsilon^*(T, P, \rho_H) \ge 1/(d+1)$. If $P$ is an absolutely continuous distribution on $R^d$, where $d \ge 2$, then $\varepsilon^*(T, P, \rho_H) \le \tfrac13$.
Proof. For any $\varepsilon > 0$ there is $M > 0$ so that $\sup_{\{x : d(x, M_P) > M\}} \pi_P(x) < \varepsilon$. (Recall that $M_P = \{\arg\max \pi_P(x)\}$.) Since $\max \pi_Q(x) \ge 1/(d+1)$ for any distribution $Q$ (Donoho and Gasko, 1992, Lemma 6.3), if $T(Q) \in \{x : d(x, M_P) > M\}$ we have

$$\varepsilon > \pi_P(T(Q)) \ge \pi_Q(T(Q)) - \big(\pi_Q(T(Q)) - \pi_P(T(Q))\big) \ge \frac{1}{d+1} - \rho_H(Q, P).$$

So $\rho_H(Q, P) \ge 1/(d+1) - \varepsilon$. This means $\varepsilon^*(T, P, \rho_H) \ge 1/(d+1) - \varepsilon$. Since $\varepsilon$ is arbitrary, the conclusion for the lower bound follows.
For the upper bound, choose $u$ such that $P H(u, x_0) = \tfrac12$, and choose a point $y \in \partial H(u, x_0)$. For arbitrarily small $\varepsilon > 0$, let $Q_y = (\tfrac23 - \varepsilon) P + (\tfrac13 + \varepsilon) \delta_y$. Then $\rho_H(Q_y, P) \le \tfrac13 + \varepsilon$ and $\pi_{Q_y}(y) \ge \tfrac13 + \varepsilon$.

For $x \ne y$ we have two cases. If $x \notin \partial H(u, x_0)$, choose $v = u$ or $-u$ so that $y \notin H(v, x)$ and $P H(v, x) \le \tfrac12$; hence $\pi_{Q_y}(x) \le (\tfrac23 - \varepsilon)/2 < \tfrac13$. If $x \in \partial H(u, x_0)$, choose a half-space $H(v, x)$ so close to $H(u, x_0)$ or $H(-u, x_0)$ that $P H(v, x) \le \tfrac12 + \varepsilon/2$, with the property that it does not contain $y$; hence $\pi_{Q_y}(x) \le (\tfrac23 - \varepsilon)(\tfrac12 + \varepsilon/2) < \tfrac13$. So $\pi_{Q_y}(y) > \pi_{Q_y}(x)$ for $x \ne y$, and therefore $T(P)$ is broken down by sending $y$ to $\infty$; that is, $\varepsilon^*(T, P, \rho_H) \le \tfrac13 + \varepsilon$. Since $\varepsilon$ is arbitrary, we get $\varepsilon^*(T, P, \rho_H) \le \tfrac13$. $\Box$
Remarks. (1) For $d = 2$, Proposition 4 gives the exact breakdown point $\tfrac13$ for any continuous distribution. (2) Combining Propositions 3 and 4, we get $\varepsilon_{f,e}(T, X^{(n)}) \ge \varepsilon^*(T, P_n, \rho_H) \ge 1/(d+1)$. So we have removed the condition that the data set be 'in general position' from a similar result of Donoho and Gasko (1992, Proposition 3.4).
For a given probability measure $P$, the set of all probability measures $Q$ of the form $Q = (1-t) P + t P'$, where $0 \le t < \varepsilon$ and $P'$ is a probability measure, is called an $\varepsilon$-contamination neighborhood of $P$. If we restrict $P'$ in this definition to $P' = \delta_x$, it is called a gross error neighborhood of $P$. If we define
$$\bar\varepsilon^*(T, P) = \inf\Big\{ \varepsilon : \sup_{P'} \big| T\big((1-\varepsilon) P + \varepsilon P'\big) - T(P) \big| = \infty \Big\}$$

and

$$\varepsilon^*(T, P) = \inf\Big\{ \varepsilon : \sup_{x} \big| T\big((1-\varepsilon) P + \varepsilon \delta_x\big) - T(P) \big| = \infty \Big\},$$
then, with a proof similar to that of the previous proposition, we get the following proposition.
Proposition 5. $\varepsilon^*(T, P) \ge \bar\varepsilon^*(T, P) \ge \max \pi_P(x)/(1 + \max \pi_P(x))$. Moreover, if $P$ is an angularly symmetric, absolutely continuous probability measure on $R^d$ with $d \ge 2$, then $\bar\varepsilon^*(T, P) = \varepsilon^*(T, P) = \tfrac13$.
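A back-of-the-envelope check of the symmetric case (my own stylized calculation, not from the paper): under $Q = (1 - \varepsilon) P + \varepsilon \delta_y$ with $P$ angularly symmetric and $|y|$ large, the deepest half-space at $y$ carries mass about $\varepsilon$, while the depth of the centre of symmetry is at least $(1 - \varepsilon)/2$, so the contamination wins exactly when $\varepsilon \ge (1 - \varepsilon)/2$.

```python
import numpy as np

# Stylized depths under Q = (1 - eps) P + eps * delta_y, P angularly symmetric:
#   pi_Q(y)      ~ eps            (half-space at y pointing away from the data)
#   pi_Q(centre) ~ (1 - eps) / 2  (half-space missing y)
eps = np.linspace(0.0, 0.5, 5001)
far_wins = eps >= (1.0 - eps) / 2.0
crossover = eps[far_wins][0]
print(crossover)   # within 1e-3 of 1/3, the breakdown value in Proposition 5
```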
Acknowledgements
I am grateful to Professor E. Giné for guidance and encouragement.
References
Davies, P.L. (1992). The asymptotics of Rousseeuw's minimum volume ellipsoid estimator. Ann. Statist. 20, 1828-1843.
Donoho, D. (1982). Breakdown properties of multivariate location estimators. Ph.D. qualifying paper, Dept. of Statistics, Harvard Univ.
Donoho, D. and M. Gasko (1992). Breakdown properties of location estimates based on half-space depth and projected outlyingness. Ann. Statist. 20, 1803-1827.
Huber, P.J. (1981). Robust Statistics. Wiley, New York.
Liu, R.Y. (1990). On a notion of data depth based on random simplices. Ann. Statist. 18, 405-414.
Lopuhaä, H.P. and P.J. Rousseeuw (1991). Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann. Statist. 19, 229-248.
Small, C.G. (1987). Measures of centrality for multivariate and directional distributions. Can. J. Statist. 15, 31-39.
Tukey, J.W. (1974). Order statistics. Unpublished lecture notes for statistics.
Vapnik, V.N. and A.Ya. Červonenkis (1971). Necessary and sufficient conditions for the convergence of means to their expectations. Theory Probab. Appl. 26, 532-553.