Simplicial Depth: An Improved Definition, Analysis, and Efficiency in the Finite Sample Case Michael A. Burr, Eynat Rafalin, and Diane L. Souvaine Tufts University www.cs.tufts.edu/research/geometry CCCG 2004 NSF grant #EIA-99- 96237

Simplicial Depth: An Improved Definition, Analysis, and Efficiency in the Finite Sample Case

Michael A. Burr, Eynat Rafalin, and Diane L. Souvaine

Tufts University

www.cs.tufts.edu/research/geometry

CCCG 2004 NSF grant #EIA-99-96237

Introduction

• Introduction to Data Depth
  – Why?
  – Examples
  – Desirable Properties

• Simplicial Depth
  – Definition
  – Properties
  – Problems

• Revised Definition
  – Definition
  – Properties

• Ongoing work

What is Data Depth and Why?

• Measures how deep (central) a given point $p \in \mathbb{R}^d$ is relative to a distribution or a data cloud.
  – Deals with the shape of the data.
  – Can be thought of as a measure of how well a point characterizes a data set.

• Provides an alternative to classical statistical analysis.
  – No assumption about the underlying distribution of the data.

• Deals with outliers.

• Why study?
  – Many measures are geometric in nature.
  – Can be computationally expensive to compute depth.

Examples

• Half-Space (Tukey, Location) (Tukey 75)
• Regression Depth (Rousseeuw and Hubert 94)
• Simplicial Depth (Liu 90)
• … and many more.

[Figure: half-space depth of a point p with respect to a data set S. A line through p leaves 3 data points in one half-plane and 2 data points in the other; $HD(S; p) = 2$.]
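The half-space depth in the figure can be sketched in code. The block below is a minimal brute-force illustration for the plane, assuming the usual reading of half-space depth as the smallest number of data points in any closed half-plane whose boundary passes through p. The function name `halfspace_depth` and the sample data are hypothetical and not taken from the talk; the sketch runs in quadratic time and is not one of the efficient algorithms from the literature.

```python
def halfspace_depth(points, p):
    """Smallest number of data points in a closed half-plane bounded by a line through p.

    Brute force, exact for integer or Fraction coordinates.
    """
    vecs = [(x - p[0], y - p[1]) for x, y in points]
    at_p = sum(1 for v in vecs if v == (0, 0))      # points equal to p lie in every half-plane
    others = [v for v in vecs if v != (0, 0)]
    if not others:
        return at_p
    best = len(others)
    for vx, vy in others:
        # The count only changes at directions perpendicular to some data vector.
        for ux, uy in ((-vy, vx), (vy, -vx)):
            count = 0
            for wx, wy in others:
                dot = wx * ux + wy * uy
                # Evaluate just counterclockwise of u: ties are broken by the
                # sign of the dot product with u rotated by 90 degrees.
                if dot > 0 or (dot == 0 and (-uy * wx + ux * wy) > 0):
                    count += 1
            best = min(best, count)
    return at_p + best


square = [(0, 0), (4, 0), (4, 4), (0, 4)]   # hypothetical data set
print(halfspace_depth(square, (2, 2)))      # 2: any line through the center splits the corners 2/2
print(halfspace_depth(square, (5, 5)))      # 0: the point lies outside the data cloud
```

Evaluating each direction slightly rotated from a critical one is a design choice that avoids floating-point angles: a closed half-plane whose boundary hits a data point never contains fewer points than a nearby generic one, so the minimum is found among the generic directions.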

Desirable Properties of Data Depth

• Liu (90) / Serfling and Zuo (00)
  – P1: Affine Invariance
  – P2: Maximality at Center
  – P3: Monotonicity Relative to Deepest Point
  – P4: Vanishing at Infinity

• We propose (BRS 04)
  – P5: Invariance Under Dimensions Change

Affine Invariance (P1)

A is an affine transformation.

[Figure: a data set S and a point p, shown together with their images under A.]
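Affine invariance can be checked numerically for a concrete depth function. The sketch below uses Liu's sample simplicial depth in the plane (defined later in the talk) and a hypothetical invertible affine map; the function names, the map, and the data are illustrative assumptions, and the data points are assumed to be in general position (no three collinear).

```python
from fractions import Fraction
from itertools import combinations


def orient(a, b, c):
    """Twice the signed area of triangle abc."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def closed_triangle_fraction(points, p):
    """Fraction of triangles on `points` whose closed hull contains p.

    Assumes no three data points are collinear.
    """
    triangles = list(combinations(points, 3))
    hits = 0
    for a, b, c in triangles:
        signs = (orient(a, b, p), orient(b, c, p), orient(c, a, p))
        if min(signs) >= 0 or max(signs) <= 0:   # p is never on both sides of the edges
            hits += 1
    return Fraction(hits, len(triangles))


def affine(q):
    """Hypothetical affine map: linear part [[2, 1], [1, -1]] (determinant -3), shift (3, -1)."""
    x, y = q
    return (2 * x + y + 3, x - y - 1)


S = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 2)]   # hypothetical data set
for p in [(2, 1), (2, 2), (0, 0), (5, 5)]:
    before = closed_triangle_fraction(S, p)
    after = closed_triangle_fraction([affine(x) for x in S], affine(p))
    print(p, before, after, before == after)
```

The check works because an invertible affine map multiplies every orientation value by the same nonzero determinant, so the containment test gives the same answer before and after the map.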

Maximality at Center (P2)

p is the center

q is any point

Monotonicity Relative to Deepest Point (P3)

$$D(S; q) \le D(S; r), \qquad r = p + \alpha (q - p), \quad \alpha \in [0, 1]$$

p is the deepest point

q is any point

r is a point between p and q

Vanishing at Infinity (P4)

$$\lim_{\|q\| \to \infty} D(S; q) = 0$$

q is far from the data cloud

Invariance Under Dimensions Change (P5)

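$$\frac{D_d(S; p)}{D_b(S; p)} = C_{b,d}$$

Here $D_k(S; p)$ is the depth of p when S is evaluated as a data set in $\mathbb{R}^k$, and $C_{b,d}$ depends only on the dimensions b and d.

[Figure: a data set and a point p.]

Is this an $\mathbb{R}^2$ data set?

Is this an $\mathbb{R}^1$ data set?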

Simplicial Depth (Liu 90)

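• The simplicial depth of a point p with respect to a probability distribution F in $\mathbb{R}^d$ is the probability that a random closed simplex in $\mathbb{R}^d$ contains p:

$$SD_{Liu}(F; p) = P_F\left(p \in S[X_1, \ldots, X_{d+1}]\right)$$

where $S[X_1, \ldots, X_{d+1}]$ is a closed simplex formed by d+1 random observations from F.

• The simplicial depth of a point p with respect to a data set $S = \{X_1, \ldots, X_n\}$ in $\mathbb{R}^d$ is the fraction of closed simplices formed by d+1 points of S containing p:

$$SD_{Liu}(S; p) = \binom{n}{d+1}^{-1} \sum_{1 \le i_1 < \cdots < i_{d+1} \le n} I\left(p \in S[X_{i_1}, \ldots, X_{i_{d+1}}]\right)$$

where I is the indicator function. In the plane (d = 2) the normalizing factor is $\binom{n}{3}^{-1}$.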
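The sample definition translates directly into a brute-force count. The following is a minimal sketch for the plane (d = 2), assuming data points in general position (no three collinear); the names `orient`, `in_closed_triangle`, and `liu_depth` and the square data set are hypothetical, and the cubic-time enumeration is for illustration only.

```python
from fractions import Fraction
from itertools import combinations


def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_closed_triangle(p, a, b, c):
    """True if p lies in the closed triangle abc (interior, edges, or vertices)."""
    if orient(a, b, c) == 0:
        raise ValueError("degenerate triangle: this sketch assumes general position")
    signs = (orient(a, b, p), orient(b, c, p), orient(c, a, p))
    return min(signs) >= 0 or max(signs) <= 0


def liu_depth(points, p):
    """Fraction of the C(n, 3) closed triangles on `points` that contain p."""
    triangles = list(combinations(points, 3))
    hits = sum(in_closed_triangle(p, a, b, c) for a, b, c in triangles)
    return Fraction(hits, len(triangles))


square = [(0, 0), (4, 0), (4, 4), (0, 4)]   # hypothetical data set, C(4, 3) = 4 triangles
print(liu_depth(square, (1, 2)))            # 1/2: a point inside one cell
print(liu_depth(square, (2, 2)))            # 1: the center lies on the boundary of all 4 triangles
```

In this toy example the center of the square has depth 1 while every surrounding cell has depth 1/2, which illustrates how counting closed simplices inflates the depth on cell boundaries.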

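Sample Version of Simplicial Depth

• The simplicial depth of a point p with respect to a data set $S = \{X_1, \ldots, X_n\}$ in $\mathbb{R}^d$ is the fraction of closed simplices formed by d+1 points of S containing p.

Example (n = 6 data points in the plane):

– Total number of simplices = $\binom{6}{3} = 20$

– p is contained in 6 simplices

– The depth of p = $\frac{6}{20} = 0.3$

[Figure: the data points and the point p; the cells of the arrangement are labeled with their depths, which take the values 0.2, 0.3, and 0.4.]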

Properties

• Is a statistical depth function in the continuous case. (Liu 90)

• Is affine invariant (P1) and vanishes at infinity (P4) in the sample case. (Serfling and Zuo 00)

Problems in the Sample Case

• Does not always attain maximality at the center (P2) and does not always have monotonicity relative to the deepest point (P3). (Serfling and Zuo 00)

• The depth on the boundary of cells is at least the depth in each of the adjacent cells – causes discontinuities.

• Does not have invariance under dimensions change (P5).

Simplicial Depth (Liu 90)

Total number of simplices = $\binom{5}{3} = 10$

[Figure: a data set of five points with labels A, B, C, D, E, X, and Y. Cells, edges, and vertices are labeled with depth values between 0.3 and 0.8.]

Averaging number of closed and open simplices containing a point (BRS 04)

Revised Definition (BRS 04)

• The simplicial depth of a point p with respect to a data set $S = \{X_1, \ldots, X_n\}$ in $\mathbb{R}^d$ is the average of the fraction of closed simplices containing p and the fraction of open simplices containing p, formed by d+1 points of S.

• Equivalently,

$$SD_{BRS}(S; p) = \rho(S; p) + \frac{1}{2}\,\sigma(S; p)$$

• $\rho(S; p)$ is the fraction of simplices with data points as vertices which contain p in their open interior.

• $\sigma(S; p)$ is the fraction of simplices with data points as vertices which contain p in their boundary.
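The revised definition needs only one extra counter on top of a brute-force count of closed simplices. The sketch below is an illustrative implementation for the plane under the assumption that no three data points are collinear; `brs_depth` and the square data set are hypothetical names and data, not the authors' code.

```python
from fractions import Fraction
from itertools import combinations


def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def brs_depth(points, p):
    """Average of the fractions of closed and of open triangles containing p."""
    triangles = list(combinations(points, 3))
    closed_hits = open_hits = 0
    for a, b, c in triangles:
        if orient(a, b, c) == 0:
            raise ValueError("this sketch assumes no three data points are collinear")
        signs = (orient(a, b, p), orient(b, c, p), orient(c, a, p))
        if min(signs) >= 0 or max(signs) <= 0:   # interior or boundary
            closed_hits += 1
        if min(signs) > 0 or max(signs) < 0:     # strictly interior
            open_hits += 1
    return Fraction(closed_hits + open_hits, 2 * len(triangles))


square = [(0, 0), (4, 0), (4, 4), (0, 4)]   # hypothetical data set
print(brs_depth(square, (1, 2)))            # 1/2: interior of a cell, same as the closed count
print(brs_depth(square, (2, 2)))            # 1/2: on the diagonals, average of the adjacent cells
print(brs_depth(square, (0, 0)))            # 3/8: a data point
```

Since every simplex that contains p in its open interior also contains it as a closed set, the returned value equals the interior fraction plus half the boundary fraction, matching the equivalent form above.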

Properties of the Revised Definition

• Reduces to the original definition for continuous distributions and for points lying in the interior of cells.
• Keeps ranking order of data points.
• Can be calculated using the existing algorithms, with slight modifications.
• Fixes Zuo and Serfling's counterexamples.
• The depth on the boundary of two cells is the average of the two adjacent cells.
• Invariant under dimensions change (P5) for the change from $\mathbb{R}^1$ to $\mathbb{R}^2$.

Invariance Under Dimension Change (P5)

• Degenerate simplices
  – Both points C and A (a point between B and C) lie within the open (degenerate) simplex BCD; think of it as a very thin triangle.
  – Both points B and D are vertices of the (degenerate) simplex BCD.

• For a point p, consider the ratio $\frac{SD_1(S; p)}{SD_2(S; p)}$, where $SD_k$ is the depth when S is evaluated as a data set in $\mathbb{R}^k$:
  – For both definitions, the ratio for a position (non-data point) is 2/3.
  – For Liu's definition, the ratio for a data point is not 2/3.
  – For the BRS definition, the ratio for a data point is 2/3.
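The 2/3 ratio can be checked on a small collinear data set. The sketch below represents collinear points by their coordinate along the line and follows the degenerate-simplex convention stated above: a degenerate triangle is treated as the segment between its two extreme vertices, with the middle vertex counted as interior. The function name `depth_on_line`, its parameters, and the data are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def depth_on_line(xs, p, k, revised):
    """Depth of p with respect to collinear data xs, using simplices on k points.

    k = 2 evaluates the data as a 1-D set (segments); k = 3 evaluates it as a
    2-D set of degenerate triangles. `revised` selects the BRS average of
    closed and open simplices; otherwise only closed simplices are counted (Liu).
    """
    simplices = list(combinations(xs, k))
    closed_hits = open_hits = 0
    for s in simplices:
        lo, hi = min(s), max(s)
        closed_hits += lo <= p <= hi
        open_hits += lo < p < hi
    if revised:
        return Fraction(closed_hits + open_hits, 2 * len(simplices))
    return Fraction(closed_hits, len(simplices))


xs = [0, 1, 2, 3, 4]                      # hypothetical collinear data set
for p in (Fraction(3, 2), 2):             # a position between data points, then a data point
    for revised in (False, True):
        ratio = depth_on_line(xs, p, 2, revised) / depth_on_line(xs, p, 3, revised)
        print(p, "BRS" if revised else "Liu", ratio)
```

For this data the position 3/2 gives 2/3 under both definitions, while the data point 2 gives 4/5 under Liu's definition and 2/3 under the revised one.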

Remaining Problems (P2 and P3)

[Figure: example data set with two points labeled with the depth values $\frac{587}{1120}$ and $\frac{355}{1120}$.]

Remaining Problems (Data Points)

• Data points are still overcounted, so there can still be discontinuities at data points. However, to fix the depth at data points, more features need to be considered.
  – Data points are inherently part of simplices (a point makes a triangle with every other pair of points, $\binom{n-1}{2}$ triangles) and edges are inherently part of simplices (the two endpoints of an edge make a triangle with every other vertex, $\binom{n-2}{1}$ triangles).
  – To retain invariance under dimensions change (P5), given a data set in $\mathbb{R}^b$ which lies on a d-flat, the depth of a point when the data set is evaluated as a d-dimensional data set should be a multiple of the depth when the data set is evaluated as a b-dimensional data set.

• Neither of the above ideas completely solves the problem, and it appears that the best solutions take into account the geometry of the entire data set.

Ongoing Work

• The current algorithm for finding the median (the deepest point) is $O(n^4)$ to walk an arrangement of $O(n^2)$ segments.
  – We can improve this algorithm by comparing simplicial depth and half-space depth.
  – We are further improving this by considering simplicial depth in the dual.

• The problems with data points are improved by generalizing this work to higher dimensions.

• To find the depth at all points, we are using local information to form an approximation for the depth measure.

References

• G. Aloupis, C. Cortes, F. Gomez, M. Soss, and G. Toussaint. Lower bounds for computing statistical depth. Computational Statistics & Data Analysis, 40(2):223-229, 2002.

• G. Aloupis, S. Langerman, M. Soss, and G. Toussaint. Algorithms for bivariate medians and a Fermat-Torricelli problem for lines. In Proc. 13th CCCG, pages 21-24, 2001.

• M. Burr, E. Rafalin, and D. L. Souvaine. Simplicial depth: An improved definition, analysis, and efficiency for the sample case. Technical report 2003-28, DIMACS, 2003.

• A. Y. Cheng and M. Ouyang. On algorithms for simplicial depth. In Proc. 13th CCCG, pages 53-56, 2001.

• J. Gil, W. Steiger, and A. Wigderson. Geometric medians. Discrete Math., 108(1-3):37-51, 1992. Topological, algebraical and combinatorial structures. Frolík's memorial volume.

• S. Khuller and J. S. B. Mitchell. On a triangle counting problem. Inform. Process. Lett., 33(6):319-321, 1990.

• R. Liu. On a notion of data depth based on random simplices. Ann. Statist., 18:405-414, 1990.

• Y. Zuo and R. Serfling. General notions of statistical depth function. Ann. Statist., 28(2):461-482, 2000.