
Multivariate and functional classification using depth and distance

Mia Hubert, Peter J. Rousseeuw, Pieter Segaert

April 7, 2015

Abstract

We construct classifiers for multivariate and functional data, aiming to combine affine invariance, robustness, and computational feasibility. The recent approach of Li et al. (2012) is affine invariant but performs poorly with depth functions that become zero outside the convex hull of the data. For this purpose the bagdistance (bd) is proposed, based on halfspace depth. It satisfies most of the properties of a norm but is able to reflect asymmetry. Rather than transforming the data to their depths we propose the DistSpace transform, based on bd or a measure of outlyingness, and followed by k-nearest neighbor (kNN) classification of the transformed data points. This combines affine invariance with the simplicity, general applicability and robustness of kNN. The proposal is compared with other methods in experiments with real and simulated data.

1 Introduction

Supervised classification of multivariate data is a common statistical problem. One is given a training set of observations and their membership to certain groups (classes). Based on this information, one must assign new observations to these groups. Examples of classification rules include, but are not limited to, linear and quadratic discriminant analysis, k-nearest neighbors (kNN), support vector machines, and decision trees. For an overview see e.g. Hastie et al. (2009).

Nowadays statisticians are well aware of the potential effects of outliers on data analysis, and there is an extensive literature on detecting outliers and developing methods that are robust to them, see e.g. Rousseeuw and Leroy (1987) and Maronna et al. (2006). Outliers can be caused by recording errors or typing mistakes. But outliers may also be valid observations that are sampled from a different population. Moreover, in supervised classification some observations in the training set may have been mislabeled, i.e. attributed to the wrong group.

Many of the classical and robust classification methods rely on distributional assumptions such as multivariate normality or elliptical symmetry. Most robust approaches that can deal with skewed data make use of the concept of depth. The first type of depth was the halfspace depth of Tukey (1975), which measures the centrality of a point relative to a multivariate sample. Since then other depth functions have been introduced, including simplicial depth (Liu, 1990) and projection depth (Zuo and Serfling, 2000a).

Several authors have used depth in the context of classification. Christmann and Rousseeuw (2001) and Christmann et al. (2002) applied regression depth (Rousseeuw and Hubert, 1999). The maximum depth classification rule of Liu (1990) was studied by Ghosh and Chaudhuri (2005) and extended by Li et al. (2012). Dutta and Ghosh (2011) used projection depth.

In this paper we will present a novel technique called classification in distance space. It aims to provide a fully non-parametric tool for the robust supervised classification of possibly skewed multivariate data. In Sections 2 and 3 we will describe the key concepts needed for our construction. Section 4 discusses some existing multivariate classifiers and introduces our approach. A thorough simulation study for multivariate data is performed in Section 5. From Section 6 onwards we will focus our attention on the increasingly important framework of functional data, the analysis of which is a rapidly growing field. We will start with a general description, and then extend our work on multivariate classifiers to functional classifiers.

2 Multivariate depth and distance measures

2.1 Halfspace depth

If Y is a random variable on R^p with distribution P_Y, then the halfspace depth of any point x ∈ R^p relative to P_Y is defined as the minimal probability mass contained in a closed halfspace with boundary through x:
\[
\mathrm{HD}(x; P_Y) = \inf_{\|v\|=1} P_Y\{v'Y \geq v'x\}.
\]

Halfspace depth satisfies the requirements of a statistical depth function as formulated by Zuo and Serfling (2000a): it is affine invariant, attains its maximum value at the center of symmetry if there is one, is monotone decreasing along rays emanating from the center, and vanishes at infinity.

For any statistical depth function D and for any α ∈ [0, 1] the α-depth region D_α is the set of points whose depth is at least α:
\[
D_\alpha = \{x \in \mathbb{R}^p : D(x; P_Y) \geq \alpha\}. \tag{1}
\]
The boundary of D_α is known as the α-depth contour. The halfspace depth regions are closed, convex, and nested for increasing α. The halfspace median (or Tukey median) is defined as the center of gravity of the smallest non-empty depth region, i.e. the region containing the points with maximal halfspace depth. The finite-sample definitions of the halfspace depth, the Tukey median and the depth regions are obtained by replacing P_Y by the empirical probability distribution P_n.
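To make the finite-sample definition concrete, here is a minimal sketch in Python (the paper itself does not provide code; the function name and defaults are illustrative). It approximates the halfspace depth of a point by scanning a finite number of random unit directions, so it can only overestimate the exact minimum.

```python
import numpy as np

def halfspace_depth(x, Y, n_dir=1000, seed=0):
    """Approximate finite-sample halfspace depth of x (length p) relative to
    the sample Y (n x p): the smallest fraction of points lying in a closed
    halfspace with boundary through x, minimized over random unit directions."""
    rng = np.random.default_rng(seed)
    x, Y = np.asarray(x, float), np.asarray(Y, float)
    V = rng.standard_normal((n_dir, Y.shape[1]))
    V /= np.linalg.norm(V, axis=1, keepdims=True)      # unit directions
    # for each direction v, the fraction of the sample with v'y >= v'x
    fractions = (Y @ V.T >= x @ V.T).mean(axis=0)
    return float(fractions.min())
```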

Properties of halfspace depth have been studied extensively. Donoho and Gasko (1992) derived many finite-sample properties, including the breakdown value of the Tukey median, whereas Romanazzi (2001) derived the influence function of the halfspace depth. Massé and Theodorescu (1994) and Rousseeuw and Ruts (1999) studied several properties of the depth function and its contours at the population level. He and Wang (1997) and Zuo and Serfling (2000b) studied the convergence of the depth regions and the contours. Continuity of the depth contours and the Tukey median was investigated by Mizera and Volauf (2002).

To compute the halfspace depth, several affine invariant algorithms have been developed. Rousseeuw and Ruts (1996) and Rousseeuw and Struyf (1998) provided exact algorithms in two and three dimensions and an approximate algorithm in higher dimensions. Algorithms to compute the halfspace median have been developed by Rousseeuw and Ruts (1998) and Struyf and Rousseeuw (2000). To compute the depth contours the algorithm of Ruts and Rousseeuw (1996) can be used in the bivariate setting, whereas the algorithms constructed by Hallin et al. (2010) and Paindaveine and Siman (2012) are applicable to at least p = 5. Recently Liu et al. (2014) were able to go up to p = 9.

The concept of halfspace depth has led to alternative descriptions of data sets. For example Struyf and Rousseeuw (1999) showed that the halfspace depth function of a sample completely characterizes its empirical distribution, and this was later extended to the population case (Kong and Zuo, 2010). Rousseeuw and Struyf (2004) developed a test for angular symmetry of multivariate data based on halfspace depth.

2.2 The bagplot

The bagplot of Rousseeuw et al. (1999) generalizes the univariate boxplot to bivariate data, as illustrated in Figure 1. The dark-colored bag is the smallest depth region with at least 50% probability mass, i.e. B = D_{α̃} such that P_Y(D_{α̃}) ≥ 0.5 and P_Y(D_α) < 0.5 for all α > α̃. Inside the bag we see the halfspace median. The fence, which itself is never drawn, is obtained by inflating the bag by a factor 3 relative to the median, and the data points outside of it are flagged as outliers and plotted as stars. The light-colored loop is the convex hull of the data points inside the fence. The bagplot has the nice property not to depend on any symmetry assumption: it works just as well for skewed data. That is, the bag need not be elliptically shaped and the median need not be positioned in the middle of the bag. In Figure 1 we see a moderate deviation from symmetry.

2.3 The bagdistance

We propose to compute a statistical distance of a multivariate point x ∈ R^p to P_Y, based on halfspace depth. This distance uses both the center and the dispersion of P_Y. To account for the dispersion it uses the bag B defined above. Next, c(x) := c_x is defined as the intersection of the boundary of B and the ray from the halfspace median θ through x. The bagdistance of x to Y is then given by the ratio between the Euclidean distance of x to the halfspace median and the Euclidean distance of c_x to the halfspace median:
\[
\mathrm{bd}(x; P_Y) = \begin{cases} 0 & \text{if } x = \theta \\ \|x - \theta\| \, / \, \|c_x - \theta\| & \text{elsewhere.} \end{cases} \tag{2}
\]


Figure 1: Bagplot of a bivariate data set (bagplot based on halfspace depth).

The denominator in (2) accounts for the dispersion of P_Y in the direction of x. Note that the bagdistance does not assume symmetry and is affine invariant.

Figure 2: Illustration of the bagdistance between an arbitrary point and a sample.

The finite-sample definition is similar and is illustrated in Figure 2. The bag is shown in gray. For two points x1 and x2 their Euclidean distance to the halfspace median (red diamond) is marked with dark blue lines, whereas the orange lines correspond to the denominator of (2) and reflect how these distances will be scaled. Whereas the lengths of the blue lines are the same, the bagdistance of x1 is 2.01 and that of x2 is only 0.63. These values reflect the position of the points relative to the sample, one lying somewhat away from the most central half of the data and the other lying well within the central part. Note that the bagdistance is implicitly used in the bagplot, as the fence consists of the points whose bagdistance is at most 3.

We will now provide some of the properties of the bagdistance. We define a generalized norm as a function g : R^p → [0, ∞[ such that g(0) = 0 and g(x) ≠ 0 for x ≠ 0, which satisfies g(γx) = γ g(x) for all x and all γ > 0. In particular, for a Gaussian distribution N(0, Σ) with positive definite Σ it holds that
\[
g(x) = \sqrt{x'\Sigma^{-1}x} \tag{3}
\]
is a generalized norm (and even a norm).

Now suppose we have a compact set B which is star-shaped about zero, i.e. for all x ∈ B and 0 ≤ γ ≤ 1 it holds that γx ∈ B. For every x ≠ 0 we then construct the point c_x as the intersection between the boundary of B and the ray emanating from 0 in the direction of x. Let us assume that 0 is in the interior of B, that is, there exists ε > 0 such that the ball B(0, ε) ⊂ B. Then ‖c_x‖ > 0 whenever x ≠ 0. Now define
\[
g(x) = \begin{cases} 0 & \text{if } x = 0 \\ \|x\| / \|c_x\| & \text{otherwise.} \end{cases} \tag{4}
\]
Note that we do not really need the Euclidean norm, as we can equivalently define g(x) as the γ > 0 such that γ^{-1} x lies on the boundary of B. We can verify that g(·) is a generalized norm, which need not be a continuous function.

Theorem 1. If the set B is convex and compact and 0 ∈ int(B) then the function g defined in (4) is a convex function and hence continuous.

Proof. We need to show that
\[
g(\lambda x + (1-\lambda) y) \leq \lambda\, g(x) + (1-\lambda)\, g(y)
\]
for any x, y ∈ R^p and 0 ≤ λ ≤ 1. In case {0, x, y} are collinear, the function g restricted to this line is 0 in the origin and goes up linearly in both directions (possibly with different slopes), so it is convex. If {0, x, y} are not collinear they form a triangle. Note that we can write x = g(x) c_x and y = g(y) c_y, and we will denote z := λx + (1-λ)y. We can verify that z̃ := (λ g(x) + (1-λ) g(y))^{-1} z is a convex combination of c_x and c_y. By compactness of B we know that c_x, c_y ∈ B, and from convexity of B it then follows that z̃ ∈ B. Therefore
\[
\|c_z\| = \|c_{\tilde z}\| \geq \|\tilde z\|
\]
so that finally
\[
g(z) = \frac{\|z\|}{\|c_z\|} \leq \frac{\|z\|}{\|\tilde z\|} = \lambda\, g(x) + (1-\lambda)\, g(y) .
\]


It follows that g satisfies the triangle inequality since
\[
g(x+y) = 2\, g\!\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \leq 2 \cdot \tfrac{1}{2}\, g(x) + 2 \cdot \tfrac{1}{2}\, g(y) = g(x) + g(y) .
\]
Therefore g (and thus the bagdistance) satisfies the conditions
(i) g(x) ≥ 0 for all x ∈ R^p
(ii) g(x) = 0 implies x = 0
(iii) g(γx) = γ g(x) for all x ∈ R^p and γ > 0
(iv) g(x + y) ≤ g(x) + g(y) for all x, y ∈ R^p.

This is almost a norm; in fact, it becomes a norm if we were to add
\[
g(-x) = g(x) \quad \text{for all } x \in \mathbb{R}^p .
\]
The generalization makes it possible for the bagdistance to reflect asymmetric dispersion. (We could easily turn it into a norm by computing h(x) = (g(x) + g(−x))/2 but then we would lose that ability.)

Also note that the function g defined in (4) does generalize the Mahalanobis distance in (3), as can be seen by taking B = {x ; x'Σ^{-1}x ≤ 1}, which implies c_x = (x'Σ^{-1}x)^{-1/2} x for all x ≠ 0, so that g(x) = ‖x‖ / ((x'Σ^{-1}x)^{-1/2} ‖x‖) = √(x'Σ^{-1}x).

Finally note that Theorem 1 holds whenever B is a convex set. Instead of Tukey depth regions we could also use projection depth or the depth function in Section 3.

In the univariate case, the compact convex set B in Theorem 1 becomes a closed interval which we can denote by B = [−1/b, 1/a] with a, b > 0, so that
\[
g(x) = a\, x^{+} + b\, x^{-} .
\]
In linear regression the minimization of ∑_{i=1}^{n} g(r_i) yields the a/(a+b) regression quantile of Koenker and Bassett (1978).

It is straightforward to extend Theorem 1 to a nonzero center by subtracting the center first.

To compute the bagdistance of a point x with respect to a p-variate sample we can first compute the bag and then the intersection point c_x. In low dimensions computing the bag is feasible, and it is worth the effort if the bagdistance needs to be computed for many points. In higher dimensions computing the bag is harder, and then a simpler and faster algorithm is to search for the multivariate point c* on the ray from θ through x such that
\[
\mathrm{HD}(c^*; P_n) = \operatorname{median}_i \{\mathrm{HD}(y_i; P_n)\} ,
\]
where y_i are the data points. Since HD is monotone decreasing on the ray this can be done fairly fast, e.g. by means of the bisection algorithm.
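A rough sketch of this ray search follows, reusing the approximate halfspace_depth() helper sketched in Section 2.1; the center θ is here taken as the deepest data point rather than the exact Tukey median, and all names and tolerances are illustrative.

```python
import numpy as np

def bagdistance(x, Y, n_dir=1000, tol=1e-3):
    """Approximate bagdistance of x to the sample Y via bisection along the
    ray from the center theta through x (sketch; assumes halfspace_depth)."""
    x, Y = np.asarray(x, float), np.asarray(Y, float)
    depths = np.array([halfspace_depth(y, Y, n_dir) for y in Y])
    theta = Y[np.argmax(depths)]           # crude stand-in for the Tukey median
    target = np.median(depths)             # depth level defining the bag
    if np.allclose(x, theta):
        return 0.0
    u = (x - theta) / np.linalg.norm(x - theta)
    lo, hi = 0.0, 1.0
    while halfspace_depth(theta + hi * u, Y, n_dir) > target:
        lo, hi = hi, 2.0 * hi              # expand until we leave the bag
    while hi - lo > tol:                   # depth decreases monotonically on the ray
        mid = 0.5 * (lo + hi)
        if halfspace_depth(theta + mid * u, Y, n_dir) > target:
            lo = mid
        else:
            hi = mid
    c_star = theta + 0.5 * (lo + hi) * u   # approximate boundary point of the bag
    return np.linalg.norm(x - theta) / np.linalg.norm(c_star - theta)
```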

3 Skew-adjusted projection depth

Since the introduction of halfspace depth various other affine invariant depth functions have been defined (for an overview see e.g. Mosler (2013)), among which projection depth (Zuo, 2003), which is essentially the inverse of the Stahel-Donoho outlyingness (SDO). The SDO (Stahel, 1981; Donoho, 1982) is based on the geometric insight that a multivariate outlier should be outlying in at least one direction. The idea is to project the data onto many lines and to use a robust univariate measure of outlyingness on the projections. The population SDO of an arbitrary point x with respect to a random variable Y with distribution P_Y is defined as
\[
\mathrm{SDO}(x; P_Y) = \sup_{\|v\|=1} \frac{|\, v'x - \mathrm{median}(v'Y) \,|}{\mathrm{MAD}(v'Y)}
\]
from which the projection depth is derived:
\[
\mathrm{PD}(x; P_Y) = \frac{1}{1 + \mathrm{SDO}(x; P_Y)}. \tag{5}
\]

Since the SDO has an absolute deviation in the numerator and uses the MAD in its denominator it is best suited for symmetric distributions. For asymmetric distributions Brys et al. (2005) proposed the adjusted outlyingness (AO) in the context of robust independent component analysis. The AO uses the medcouple (MC) of Brys et al. (2004) as a robust measure of skewness. The medcouple of a univariate dataset Z = {z_1, . . . , z_n} is defined as
\[
\mathrm{MC}(z_1,\ldots,z_n) = \operatorname{median}_{i,j} \frac{(z_j - \operatorname{median}_k z_k) - (\operatorname{median}_k z_k - z_i)}{z_j - z_i} \tag{6}
\]
where i and j have to satisfy z_i ≤ median_k(z_k) ≤ z_j and z_i ≠ z_j. Note that −1 ≤ MC ≤ 1. For symmetric distributions MC = 0, whereas MC > 0 indicates right skewness and MC < 0 indicates left skewness.
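A naive O(n²) sketch of formula (6) follows (fast algorithms exist, and the special kernel for observations tied with the median is ignored here):

```python
import numpy as np

def medcouple(z):
    """Naive medcouple of formula (6): median of the kernel over all pairs
    (z_i, z_j) with z_i <= med <= z_j and z_i != z_j."""
    z = np.asarray(z, float)
    med = np.median(z)
    lower = z[z <= med]
    upper = z[z >= med]
    h = [((zj - med) - (med - zi)) / (zj - zi)
         for zi in lower for zj in upper if zj != zi]
    return float(np.median(h))
```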

The adjusted outlyingness AO is defined as
\[
\mathrm{AO}(x; P_Y) = \sup_{\|v\|=1} \mathrm{AO}_1(v'x; P_{v'Y})
\]
where the univariate adjusted outlyingness AO_1 is given by
\[
\mathrm{AO}_1(z; Z) = \begin{cases} \dfrac{z - \mathrm{median}(Z)}{w_2(Z) - \mathrm{median}(Z)} & \text{if } z > \mathrm{median}(Z) \\[2ex] \dfrac{\mathrm{median}(Z) - z}{\mathrm{median}(Z) - w_1(Z)} & \text{if } z \leq \mathrm{median}(Z) . \end{cases} \tag{7}
\]

Here
\[
\begin{aligned}
w_1(Z) &= Q_1(Z) - 1.5\, e^{-4\,\mathrm{MC}(Z)}\, \mathrm{IQR}(Z) \\
w_2(Z) &= Q_3(Z) + 1.5\, e^{+3\,\mathrm{MC}(Z)}\, \mathrm{IQR}(Z)
\end{aligned}
\]
if MC(Z) ≥ 0, where Q_1(Z) and Q_3(Z) denote the first and third quartile of Z, and IQR(Z) = Q_3(Z) − Q_1(Z). If MC(Z) < 0 we replace (z, Z) by (−z, −Z). The denominator of (7) corresponds to the fence of the univariate adjusted boxplot proposed by Hubert and Vandervieren (2008). Other applications of the AO were studied by Hubert and Van der Veeken (2008, 2010).
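A direct sketch of the univariate adjusted outlyingness (7), reusing the medcouple() sketch above; all names are illustrative:

```python
import numpy as np

def adjusted_outlyingness_1d(z, Z):
    """Univariate adjusted outlyingness AO1(z; Z) of formula (7)."""
    Z = np.asarray(Z, float)
    mc = medcouple(Z)
    if mc < 0:                              # mirror the data so that MC >= 0
        return adjusted_outlyingness_1d(-z, -Z)
    med = np.median(Z)
    q1, q3 = np.percentile(Z, [25, 75])
    iqr = q3 - q1
    w1 = q1 - 1.5 * np.exp(-4.0 * mc) * iqr
    w2 = q3 + 1.5 * np.exp(3.0 * mc) * iqr
    if z > med:
        return (z - med) / (w2 - med)
    return (med - z) / (med - w1)
```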

The skew-adjusted projection depth (SPD) is now defined as
\[
\mathrm{SPD}(x; P_Y) = \frac{1}{1 + \mathrm{AO}(x; P_Y)}. \tag{8}
\]
To compute the finite-sample SPD we have to rely on approximate algorithms as it is infeasible to consider all directions v. A convenient affine invariant procedure is obtained by considering directions v which are orthogonal to an affine hyperplane through p + 1 randomly drawn data points. In our implementation we use 250p directions.
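The following sketch computes the finite-sample AO and SPD over a user-supplied matrix of unit directions; the affine invariant direction scheme described above is not reproduced here, so any set of unit vectors can be passed in. It reuses the adjusted_outlyingness_1d() sketch above.

```python
import numpy as np

def skew_adjusted_projection_depth(x, Y, directions):
    """Finite-sample SPD of formula (8): project x and the sample Y on each
    direction, take the largest univariate adjusted outlyingness, and invert."""
    x, Y = np.asarray(x, float), np.asarray(Y, float)
    ao = max(adjusted_outlyingness_1d(float(v @ x), Y @ v) for v in directions)
    return 1.0 / (1.0 + ao)
```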

4 Multivariate classifiers

4.1 Existing methods

One of the oldest nonparametric classifiers is the k-nearest neighbor (kNN) method introduced by Fix and Hodges (1951). For each new observation the method looks up the k training data points closest to it (typically in Euclidean distance), and then assigns it to the most prevalent group among those neighbors. The value of k is typically chosen by cross-validation to minimize the misclassification rate.

Liu (1990) proposed to assign a new observation to the group in which it has the highest depth. This maxDepth rule is simple and can be applied to more than 2 groups. On the other hand it often yields ties when the depth function is identically zero on large domains, as is the case with halfspace depth and simplicial depth. Dutta and Ghosh (2011) avoid this problem by using projection depth instead.

To improve on the maxDepth rule, Li et al. (2012) introduced the DepthDepth classifier as follows. Assume that there are two groups, and denote the empirical distributions of the training groups as P_1 and P_2. Then transform any data point x ∈ R^p to the bivariate point
\[
\bigl(\mathrm{depth}(x, P_1),\ \mathrm{depth}(x, P_2)\bigr) \tag{9}
\]
where depth is a statistical depth function. These bivariate points form the so-called depth-depth plot, in which the two groups of training points are colored differently. The classification is then performed on this plot. The maxDepth rule corresponds to separating according to the 45 degree line through the origin, but in general Li et al. (2012) calculate the best separating polynomial. Next, they assign a new observation to group 1 if it lands above the polynomial, and to group 2 otherwise. Some disadvantages of the depth-depth rule are the computational complexity of finding the best separating polynomial and the need for majority voting when there are more than two groups. Lange et al. (2014) carry out a depth transform followed by kNN classification.

4.2 Classification in distance space

It has been our experience that distances can be very useful in classification, but we do not want to give up the affine invariance that depth enjoys. Therefore, we propose to use the bagdistance of Section 2.3 for this purpose, or alternatively the adjusted outlyingness of Section 3. Both are affine invariant, robust against outliers in the training data, and suitable also for skewed data.

Suppose that there are G groups, where G may be higher than 2. Let P_g represent the empirical distribution of the training data from group g = 1, . . . , G. Instead of the depth transform (9) we now transform the point x ∈ R^p to the G-variate point
\[
\bigl(\mathrm{dist}(x, P_1), \ldots, \mathrm{dist}(x, P_G)\bigr) \tag{10}
\]
where dist(x, P_g) is a generalized distance or an outlyingness measure of the point x to the g-th training sample. Note that the dimension G may be smaller, equal, or larger than the original dimension p. After the distance transform (10) any multivariate classifier may be applied, such as linear or quadratic discriminant analysis. The simplest version is of course minDist, which just assigns x to the group with smallest coordinate in (10). However, our default choice is to apply kNN to the transformed points. This combines the simplicity and robustness of kNN with the affine invariance offered by the transformation. Also note that we never need to resort to majority voting. In simulations (Section 5) we have seen that the proposed DistSpace method (i.e. the distance transform (10) followed by kNN) works quite well.
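The following sketch illustrates the DistSpace idea: map every observation to its vector of distances (10) to the G training groups and run kNN in that space. It relies on scikit-learn's kNN and on the bagdistance sketch of Section 2.3; the function names and the default k = 5 are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def dist_transform(X, groups, dist):
    """Distance transform (10): one coordinate per training group."""
    return np.array([[dist(x, g) for g in groups] for x in X])

def fit_distspace(groups, labels, k=5, dist=bagdistance):
    """Train DistSpace on a list of group samples; returns a predict function."""
    X_train = np.vstack(groups)
    y_train = np.concatenate([np.full(len(g), lab)
                              for g, lab in zip(groups, labels)])
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(dist_transform(X_train, groups, dist), y_train)
    return lambda X_new: knn.predict(dist_transform(X_new, groups, dist))
```

The minDist rule corresponds to taking the argmin of each row of the transform instead of running kNN, and plugging in an outlyingness measure such as SDO or AO gives the other DistSpace variants.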

We now illustrate the distance transformation on a real world example, available from the UCI Machine Learning Repository (Bache and Lichman, 2013). The data originated from an authentication procedure of banknotes. Photographs of 762 genuine and 610 forged banknotes were processed using wavelet transformations, and four features were extracted. These are the 4 coordinates shown in the scatterplot matrix in Figure 3.

Note that G = 2. Using the bagdistance, the distance space of these data is shown in Figure 4. It shows that forged and authentic banknotes are well-separated and that the authentic banknotes form a tight cluster compared to that of the forged ones. The halfspace depth-depth plot of the banknote data is shown in Figure 5. It immediately shows that the halfspace depth of points from one group with respect to the other group is zero. This indicates that the convex hulls of both groups do not intersect, which one could not tell from Figure 3. However, this does not help when one has to classify a new data point lying outside both hulls.

5 Simulation

To evaluate the various classifiers we perform a simulation. In each run the training set contains 50 observations per group, and the test set contains 500 observations. To evaluate performance we calculate the average misclassification percentage
\[
\sum_{g=1}^{G} e_g\, n_g / N
\]
with e_g the percentage of misclassified observations of group g in the test set, n_g the number of observations of group g in the training set, and N the total size of the training set. This weights the misclassification percentages in the test set according to the prior probabilities. This procedure is repeated 2000 times for each setting.
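For reference, this weighted misclassification percentage is a trivial computation:

```python
import numpy as np

def weighted_misclassification(e, n):
    """sum_g e_g * n_g / N, with e_g the test misclassification percentage of
    group g and n_g the training-set size of group g (N = sum of the n_g)."""
    e, n = np.asarray(e, float), np.asarray(n, float)
    return float(np.sum(e * n) / np.sum(n))
```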

Figure 3: Scatterplot matrix of the banknote data.

Setting 1: Trivariate normals (G = 3). We generate data from three different normal distributions. The first group C1 has parameters
\[
\mu_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
\quad \text{and} \quad
\Sigma_1 = \begin{pmatrix} 5 & 3 & 1 \\ 3 & 2 & 1 \\ 1 & 1 & 3 \end{pmatrix} .
\]
The second group is generated like C1 but we flip the sign of the second coordinate. The third group is again generated like C1 but then shifted by the vector (1, −2, −4).

Figure 4: The distance-distance plot of the banknote data (based on bagdistance).

Figure 5: The depth-depth plot of the banknote data (based on halfspace depth).

Setting 2: Bivariate normal and skewed (G = 2). We consider two bivariate distributions. The first group C1 is a draw from the standard normal distribution. The coordinates in the second group are independent draws from the exponential distribution with rate parameter 1:
\[
C_1 \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right)
\qquad
C_2 \sim \begin{bmatrix} \mathrm{Exp}(1) \\ \mathrm{Exp}(1) \end{bmatrix} .
\]

Setting 3: Gamma distributions (G = 3). The third setting consists of three groups of trivariate data. Consider the stochastic vector X of independent coordinates that are Γ-distributed with the following shape and rate parameters:
\[
X \sim \begin{pmatrix} \Gamma(3, 1) \\ \Gamma(2, 2) \\ \Gamma(2, 2) \end{pmatrix} .
\]
The three groups are then formed as separate draws from X, X + (1.5, 1.5, 1.5)′, and (X_1, −X_2, X_3) + (0, 0.5, 0)′.

Setting 4: Banknote data (G = 2). We first robustly standardize the data using the columnwise median and MAD. The training set is then a random subset of 500 data points, and the test set consists of the remaining 872 observations.

In the depth-based classification rules, halfspace depth (HD) is compared to projection depth (PD) and skew-adjusted projection depth (SPD). In the distance-based classifiers, the bagdistance (bd) is compared to the Stahel-Donoho outlyingness (SDO) and the adjusted outlyingness (AO).

Figure 6 shows the results for setting 1. The leftmost classifier is kNN, for which the vertical boxplot indicates a median misclassification percentage around 15%. Note that the second boxplot for that method (in orange) shows the situation where 5% of the observations in the training set were mislabeled, i.e. attributed to another group. The third boxplot (in blue) corresponds to misclassifying 10%.

Figure 6: Misclassification percentages in 2000 runs of setting 1 (trivariate normal). The results for clean data are shown in gray, for 5% mislabeled data in orange, and for 10% mislabeled data in blue. (Panels from left to right: kNN; MaxDepth with HD, PD, SPD; DepthDepth + poly with HD, PD, SPD; DepthDepth + kNN with HD, PD, SPD; MinDist with bd, SDO, AO; DistSpace with bd, SDO, AO.)

Figure 7: Misclassification percentages in 2000 runs of setting 2 (bivariate normal and skewed) using the same color code.

Figure 8: Misclassification percentages in 2000 runs of setting 3 (trivariate Γ) using the same color code.

Figure 9: Misclassification percentages in 2000 runs of setting 4 (banknote data) using the same color code.

The next classifier is maxDepth using the halfspace depth HD, which performed quite poorly. We see that maxDepth with PD and SPD do much better, because unlike HD their depths remain nonzero and informative in points outside the convex hull of their training group.

The third panel shows the DepthDepth classifier with separating polynomial, followed by majority voting since G > 2. The R code was kindly provided by Li et al. (2012). Here the HD-based version looks better, but this is because the code defaults to the kNN classification in the original data space for points it cannot classify.

The fourth panel shows what happens if the depth transform (9) is followed by the kNN classifier on the transformed data, for which we used our own code.

Next we see the minDist method, in which the bagdistance (bd) did best for uncontaminated data, but with mislabeled data its performance drops. The results of minDist with SDO and AO are identical by construction to those of maxDepth with PD and SPD.

Finally we see the results of classification in distance space, in which kNN is applied to the transformed data.

The results for setting 2 are in Figure 7. Here the DepthDepth rules clearly outperform maxDepth, and DistSpace similarly trumps minDist. As this setting contains a skewed group, bd and AO have a natural advantage over SDO which assumes symmetry. The best classification results are obtained by DepthDepth for SPD and DistSpace with bd, closely followed by kNN in the original space.

The results of setting 3 are similar to those of setting 1. Again HD performs poorly, and DepthDepth outperforms maxDepth just like DistSpace outperforms minDist. Also here kNN is hard to beat, but DistSpace does a bit better.

The results for the banknote data are in line with the previous ones. The lowest misclassification percentages are obtained by DepthDepth using PD and SPD, and by DistSpace.


6 Functional data

The analysis of functional data is a booming research area of statistics, see e.g. the books of Ramsay and Silverman (2005) and Ferraty and Vieu (2006). A functional data set typically consists of n curves observed at time points t_1, . . . , t_T. The value of a curve at a given time point is a p-variate vector of measurements. We call the functional dataset univariate or multivariate depending on p. For instance, the multi-lead ECG data set analyzed by Pigoli and Sangalli (2012) is multivariate.

When faced with classification of functional data, one approach is to consider it as multivariate data in which the measurement(s) at different time points are separate variables. This yields high-dimensional data with typically many highly correlated variables, which can be dealt with by penalization (Hastie et al., 1995). Another approach is to project such data onto a lower-dimensional subspace and to continue with the projected data. Baíllo et al. (2011) apply functional kNN classification, but other approaches can be used such as support vector machines (Rossi and Villa, 2006; Martin-Barragan et al., 2014). Recently Li and Yu (2008) proposed to use F-statistics to select small subintervals in the domain and to restrict the analysis to those. Other techniques include the weighted distance method of Alonso et al. (2012) and the componentwise approach of Delaigle et al. (2012).

The study of robust methods for functional data started only recently. Febrero-Bande et al. (2008) proposed an outlier detection procedure for univariate functional data based on depth and bootstrap. It turns out that outliers in functional data may manifest themselves in a variety of ways, see e.g. Hubert et al. (2015). So far, efforts to construct robust classification rules for functional data have mainly used the concept of depth: López-Pintado and Romo (2006) used the modified band depth, Cuevas et al. (2007) employed projection-based depth, and Cuesta-Albertos and Nieto-Reyes (2010) made use of random Tukey depth.

6.1 Functional depths and distances

Claeskens et al. (2014) proposed a type of multivariate functional depth (MFD) as follows. Consider a p-variate stochastic process Y = {Y(t), t ∈ U}, a statistical depth function D(·, ·) on R^p, and a weight function w on U integrating to 1. Then the MFD of a curve X on U with respect to the distribution P_Y is defined as
\[
\mathrm{MFD}(X; P_Y) = \int_U D\bigl(X(t); P_{Y(t)}\bigr)\, w(t)\, dt \tag{11}
\]
where P_{Y(t)} is the distribution of Y at time t. The weight function w(t) makes it possible to emphasize or downweight certain time regions, but in this paper it will be assumed constant. The functional median Θ(t) is defined as the curve with maximal MFD. Properties of the MFD may be found in Claeskens et al. (2014), with emphasis on the case where D(·, ·) is the halfspace depth. For ease of notation and to draw quick parallels to the multivariate non-functional case, we will denote the MFD based on halfspace depth by fHD, and the MFD based on projection depth and skew-adjusted projection depth by fPD and fSPD.

Analogously, we can define the functional bagdistance (fbd) of a curve X to (the distribution of) a stochastic process Y as
\[
\mathrm{fbd}(X; P_Y) = \int_U \mathrm{bd}\bigl(X(t); P_{Y(t)}\bigr)\, dt . \tag{12}
\]
Similar extensions of the Stahel-Donoho outlyingness SDO and the adjusted outlyingness AO to the functional context are given by
\[
\mathrm{fSDO}(X; P_Y) = \int_U \mathrm{SDO}\bigl(X(t); P_{Y(t)}\bigr)\, dt
\qquad
\mathrm{fAO}(X; P_Y) = \int_U \mathrm{AO}\bigl(X(t); P_{Y(t)}\bigr)\, dt .
\]
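On a discretized time grid these integrals reduce to (weighted) sums of pointwise depths or distances. A minimal sketch, assuming curves stored as (T, p) arrays on a common grid and the multivariate bagdistance sketch of Section 2.3:

```python
import numpy as np

def functional_bagdistance(X, Y_curves, weights=None, bd=bagdistance):
    """Discretized fbd of (12): average the pointwise bagdistance over the
    time grid. X is a (T, p) curve; Y_curves is an (n, T, p) training array."""
    T = X.shape[0]
    w = np.full(T, 1.0 / T) if weights is None else np.asarray(weights, float)
    vals = np.array([bd(X[t], Y_curves[:, t, :]) for t in range(T)])
    return float(np.sum(w * vals))
```

The fSDO, fAO and MFD variants are obtained by plugging in the corresponding pointwise outlyingness or depth.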

6.2 Functional classifiers

The classifiers discussed in Section 4 are readily adapted to functional data. By simply plugging in the functional versions of the distances and depths, all procedures can be carried over. For the k-nearest neighbor method one typically uses the L2-distance:
\[
d_2(X_1, X_2) = \left( \int_U \|X_1(t) - X_2(t)\|^2\, dt \right)^{1/2}.
\]
The functional kNN method will be denoted as fkNN. It is simple but not affine invariant, and does not account for differences in variability between the G groups.
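A discretized version of this L2-distance for curves sampled on a common grid (a sketch; dt defaults to the spacing of an equispaced grid on [0, 1]):

```python
import numpy as np

def l2_curve_distance(X1, X2, dt=None):
    """Discretized L2-distance between two (T, p) curves on the same grid."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    dt = 1.0 / X1.shape[0] if dt is None else dt
    return float(np.sqrt(np.sum(np.linalg.norm(X1 - X2, axis=1) ** 2) * dt))
```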

Analogously we use the maxDepth and DepthDepth rules based on fHD, fPD, and fSPD, as well as the MinDist and DistSpace rules based on fbd, fSDO, and fAO. Mosler and Mozharovskyi (2014) already studied DepthDepth on functional data after applying a dimension-reduction technique.

7 Simulation with functional data

Again we consider three different settings. To simulate the noise one typically observes in real life data, an error term ε(t) is added which is a Gaussian process with zero mean and covariance function γ(s, t) = exp(−δ1 |t − s|^{δ2}). The functional observations have been sampled on 31 equally spaced points spanning the interval [0, 1]. Once again the training data consists of 50 observations of each group and the testing data consists of 500 observations of each group.

Setting 1: Pencil inside a hollow tube:
\[
f(t) = U_f \begin{bmatrix} \cos(V_f) \\ \sin(V_f) \end{bmatrix} + \varepsilon(t)
\quad \text{and} \quad
g(t) = U_g \begin{bmatrix} \cos(V_g) \\ \sin(V_g) \end{bmatrix} + \varepsilon(t) .
\]
The radii U_f and U_g are drawn from the uniform distributions on [0, 1] and [2.8, 3.2]. The angles V_f and V_g are chosen randomly from the uniform distribution on [0, 2π], and the noise parameters are δ1 = 1 and δ2 = 2.5.
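A sketch of how such curves can be generated; drawing the Gaussian process noise independently for each of the two coordinates is an assumption not spelled out in the text, and all names are illustrative.

```python
import numpy as np

def gp_noise(tt, delta1=1.0, delta2=2.5, rng=None):
    """One path of a zero-mean Gaussian process with covariance
    gamma(s, t) = exp(-delta1 * |t - s|**delta2) on the grid tt."""
    rng = np.random.default_rng(rng)
    K = np.exp(-delta1 * np.abs(tt[:, None] - tt[None, :]) ** delta2)
    return rng.multivariate_normal(np.zeros(len(tt)), K + 1e-10 * np.eye(len(tt)))

def setting1_curve(group, tt, rng=None):
    """One bivariate curve from functional setting 1 ('pencil in a tube')."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(0.0, 1.0) if group == 1 else rng.uniform(2.8, 3.2)
    v = rng.uniform(0.0, 2.0 * np.pi)
    base = u * np.array([np.cos(v), np.sin(v)])            # constant over time
    noise = np.column_stack([gp_noise(tt, rng=rng), gp_noise(tt, rng=rng)])
    return base + noise                                     # shape (T, 2)

# tt = np.linspace(0.0, 1.0, 31)   # 31 equally spaced time points, as in the text
```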

Setting 2: Straight lines with different slopes:
\[
f(t) = 4t + \varepsilon(t) \quad \text{and} \quad g(t) = 4.5t + \varepsilon(t) .
\]
We also include the derivatives of f and g, so the data are bivariate. Again δ1 = 1 and δ2 = 2.5. Similar settings have been considered by López-Pintado and Romo (2006) and Alonso et al. (2012).

Setting 3: Quadratic curves:
\[
f(t) = (t + U_f)^2 + \tfrac{1}{5}\varepsilon(t) \quad \text{and} \quad g(t) = t^2 + U_g + \tfrac{1}{5}\varepsilon(t)
\]
where U_f and U_g are random draws from the uniform distribution on [0, 1], δ1 = 0.25 and δ2 = 2. Again we obtain bivariate data by including the derivatives. Alonso et al. (2012) used a similar setting.

To study the robustness of the classification, the training data of each setting was contaminated in four ways:
• C1,α : a fixed fraction α of each training group is shifted by 5 in each coordinate of the original functions (not the derivatives if these are used).
• C2,α : a fraction α is shifted by 5 or −5 (randomly).
• C3,α : a fraction α is shifted by 5 or −5 (randomly) but only for 0 ≤ t ≤ T0 with random T0.
• C4,α : a fraction α is given the wrong group label.
Each combination of a setting and a contamination type was run 200 times. Many more settings and contaminations were tried, but those shown here are representative.

Figure 10: Misclassification percentages in functional setting 1, averaged over 200 runs each. Results for clean data are shown in gray, results for C1,0.10 contamination in orange, and for C4,0.10 contamination in blue.

In functional setting 1 the two groups are not convex and well-separated, in fact group 1 lies inside group 2 which is not convex. In this situation we do not expect our methods to work well, whereas kNN is local in nature which makes it well-suited, as seen in Figure 10. MaxDepth and MinDist basically do not work in this setting, as their misclassification rate of about 50% is no better than random allocation. On the other hand DepthDepth and DistSpace do surprisingly well when followed by kNN, the latter being able to pick up the local structure of the transformed data.

Figure 11: Misclassification percentages in functional setting 2, averaged over 200 runs each. Results for clean data are shown in gray, results for C3,0.10 contamination in orange, and for C4,0.10 contamination in blue.

The results for functional setting 2 in Figure 11 indicate that maxDepth and DepthDepth do reasonably well and are competitive with kNN except when using HD. Still, MinDist and DistSpace slightly outperform them. Overall the effect of the contamination is small, except for fSPD and fAO under C4,0.10.

Figure 12: Misclassification percentages in functional setting 3, averaged over 200 runs each. Results for clean data are shown in gray, results for C2,0.10 contamination in orange, and for C4,0.10 contamination in blue.

In functional setting 3 the DepthDepth versions outperform maxDepth and DistSpace outperforms MinDist, which suffers the most from contamination here.


We now turn our attention to some real data sets. The Tecator data set is a benchmark of functional data analysis. It consists of near infrared spectra of pure meat samples obtained on a Tecator Infratec Food and Feed analyzer. For each meat sample a 100-channel spectrum in the 850–1050 nm wavelength range was recorded. A total of 215 spectra are available, which are typically split into low-fat and high-fat meat samples. We combined the raw data with its first and second derivative, resulting in three-dimensional functional data shown in Figure 13.

Figure 13: The 3 coordinates of the Tecator data (absorbance, first derivative and second derivative, plotted against wavelength).

In our experiment the training data is a random subset of 120 functions and the remainder form the test set. Results are shown in Figure 14. Here fkNN did quite poorly, which can be explained by the different variabilities of the components, the second derivative (which happens to contain much useful information) being less variable than the first derivative, in its turn less variable than the original function. The affine invariant depth and distance transforms remedy this. The DepthDepth methods do well except with HD, whereas the DistSpace classifier performs well for all three distances. The average misclassification rate of DistSpace is 3.6% with fbd, 2.7% with fSDO, and 3.2% with fAO. For comparison, Rossi and Villa (2006) applied support vector machines to the second derivative yielding 3.38% misclassification. Alonso et al. (2012) report 2.02%, but their procedure is not affine invariant. Li and Yu (2008) obtained 1.09% from the second derivative, using a segmentation approach to select a subset of the domain where the two groups are the most separated.

Figure 14: Misclassification percentages in the Tecator data, averaged over 200 runs.

The writing dataset consists of 2858 character samples corresponding to the speed profile of the tip of a pen writing different letters, as captured on a WACOM tablet. The data came from the UCI Machine Learning Repository (Bache and Lichman, 2013). We added the x- and y-coordinates of the pen tip (obtained by integration) to the data, yielding p = 4 overall. We further processed the data by removing the first and last time points and by interpolating to give all curves the same time domain. Samples corresponding to the letters 'a', 'c', 'e', 'h' and 'm' were retained. This yields a five-group supervised classification problem of four-dimensional functional data. Figure 15 plots the x- and y-coordinate curves with the 5 groups shown in different colors.

For each letter the training set was a random subset of 80 functions. The outcome is in Figure 16. We do not have results for the original DepthDepth classifier with separating polynomials and majority voting due to computation time restrictions. MaxDepth and DepthDepth combined with fkNN perform well except for fHD, again due to the fact that HD is zero outside the convex hull. DistSpace outperforms MinDist, and works well with all three distances. The best result was obtained by DistSpace with fbd.

8 Conclusion

Many classification rules for multivariate or functional data, like kNN, often work well but can fail when the dispersion of the data depends strongly on the direction in which it is measured. The MaxDepth rule of Liu (1990) and its DepthDepth extension (Li et al., 2012) resolve this by their affine invariance, but perform poorly in combination with depth functions that become zero outside the convex hull of the data, like HD.

This is why we propose to use the bagdistance bd, which is based on HD and has properties very close to those of a norm but is able to reflect asymmetry. Rather than transforming the data to their depths we propose the DistSpace transform, based on bd or a measure of outlyingness such as SDO or AO.


Figure 15: Coordinates of the writing data. Each group has a different color.

After applying DepthDepth or DistSpace there are many possible ways to classify the transformed data. We found that the original separating polynomial method did not perform the best, and is time consuming due to the selection of the polynomial and the need for majority voting when there are more than 2 groups.

In our experiments with real and simulated data we found that the best performing methods overall were DepthDepth and DistSpace with kNN applied to the transformed data. This approach combines affine invariance with the simplicity, lack of assumptions, and robustness of kNN, and works well for both multivariate and functional data.

References

Alonso, A., Casado, D., and Romo, J. (2012). Supervised classification for functional data: a weighted distance approach. Computational Statistics & Data Analysis, 56:2334–2346.

Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets.html


Figure 16: Misclassification percentages in the writing data, averaged over 200 runs.

Baíllo, A., Cuevas, A., and Cuesta-Albertos, J. A. (2011). Supervised classification for a family of Gaussian functional models. Scandinavian Journal of Statistics, 38(3):480–498.

Brys, G., Hubert, M., and Rousseeuw, P. J. (2005). A robustification of independent component analysis. Journal of Chemometrics, 19:364–375.

Brys, G., Hubert, M., and Struyf, A. (2004). A robust measure of skewness. Journal of Computational and Graphical Statistics, 13:996–1017.

Christmann, A., Fischer, P., and Joachims, T. (2002). Comparison between various regression depth methods and the support vector machine to approximate the minimum number of misclassifications. Computational Statistics, 17:273–287.

Christmann, A. and Rousseeuw, P. J. (2001). Measuring overlap in logistic regression. Computational Statistics & Data Analysis, 37:65–75.

Claeskens, G., Hubert, M., Slaets, L., and Vakili, K. (2014). Multivariate functional halfspace depth. Journal of the American Statistical Association, 109(505):411–423.

Cuesta-Albertos, J. A. and Nieto-Reyes, A. (2010). Functional classification and the random Tukey depth: Practical issues. In: Borgelt, C., Rodríguez, G. G., Trutschnig, W., Lubiano, M. A., Ángeles Gil, M., Grzegorzewski, P. and Hryniewicz, O. (eds.) Combining Soft Computing and Statistical Methods in Data Analysis. Springer, Berlin Heidelberg, pages 123–130.

Cuevas, A., Febrero, M., and Fraiman, R. (2007). Robust estimation and classification for functional data via projection-based depth notions. Computational Statistics, 22:481–496.

Delaigle, A., Hall, P., and Bathia, N. (2012). Componentwise classification and clustering of functional data. Biometrika, 99:299–313.


Donoho, D. (1982). Breakdown properties of multivariate location estimators. Ph.D. Qualifying paper, Dept. Statistics, Harvard University, Boston.

Donoho, D. and Gasko, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, 20(4):1803–1827.

Dutta, S. and Ghosh, A. (2011). On robust classification using projection depth. Annals of the Institute of Statistical Mathematics, 64:657–676.

Febrero-Bande, M., Galeano, P., and González-Manteiga, W. (2008). Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics, 19(4):331–345.

Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York.

Fix, E. and Hodges, J. L. (1951). Discriminatory analysis – nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas.

Ghosh, A. and Chaudhuri, P. (2005). On maximum depth and related classifiers. Scandinavian Journal of Statistics, 32(2):327–350.

Hallin, M., Paindaveine, D., and Siman, M. (2010). Multivariate quantiles and multiple-output regression quantiles: from L1 optimization to halfspace depth. The Annals of Statistics, 38(2):635–669.

Hastie, T., Buja, A., and Tibshirani, R. (1995). Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, New York, second edition.

He, X. and Wang, G. (1997). Convergence of depth contours for multivariate datasets. The Annals of Statistics, 25(2):495–504.

Hubert, M., Rousseeuw, P. J., and Segaert, P. (2015). Multivariate functional outlier detection. Statistical Methods and Applications, doi 10.1007/s10260-015-0297-8.

Hubert, M. and Van der Veeken, S. (2008). Outlier detection for skewed data. Journal of Chemometrics, 22:235–246.

Hubert, M. and Van der Veeken, S. (2010). Robust classification for skewed data. Advances in Data Analysis and Classification, 4:239–254.

Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52(12):5186–5201.

Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46:33–50.


Kong, L. and Zuo, Y. (2010). Smooth depth contours characterize the underlying distribution. Journal of Multivariate Analysis, 101:2222–2226.

Lange, T., Mosler, K., and Mozharovskyi, P. (2014). Fast nonparametric classification based on data depth. Statistical Papers, 55(1):49–69.

Li, B. and Yu, Q. (2008). Classification of functional data: A segmentation approach. Computational Statistics & Data Analysis, 52(10):4790–4800.

Li, J., Cuesta-Albertos, J., and Liu, R. (2012). DD-classifier: nonparametric classification procedure based on DD-plot. Journal of the American Statistical Association, 107:737–753.

Liu, R. (1990). On a notion of data depth based on random simplices. The Annals of Statistics, 18(1):405–414.

Liu, X., Mosler, K., and Mozharovskyi, P. (2014). Fast computation of Tukey trimmed regions in dimension p > 2. arXiv:1412.5122 [stat.CO].

López-Pintado, S. and Romo, J. (2006). Depth-based classification for functional data. In Data depth: Robust Multivariate Analysis, Computational Geometry and Applications, volume 72 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci., pages 103–119. Amer. Math. Soc., Providence, RI.

Maronna, R., Martin, D., and Yohai, V. (2006). Robust Statistics: Theory and Methods. Wiley, New York.

Martin-Barragan, B., Lillo, R., and Romo, J. (2014). Interpretable support vector machines for functional data. European Journal of Operational Research, 232(1):146–155.

Massé, J.-C. and Theodorescu, R. (1994). Halfplane trimming for bivariate distributions. Journal of Multivariate Analysis, 48(2):188–202.

Mizera, I. and Volauf, M. (2002). Continuity of halfspace depth contours and maximum depth estimators: diagnostics of depth-related methods. Journal of Multivariate Analysis, 83(2):365–388.

Mosler, K. (2013). Depth statistics. In Becker, C., Fried, R., and Kuhnt, S., editors, Robustness and Complex Data Structures, Festschrift in Honour of Ursula Gather, pages 17–34, Berlin, Springer.

Mosler, K. and Mozharovskyi, P. (2014). Fast DD-classification of functional data. arXiv:1403.1158v2.

Paindaveine, D. and Siman, M. (2012). Computing multiple-output regression quantile regions. Computational Statistics & Data Analysis, 56:840–853.


Pigoli, D. and Sangalli, L. (2012). Wavelets in functional data analysis: estimation of multidimensional curves and their derivatives. Computational Statistics & Data Analysis, 56(6):1482–1498.

Ramsay, J. and Silverman, B. (2005). Functional Data Analysis. Springer, New York, 2nd edition.

Romanazzi, M. (2001). Influence function of halfspace depth. Journal of Multivariate Analysis, 77:138–161.

Rossi, F. and Villa, N. (2006). Support vector machine for functional data classification. Neurocomputing, 69:730–742.

Rousseeuw, P. J. and Hubert, M. (1999). Regression depth. Journal of the American Statistical Association, 94:388–402.

Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression and Outlier Detection. Wiley-Interscience, New York.

Rousseeuw, P. J. and Ruts, I. (1996). Bivariate location depth. Applied Statistics, 45:516–526.

Rousseeuw, P. J. and Ruts, I. (1998). Constructing the bivariate Tukey median. Statistica Sinica, 8:827–839.

Rousseeuw, P. J. and Ruts, I. (1999). The depth function of a population distribution. Metrika, 49:213–244.

Rousseeuw, P. J., Ruts, I., and Tukey, J. (1999). The bagplot: a bivariate boxplot. The American Statistician, 53:382–387.

Rousseeuw, P. J. and Struyf, A. (1998). Computing location depth and regression depth in higher dimensions. Statistics and Computing, 8:193–203.

Rousseeuw, P. J. and Struyf, A. (2004). Characterizing angular symmetry and regression symmetry. Journal of Statistical Planning and Inference, 122:161–173.

Ruts, I. and Rousseeuw, P. J. (1996). Computing depth contours of bivariate point clouds. Computational Statistics & Data Analysis, 23:153–168.

Stahel, W. (1981). Robuste Schätzungen: infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. PhD thesis, ETH Zürich.

Struyf, A. and Rousseeuw, P. J. (1999). Halfspace depth and regression depth characterize the empirical distribution. Journal of Multivariate Analysis, 69(1):135–153.

Struyf, A. and Rousseeuw, P. J. (2000). High-dimensional computation of the deepest location. Computational Statistics & Data Analysis, 34(4):415–426.


Tukey, J. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Volume 2, pages 523–531, Vancouver.

Zuo, Y. (2003). Projection-based depth functions and associated medians. The Annals of Statistics, 31(5):1460–1490.

Zuo, Y. and Serfling, R. (2000a). General notions of statistical depth function. The Annals of Statistics, 28:461–482.

Zuo, Y. and Serfling, R. (2000b). Structural properties and convergence results for contours of sample statistical depth functions. The Annals of Statistics, 28(2):483–499.
