estimating the support of a high-dimensional...

29
ARTICLE Communicated by Vladimir Vapnik Estimating the Support of a High-Dimensional Distribution Bernhard Sch ¨ olkopf Microsoft Research Ltd, Cambridge CB2 3NH, U.K. John C. Platt Microsoft Research, Redmond, WA 98052, U.S.A John Shawe-Taylor Royal Holloway, University of London, Egham, Surrey TW20 OEX, U.K. Alex J. Smola Robert C. Williamson Department of Engineering, Australian National University, Canberra 0200, Australia Suppose you are given some data set drawn from an underlying probabil- ity distribution P and you want to estimate a simplesubset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori speci ed value between 0 and 1. We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement. The func- tional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coef - cients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabeled data. 1 Introduction During recent years, a new set of kernel techniques for supervised learn- ing has been developed (Vapnik, 1995; Sch¨ olkopf, Burges, & Smola, 1999). Speci cally, support vector (SV) algorithms for pattern recognition, regres- sion estimation, and solution of inverse problems have received consider- able attention. There have been a few attempts to transfer the idea of using kernels to compute inner products in feature spaces to the domain of unsupervised learning. The problems in that domain are, however, less precisely speci- Neural Computation 13, 1443–1471 (2001) c ° 2001 Massachusetts Institute of Technology

Upload: others

Post on 31-May-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

ARTICLE Communicated by Vladimir Vapnik

Estimating the Support of a High-Dimensional Distribution

Bernhard ScholkopfMicrosoft Research Ltd Cambridge CB2 3NH UK

John C PlattMicrosoft Research Redmond WA 98052 USA

John Shawe-TaylorRoyal Holloway University of London Egham Surrey TW20 OEX UK

Alex J SmolaRobert C WilliamsonDepartment of Engineering Australian National University Canberra 0200 Australia

Suppose you are given some data set drawn from an underlying probabil-ity distribution P and you want to estimate a ldquosimplerdquo subset S of inputspace such that the probability that a test point drawn from P lies outsideof S equals some a priori specied value between 0 and 1

We propose a method to approach this problem by trying to estimate afunction f that is positive on S and negative on the complement The func-tional form of f is given by a kernel expansion in terms of a potentiallysmall subset of the training datait is regularized by controlling the lengthof the weight vector in an associated feature space The expansion coef-cients are found by solving a quadratic programming problem which wedo by carrying out sequential optimization over pairs of input patternsWe also provide a theoretical analysis of the statistical performance ofour algorithm

The algorithm is a natural extension of the support vector algorithmto the case of unlabeled data

1 Introduction

During recent years a new set of kernel techniques for supervised learn-ing has been developed (Vapnik 1995 Scholkopf Burges amp Smola 1999)Specically support vector (SV) algorithms for pattern recognition regres-sion estimation and solution of inverse problems have received consider-able attention

There have been a few attempts to transfer the idea of using kernels tocompute inner products in feature spaces to the domain of unsupervisedlearning The problems in that domain are however less precisely speci-

Neural Computation 13 1443ndash1471 (2001) cdeg 2001 Massachusetts Institute of Technology

1444 Bernhard Scholkopf et al

ed Generally they can be characterized as estimating functions of the datawhich reveal something interesting about the underlying distributions Forinstance kernel principal component analysis (PCA) can be characterized ascomputing functions that on the training data produce unit variance outputswhile having minimum norm in feature space (Scholkopf Smola amp Muller1999) Another kernel-based unsupervised learning technique regularizedprincipal manifolds (Smola Mika Scholkopf amp Williamson in press) com-putes functions that give a mapping onto a lower-dimensional manifoldminimizing a regularized quantization error Clustering algorithms are fur-ther examples of unsupervised learning techniques that can be kernelized(Scholkopf Smola amp Muller 1999)

An extreme point of view is that unsupervised learning is about estimat-ing densities Clearly knowledge of the density of P would then allow usto solve whatever problem can be solved on the basis of the data

The work presented here addresses an easier problem it proposes analgorithm that computes a binary function that is supposed to capture re-gions in input space where the probability density lives (its support) thatis a function such that most of the data will live in the region where thefunction is nonzero (Scholkopf Williamson Smola Shawe-Taylor 1999) Indoing so it is in line with Vapnikrsquos principle never to solve a problem thatis more general than the one we actually need to solve Moreover it is alsoapplicable in cases where the density of the datarsquos distribution is not evenwell dened for example if there are singular components

After a review of some previous work in section 2 we propose SV al-gorithms for the considered problem section 4 gives details on the imple-mentation of the optimization procedure followed by theoretical resultscharacterizing the present approach In section 6 we apply the algorithmto articial as well as real-world data We conclude with a discussion

2 Previous Work

In order to describe some previous work it is convenient to introduce thefollowing denition of a (multidimensional) quantile function introducedby Einmal and Mason (1992) Let x1 x` be independently and identicallydistributed (iid) random variables in a set X with distribution P Let C be aclass of measurable subsets of X and let l be a real-valued function denedon C The quantile function with respect to (P l C ) is

U(a) D inffl(C) P(C) cedil a C 2 C g 0 lt a middot 1

In the special case where P is the empirical distribution (P`(C) D 1`

P`iD1

1C(xi )) we obtain the empirical quantile function We denote by C(a) andC`(a) the (not necessarily unique) C 2 C that attains the inmum (when itis achievable) The most common choice of l is Lebesgue measure in whichcase C(a) is the minimum volume C 2 C that contains at least a fraction a

Estimating the Support of a High-Dimensional Distribution 1445

of the probability mass Estimators of the form C`(a) are called minimumvolume estimators

Observe that for C being all Borel measurable sets C(1) is the supportof the density p corresponding to P assuming it exists (Note that C(1) iswell dened even when p does not exist) For smaller classes C C(1) is theminimum volume C 2 C containing the support of p

Turning to the case where a lt 1 it seems the rst work was reportedby Sager (1979) and then Hartigan (1987) who considered X D R2 with Cbeing the class of closed convex sets in X (They actually considered densitycontour clusters cf appendix A for a denition) Nolan (1991) consideredhigher dimensions with C being the class of ellipsoids Tsybakov (1997)has studied an estimator based on piecewise polynomial approximation ofC(a) and has shown it attains the asymptotically minimax rate for certainclasses of densities p Polonik (1997) has studied the estimation of C(a)by C`(a) He derived asymptotic rates of convergence in terms of variousmeasures of richness of C He considered both VC classes and classes witha log 2 -covering number with bracketing of order O

iexcl2 iexclrcent for r gt 0 More

information on minimum volume estimators can be found in that work andin appendix A

A number of applications have been suggested for these techniques Theyinclude problems in medical diagnosis (Tarassenko Hayton Cerneaz ampBrady 1995) marketing (Ben-David amp Lindenbaum 1997) condition moni-toring of machines (Devroyeamp Wise 1980) estimating manufacturing yields(Stoneking 1999) econometrics and generalized nonlinear principal curves(Tsybakov 1997Korostelev amp Tsybakov 1993) regression and spectral anal-ysis (Polonik 1997) tests for multimodality and clustering (Polonik 1995b)and others (Muller 1992) Most of this work in particular that with a the-oretical slant does not go all the way in devising practical algorithms thatwork on high-dimensional real-world-problems A notable exception to thisis the work of Tarassenko et al (1995)

Polonik (1995a) has shown how one can use estimators of C(a) to con-struct density estimators The point of doing this is that it allows one toencode a range of prior assumptions about the true density p that would beimpossible to do within the traditional density estimation framework Hehas shown asymptotic consistency and rates of convergence for densitiesbelonging to VC-classes or with a known rate of growth of metric entropywith bracketing

Let us conclude this section with a short discussion of how the workpresented here relates to the above This article describes an algorithm thatnds regions close to C(a) Our class C is dened implicitly via a kernel kas the set of half-spaces in an SV feature space We do not try to minimizethe volume of C in input space Instead we minimize an SV-style regular-izer that using a kernel controls the smoothness of the estimated functiondescribing C In terms of multidimensional quantiles our approach can bethought of as employing l(Cw) D kwk2 where Cw D fx fw(x) cedil rg Here

1446 Bernhard Scholkopf et al

(w r ) are a weight vector and an offset parameterizing a hyperplane in thefeature space associated with the kernel

The main contribution of this work is that we propose an algorithm thathas tractable computational complexity even in high-dimensional casesOur theory which uses tools very similar to those used by Polonik givesresults that we expect will be of more use in a nite sample size setting

3 Algorithms

We rst introduce terminology and notation conventions We consider train-ing data

x1 x` 2 X (31)

where ` 2 N is the number of observations and X is some set For simplicitywe think of it as a compact subset of RN Let W be a feature map X Fthat is a map into an inner product space F such that the inner product inthe image of W can be computed by evaluating some simple kernel (BoserGuyon amp Vapnik (1992) Vapnik (1995) Scholkopf Burges et al (1999))

k(x y) D (W (x) cent W (y) ) (32)

such as the gaussian kernel

k(x y) D eiexclkxiexclyk2c (33)

Indices i and j are understood to range over 1 ` (in compact notationi j 2 [ ]) Boldface Greek letters denote -dimensional vectors whose com-ponents are labeled using a normal typeface

In the remainder of this section we develop an algorithm that returnsa function f that takes the value C1 in a ldquosmallrdquo region capturing mostof the data points and iexcl1 elsewhere Our strategy is to map the data intothe feature space corresponding to the kernel and to separate them fromthe origin with maximum margin For a new point x the value f (x) isdetermined by evaluating which side of the hyperplane it falls on in featurespace Via the freedom to use different types of kernel functions this simplegeometric picture corresponds to a variety of nonlinear estimators in inputspace

To separate the data set from the origin we solve the following quadraticprogram

minw2Fraquo2R`r2R

12kwk2 C 1

ordm`

Pi ji iexcl r (34)

subject to (w cent W (xi) ) cedil r iexclji ji cedil 0 (35)

Here ordm 2 (0 1] is a parameter whose meaning will become clear later

Estimating the Support of a High-Dimensional Distribution 1447

Since nonzero slack variables ji are penalized in the objective functionwe can expect that if w and r solve this problem then the decision function

f (x) D sgn((w cent W (x) ) iexcl r ) (36)

will be positive for most examples xi contained in the training set1 whilethe SV type regularization term kwk will still be small The actual trade-offbetween these two goals is controlled by ordm

Using multipliers ai macri cedil 0 we introduce a Lagrangian

L(w raquo r reg macr) D12

kwk2 C1

ordm`

X

iji iexcl r

iexclX

iai ( (w cent W (xi) ) iexcl r C ji) iexcl

X

imacriji (37)

and set the derivatives with respect to the primal variables w raquo r equal tozero yielding

w DX

iaiW (xi) (38)

ai D1

ordm`iexcl macri middot

1ordm`

X

iai D 1 (39)

In equation 38 all patterns fxi i 2 [ ] ai gt 0g are called support vec-tors Together with equation 32 the SV expansion transforms the decisionfunction equation 36 into a kernel expansion

f (x) D sgn

sup3X

iaik(xi x) iexcl r

acute (310)

Substituting equation 38 and equation 39 into L (see equation 37) andusing equation 32 we obtain the dual problem

minreg

12

X

ij

aiajk(xi xj) subject to 0 middot ai middot1

ordm`

X

iai D 1 (311)

One can show that at the optimum the two inequality constraints equa-tion 35 become equalities if ai and macri are nonzero that is if 0 lt ai lt 1(ordm )Therefore we can recover r by exploiting that for any such ai the corre-sponding pattern xi satises

r D (w cent W (xi) ) DX

j

ajk(xj xi) (312)

1 We use the convention that sgn(z) equals 1 for z cedil 0 and iexcl1 otherwise

1448 Bernhard Scholkopf et al

Note that if ordm approaches 0 the upper boundaries on the Lagrange mul-tipliers tend to innity that is the second inequality constraint in equa-tion 311 becomes void The problem then resembles the correspondinghard margin algorithm since the penalization of errors becomes innite ascan be seen from the primal objective function (see equation 34) It is still afeasible problem since we have placed no restriction on the offset r so it canbecome a large negative number in order to satisfy equation 35 If we hadrequired r cedil 0 from the start we would have ended up with the constraintP

i ai cedil 1 instead of the corresponding equality constraint in equation 311and the multipliers ai could have diverged

It is instructive to compare equation 311 to a Parzen windows estimatorTo this end suppose we use a kernel that can be normalized as a density ininput space such as the gaussian (see equation 33) If we useordm D 1 then thetwo constraints only allow the solution a1 D cent cent cent D a` D 1 Thus the kernelexpansion in equation 310 reduces to a Parzen windows estimate of theunderlying density For ordm lt 1 the equality constraint in equation 311 stillensures that the decision function is a thresholded density however in thatcase the density will be represented only by a subset of training examples(the SVs)mdashthose that are important for the decision (see equation 310) tobe taken Section 5 will explain the precise meaning of ordm

To conclude this section we note that one can also use balls to describe thedata in feature space close in spirit to the algorithms of Scholkopf Burgesand Vapnik (1995) with hard boundaries and Tax and Duin (1999) withldquosoft marginsrdquo Again we try to put most of the data into a small ball bysolving for ordm 2 (0 1)

minR2Rraquo2R`c2F

R2 C1

ordm`

X

iji

subject to kW (xi ) iexcl ck2 middot R2 C ji ji cedil 0 for i 2 [ ] (313)

This leads to the dual

minreg

X

ij

aiajk(xi xj) iexclX

iaik(xi xi) (314)

subject to 0 middot ai middot1

ordm`

X

iai D 1 (315)

and the solution

c DX

iaiW (xi) (316)

corresponding to a decision function of the form

f (x) D sgn

0

R2 iexclX

ij

aiajk(xi xj) C 2X

iaik(xi x) iexcl k(x x)

1

A (317)

Estimating the Support of a High-Dimensional Distribution 1449

Similar to the above R2 is computed such that for any xi with 0 lt ai lt1(ordm ) the argument of the sgn is zero

For kernels k(x y) that depend on only x iexcl y k(x x) is constant In thiscase the equality constraint implies that the linear term in the dual targetfunction is constant and the problem 314 and 315 turns out to be equiv-alent to equation 311 It can be shown that the same holds true for thedecision function hence the two algorithms coincide in that case This isgeometrically plausible For constant k(x x) all mapped patterns lie on asphere in feature space Therefore nding the smallest sphere (containingthe points) really amounts to nding the smallest segment of the spherethat the data live on The segment however can be found in a straightfor-ward way by simply intersecting the data sphere with a hyperplane thehyperplane with maximum margin of separation to the origin will cut offthe smallest segment

4 Optimization

Section 3 formulated quadratic programs (QPs) for computing regions thatcapture a certain fraction of the data These constrained optimization prob-lems can be solved using an off-the-shelf QP package to compute the solu-tion They do however possess features that set them apart from genericQPs most notably the simplicity of the constraints In this section we de-scribe an algorithm that takes advantage of these features and empiricallyscales better to large data set sizes than a standard QP solver with time com-plexity of order O

iexcl 3cent(cf Platt 1999) The algorithm is a modied version

of SMO (sequential minimal optimization) an SV training algorithm origi-nally proposed for classication (Platt 1999) and subsequently adapted toregression estimation (Smola amp Scholkopf in press)

The strategy of SMO is to break up the constrained minimization ofequation 311 into the smallest optimization steps possible Due to the con-straint on the sum of the dual variables it is impossible to modify individualvariables separately without possibly violating the constraint We thereforeresort to optimizing over pairs of variables

41 Elementary Optimization Step For instance consider optimizingover a1 and a2 with all other variables xed Using the shorthand Kij Dk(xi xj) equation 311 then reduces to

mina1 a2

12

2X

i jD1

aiajKij C2X

iD1

aiCi C C (41)

1450 Bernhard Scholkopf et al

with Ci DP`

jD3 ajKij and C DP`

i jD3 aiajKij subject to

0 middot a1 a2 middot1

ordm`

2X

iD1

ai D D (42)

where D D 1 iexclP`

iD3 ai We discard C which is independent of a1 and a2 and eliminate a1 to

obtain

mina2

12

(Diexcl a2 )2K11 C (Diexcl a2 )a2K12 C12

a22K22

C (Diexcl a2 )C1 C a2C2 (43)

with the derivative

iexcl (Diexcl a2 ) K11 C (Diexcl 2a2 )K12 C a2K22 iexcl C1 C C2 (44)

Setting this to zero and solving for a2 we get

a2 DD(K11 iexcl K12) C C1 iexcl C2

K11 C K22 iexcl 2K12 (45)

Once a2 is found a1 can be recovered from a1 D Diexcl a2 If the new point(a1 a2) is outside of [0 1 (ordm ) ] the constrained optimum is found by pro-jecting a2 from equation 45 into the region allowed by the constraints andthen recomputing a1

The offset r is recomputed after every such stepAdditional insight can be obtained by rewriting equation 45 in terms of

the outputs of the kernel expansion on the examples x1 and x2 before theoptimization step Let acurren

1 acurren2 denote the values of their Lagrange parameter

before the step Then the corresponding outputs (cf equation 310) read

Oi D K1iacurren1 C K2ia

curren2 C Ci (46)

Using the latter to eliminate the Ci we end up with an update equation fora2 which does not explicitly depend on acurren

1

a2 D acurren2 C

O1 iexcl O2

K11 C K22 iexcl 2K12 (47)

which shows that the update is essentially the fraction of rst and secondderivative of the objective function along the direction of ordm-constraint sat-isfaction

Clearly the same elementary optimization step can be applied to anypair of two variables not just a1 and a2 We next briey describe how to dothe overall optimization

Estimating the Support of a High-Dimensional Distribution 1451

42 Initialization of the Algorithm We start by setting a fraction ordm ofall ai randomly chosen to 1 (ordm ) If ordm` is not an integer then one of theexamples is set to a value in (0 1 (ordm ) ) to ensure that

Pi ai D 1 Moreover

we set the initial r to maxfOi i 2 [ ] ai gt 0g

43 Optimization Algorithm We then select a rst variable for the ele-mentary optimization step in one of the two following ways Here we usethe shorthand SVnb for the indices of variables that are not at bound that isSVnb D fi i 2 [ ] 0 lt ai lt 1(ordm )g At the end these correspond to pointsthat will sit exactly on the hyperplane and that will therefore have a stronginuence on its precise position

We scan over the entire data set2 until we nd a variable violating aKarush-kuhn-Tucker (KKT) condition (Bertsekas 1995) that is a point suchthat (Oi iexclr ) cent ai gt 0 or (r iexclOi) cent (1(ordm ) iexcl ai) gt 0 Once we have found onesay ai we pick aj according to

j D arg maxn2SVnb

|Oi iexcl On | (48)

We repeat that step but the scan is performed only over SVnbIn practice one scan of the rst type is followed by multiple scans of

the second type until there are no KKT violators in SVnb whereupon theoptimization goes back to a single scan of the rst type If it nds no KKTviolators the optimization terminates

In unusual circumstances the choice heuristic equation 48 cannot makepositive progress Therefore a hierarchy of other choice heuristics is appliedto ensure positive progress These other heuristics are the same as in the caseof pattern recognition (cf Platt 1999) and have been found to work well inthe experiments we report below

In our experiments with SMO applied to distribution support estima-tion we have always found it to converge However to ensure convergenceeven in rare pathological conditions the algorithm can be modied slightly(Keerthi Shevade Bhattacharyya amp Murthy 1999)

We end this section by stating a trick that is of importance in practicalimplementations In practice one has to use a nonzero accuracy tolerancewhen checking whether two quantities are equal In particular comparisonsof this type are used in determining whether a point lies on the margin Sincewe want the nal decision function to evaluate to 1 for points that lie on themargin we need to subtract this constant from the offset r at the end

In the next section it will be argued that subtracting something from r

is actually advisable also from a statistical point of view

2 This scan can be accelerated by not checking patterns that are on the correct side ofthe hyperplane by a large margin using the method of Joachims (1999)

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 2: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1444 Bernhard Scholkopf et al

ed Generally they can be characterized as estimating functions of the datawhich reveal something interesting about the underlying distributions Forinstance kernel principal component analysis (PCA) can be characterized ascomputing functions that on the training data produce unit variance outputswhile having minimum norm in feature space (Scholkopf Smola amp Muller1999) Another kernel-based unsupervised learning technique regularizedprincipal manifolds (Smola Mika Scholkopf amp Williamson in press) com-putes functions that give a mapping onto a lower-dimensional manifoldminimizing a regularized quantization error Clustering algorithms are fur-ther examples of unsupervised learning techniques that can be kernelized(Scholkopf Smola amp Muller 1999)

An extreme point of view is that unsupervised learning is about estimat-ing densities Clearly knowledge of the density of P would then allow usto solve whatever problem can be solved on the basis of the data

The work presented here addresses an easier problem it proposes analgorithm that computes a binary function that is supposed to capture re-gions in input space where the probability density lives (its support) thatis a function such that most of the data will live in the region where thefunction is nonzero (Scholkopf Williamson Smola Shawe-Taylor 1999) Indoing so it is in line with Vapnikrsquos principle never to solve a problem thatis more general than the one we actually need to solve Moreover it is alsoapplicable in cases where the density of the datarsquos distribution is not evenwell dened for example if there are singular components

After a review of some previous work in section 2 we propose SV al-gorithms for the considered problem section 4 gives details on the imple-mentation of the optimization procedure followed by theoretical resultscharacterizing the present approach In section 6 we apply the algorithmto articial as well as real-world data We conclude with a discussion

2 Previous Work

In order to describe some previous work it is convenient to introduce thefollowing denition of a (multidimensional) quantile function introducedby Einmal and Mason (1992) Let x1 x` be independently and identicallydistributed (iid) random variables in a set X with distribution P Let C be aclass of measurable subsets of X and let l be a real-valued function denedon C The quantile function with respect to (P l C ) is

U(a) D inffl(C) P(C) cedil a C 2 C g 0 lt a middot 1

In the special case where P is the empirical distribution (P`(C) D 1`

P`iD1

1C(xi )) we obtain the empirical quantile function We denote by C(a) andC`(a) the (not necessarily unique) C 2 C that attains the inmum (when itis achievable) The most common choice of l is Lebesgue measure in whichcase C(a) is the minimum volume C 2 C that contains at least a fraction a

Estimating the Support of a High-Dimensional Distribution 1445

of the probability mass Estimators of the form C`(a) are called minimumvolume estimators

Observe that for C being all Borel measurable sets C(1) is the supportof the density p corresponding to P assuming it exists (Note that C(1) iswell dened even when p does not exist) For smaller classes C C(1) is theminimum volume C 2 C containing the support of p

Turning to the case where a lt 1 it seems the rst work was reportedby Sager (1979) and then Hartigan (1987) who considered X D R2 with Cbeing the class of closed convex sets in X (They actually considered densitycontour clusters cf appendix A for a denition) Nolan (1991) consideredhigher dimensions with C being the class of ellipsoids Tsybakov (1997)has studied an estimator based on piecewise polynomial approximation ofC(a) and has shown it attains the asymptotically minimax rate for certainclasses of densities p Polonik (1997) has studied the estimation of C(a)by C`(a) He derived asymptotic rates of convergence in terms of variousmeasures of richness of C He considered both VC classes and classes witha log 2 -covering number with bracketing of order O

iexcl2 iexclrcent for r gt 0 More

information on minimum volume estimators can be found in that work andin appendix A

A number of applications have been suggested for these techniques Theyinclude problems in medical diagnosis (Tarassenko Hayton Cerneaz ampBrady 1995) marketing (Ben-David amp Lindenbaum 1997) condition moni-toring of machines (Devroyeamp Wise 1980) estimating manufacturing yields(Stoneking 1999) econometrics and generalized nonlinear principal curves(Tsybakov 1997Korostelev amp Tsybakov 1993) regression and spectral anal-ysis (Polonik 1997) tests for multimodality and clustering (Polonik 1995b)and others (Muller 1992) Most of this work in particular that with a the-oretical slant does not go all the way in devising practical algorithms thatwork on high-dimensional real-world-problems A notable exception to thisis the work of Tarassenko et al (1995)

Polonik (1995a) has shown how one can use estimators of C(a) to con-struct density estimators The point of doing this is that it allows one toencode a range of prior assumptions about the true density p that would beimpossible to do within the traditional density estimation framework Hehas shown asymptotic consistency and rates of convergence for densitiesbelonging to VC-classes or with a known rate of growth of metric entropywith bracketing

Let us conclude this section with a short discussion of how the workpresented here relates to the above This article describes an algorithm thatnds regions close to C(a) Our class C is dened implicitly via a kernel kas the set of half-spaces in an SV feature space We do not try to minimizethe volume of C in input space Instead we minimize an SV-style regular-izer that using a kernel controls the smoothness of the estimated functiondescribing C In terms of multidimensional quantiles our approach can bethought of as employing l(Cw) D kwk2 where Cw D fx fw(x) cedil rg Here

1446 Bernhard Scholkopf et al

(w r ) are a weight vector and an offset parameterizing a hyperplane in thefeature space associated with the kernel

The main contribution of this work is that we propose an algorithm thathas tractable computational complexity even in high-dimensional casesOur theory which uses tools very similar to those used by Polonik givesresults that we expect will be of more use in a nite sample size setting

3 Algorithms

We rst introduce terminology and notation conventions We consider train-ing data

x1 x` 2 X (31)

where ` 2 N is the number of observations and X is some set For simplicitywe think of it as a compact subset of RN Let W be a feature map X Fthat is a map into an inner product space F such that the inner product inthe image of W can be computed by evaluating some simple kernel (BoserGuyon amp Vapnik (1992) Vapnik (1995) Scholkopf Burges et al (1999))

k(x y) D (W (x) cent W (y) ) (32)

such as the gaussian kernel

k(x y) D eiexclkxiexclyk2c (33)

Indices i and j are understood to range over 1 ` (in compact notationi j 2 [ ]) Boldface Greek letters denote -dimensional vectors whose com-ponents are labeled using a normal typeface

In the remainder of this section we develop an algorithm that returnsa function f that takes the value C1 in a ldquosmallrdquo region capturing mostof the data points and iexcl1 elsewhere Our strategy is to map the data intothe feature space corresponding to the kernel and to separate them fromthe origin with maximum margin For a new point x the value f (x) isdetermined by evaluating which side of the hyperplane it falls on in featurespace Via the freedom to use different types of kernel functions this simplegeometric picture corresponds to a variety of nonlinear estimators in inputspace

To separate the data set from the origin we solve the following quadraticprogram

minw2Fraquo2R`r2R

12kwk2 C 1

ordm`

Pi ji iexcl r (34)

subject to (w cent W (xi) ) cedil r iexclji ji cedil 0 (35)

Here ordm 2 (0 1] is a parameter whose meaning will become clear later

Estimating the Support of a High-Dimensional Distribution 1447

Since nonzero slack variables ji are penalized in the objective functionwe can expect that if w and r solve this problem then the decision function

f (x) D sgn((w cent W (x) ) iexcl r ) (36)

will be positive for most examples xi contained in the training set1 whilethe SV type regularization term kwk will still be small The actual trade-offbetween these two goals is controlled by ordm

Using multipliers ai macri cedil 0 we introduce a Lagrangian

L(w raquo r reg macr) D12

kwk2 C1

ordm`

X

iji iexcl r

iexclX

iai ( (w cent W (xi) ) iexcl r C ji) iexcl

X

imacriji (37)

and set the derivatives with respect to the primal variables w raquo r equal tozero yielding

w DX

iaiW (xi) (38)

ai D1

ordm`iexcl macri middot

1ordm`

X

iai D 1 (39)

In equation 38 all patterns fxi i 2 [ ] ai gt 0g are called support vec-tors Together with equation 32 the SV expansion transforms the decisionfunction equation 36 into a kernel expansion

f (x) D sgn

sup3X

iaik(xi x) iexcl r

acute (310)

Substituting equation 38 and equation 39 into L (see equation 37) andusing equation 32 we obtain the dual problem

minreg

12

X

ij

aiajk(xi xj) subject to 0 middot ai middot1

ordm`

X

iai D 1 (311)

One can show that at the optimum the two inequality constraints equa-tion 35 become equalities if ai and macri are nonzero that is if 0 lt ai lt 1(ordm )Therefore we can recover r by exploiting that for any such ai the corre-sponding pattern xi satises

r D (w cent W (xi) ) DX

j

ajk(xj xi) (312)

1 We use the convention that sgn(z) equals 1 for z cedil 0 and iexcl1 otherwise

1448 Bernhard Scholkopf et al

Note that if ordm approaches 0 the upper boundaries on the Lagrange mul-tipliers tend to innity that is the second inequality constraint in equa-tion 311 becomes void The problem then resembles the correspondinghard margin algorithm since the penalization of errors becomes innite ascan be seen from the primal objective function (see equation 34) It is still afeasible problem since we have placed no restriction on the offset r so it canbecome a large negative number in order to satisfy equation 35 If we hadrequired r cedil 0 from the start we would have ended up with the constraintP

i ai cedil 1 instead of the corresponding equality constraint in equation 311and the multipliers ai could have diverged

It is instructive to compare equation 311 to a Parzen windows estimatorTo this end suppose we use a kernel that can be normalized as a density ininput space such as the gaussian (see equation 33) If we useordm D 1 then thetwo constraints only allow the solution a1 D cent cent cent D a` D 1 Thus the kernelexpansion in equation 310 reduces to a Parzen windows estimate of theunderlying density For ordm lt 1 the equality constraint in equation 311 stillensures that the decision function is a thresholded density however in thatcase the density will be represented only by a subset of training examples(the SVs)mdashthose that are important for the decision (see equation 310) tobe taken Section 5 will explain the precise meaning of ordm

To conclude this section we note that one can also use balls to describe thedata in feature space close in spirit to the algorithms of Scholkopf Burgesand Vapnik (1995) with hard boundaries and Tax and Duin (1999) withldquosoft marginsrdquo Again we try to put most of the data into a small ball bysolving for ordm 2 (0 1)

minR2Rraquo2R`c2F

R2 C1

ordm`

X

iji

subject to kW (xi ) iexcl ck2 middot R2 C ji ji cedil 0 for i 2 [ ] (313)

This leads to the dual

minreg

X

ij

aiajk(xi xj) iexclX

iaik(xi xi) (314)

subject to 0 middot ai middot1

ordm`

X

iai D 1 (315)

and the solution

c DX

iaiW (xi) (316)

corresponding to a decision function of the form

f (x) D sgn

0

R2 iexclX

ij

aiajk(xi xj) C 2X

iaik(xi x) iexcl k(x x)

1

A (317)

Estimating the Support of a High-Dimensional Distribution 1449

Similar to the above R2 is computed such that for any xi with 0 lt ai lt1(ordm ) the argument of the sgn is zero

For kernels k(x y) that depend on only x iexcl y k(x x) is constant In thiscase the equality constraint implies that the linear term in the dual targetfunction is constant and the problem 314 and 315 turns out to be equiv-alent to equation 311 It can be shown that the same holds true for thedecision function hence the two algorithms coincide in that case This isgeometrically plausible For constant k(x x) all mapped patterns lie on asphere in feature space Therefore nding the smallest sphere (containingthe points) really amounts to nding the smallest segment of the spherethat the data live on The segment however can be found in a straightfor-ward way by simply intersecting the data sphere with a hyperplane thehyperplane with maximum margin of separation to the origin will cut offthe smallest segment

4 Optimization

Section 3 formulated quadratic programs (QPs) for computing regions thatcapture a certain fraction of the data These constrained optimization prob-lems can be solved using an off-the-shelf QP package to compute the solu-tion They do however possess features that set them apart from genericQPs most notably the simplicity of the constraints In this section we de-scribe an algorithm that takes advantage of these features and empiricallyscales better to large data set sizes than a standard QP solver with time com-plexity of order O

iexcl 3cent(cf Platt 1999) The algorithm is a modied version

of SMO (sequential minimal optimization) an SV training algorithm origi-nally proposed for classication (Platt 1999) and subsequently adapted toregression estimation (Smola amp Scholkopf in press)

The strategy of SMO is to break up the constrained minimization ofequation 311 into the smallest optimization steps possible Due to the con-straint on the sum of the dual variables it is impossible to modify individualvariables separately without possibly violating the constraint We thereforeresort to optimizing over pairs of variables

41 Elementary Optimization Step For instance consider optimizingover a1 and a2 with all other variables xed Using the shorthand Kij Dk(xi xj) equation 311 then reduces to

mina1 a2

12

2X

i jD1

aiajKij C2X

iD1

aiCi C C (41)

1450 Bernhard Scholkopf et al

with Ci DP`

jD3 ajKij and C DP`

i jD3 aiajKij subject to

0 middot a1 a2 middot1

ordm`

2X

iD1

ai D D (42)

where D D 1 iexclP`

iD3 ai We discard C which is independent of a1 and a2 and eliminate a1 to

obtain

mina2

12

(Diexcl a2 )2K11 C (Diexcl a2 )a2K12 C12

a22K22

C (Diexcl a2 )C1 C a2C2 (43)

with the derivative

iexcl (Diexcl a2 ) K11 C (Diexcl 2a2 )K12 C a2K22 iexcl C1 C C2 (44)

Setting this to zero and solving for a2 we get

a2 DD(K11 iexcl K12) C C1 iexcl C2

K11 C K22 iexcl 2K12 (45)

Once a2 is found a1 can be recovered from a1 D Diexcl a2 If the new point(a1 a2) is outside of [0 1 (ordm ) ] the constrained optimum is found by pro-jecting a2 from equation 45 into the region allowed by the constraints andthen recomputing a1

The offset r is recomputed after every such stepAdditional insight can be obtained by rewriting equation 45 in terms of

the outputs of the kernel expansion on the examples x1 and x2 before theoptimization step Let acurren

1 acurren2 denote the values of their Lagrange parameter

before the step Then the corresponding outputs (cf equation 310) read

Oi D K1iacurren1 C K2ia

curren2 C Ci (46)

Using the latter to eliminate the Ci we end up with an update equation fora2 which does not explicitly depend on acurren

1

a2 D acurren2 C

O1 iexcl O2

K11 C K22 iexcl 2K12 (47)

which shows that the update is essentially the fraction of rst and secondderivative of the objective function along the direction of ordm-constraint sat-isfaction

Clearly the same elementary optimization step can be applied to anypair of two variables not just a1 and a2 We next briey describe how to dothe overall optimization

Estimating the Support of a High-Dimensional Distribution 1451

42 Initialization of the Algorithm We start by setting a fraction ordm ofall ai randomly chosen to 1 (ordm ) If ordm` is not an integer then one of theexamples is set to a value in (0 1 (ordm ) ) to ensure that

Pi ai D 1 Moreover

we set the initial r to maxfOi i 2 [ ] ai gt 0g

43 Optimization Algorithm We then select a rst variable for the ele-mentary optimization step in one of the two following ways Here we usethe shorthand SVnb for the indices of variables that are not at bound that isSVnb D fi i 2 [ ] 0 lt ai lt 1(ordm )g At the end these correspond to pointsthat will sit exactly on the hyperplane and that will therefore have a stronginuence on its precise position

We scan over the entire data set2 until we nd a variable violating aKarush-kuhn-Tucker (KKT) condition (Bertsekas 1995) that is a point suchthat (Oi iexclr ) cent ai gt 0 or (r iexclOi) cent (1(ordm ) iexcl ai) gt 0 Once we have found onesay ai we pick aj according to

j D arg maxn2SVnb

|Oi iexcl On | (48)

We repeat that step but the scan is performed only over SVnbIn practice one scan of the rst type is followed by multiple scans of

the second type until there are no KKT violators in SVnb whereupon theoptimization goes back to a single scan of the rst type If it nds no KKTviolators the optimization terminates

In unusual circumstances the choice heuristic equation 48 cannot makepositive progress Therefore a hierarchy of other choice heuristics is appliedto ensure positive progress These other heuristics are the same as in the caseof pattern recognition (cf Platt 1999) and have been found to work well inthe experiments we report below

In our experiments with SMO applied to distribution support estima-tion we have always found it to converge However to ensure convergenceeven in rare pathological conditions the algorithm can be modied slightly(Keerthi Shevade Bhattacharyya amp Murthy 1999)

We end this section by stating a trick that is of importance in practicalimplementations In practice one has to use a nonzero accuracy tolerancewhen checking whether two quantities are equal In particular comparisonsof this type are used in determining whether a point lies on the margin Sincewe want the nal decision function to evaluate to 1 for points that lie on themargin we need to subtract this constant from the offset r at the end

In the next section it will be argued that subtracting something from r

is actually advisable also from a statistical point of view

2 This scan can be accelerated by not checking patterns that are on the correct side ofthe hyperplane by a large margin using the method of Joachims (1999)

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 3: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1445

of the probability mass Estimators of the form C`(a) are called minimumvolume estimators

Observe that for C being all Borel measurable sets C(1) is the supportof the density p corresponding to P assuming it exists (Note that C(1) iswell dened even when p does not exist) For smaller classes C C(1) is theminimum volume C 2 C containing the support of p

Turning to the case where a lt 1 it seems the rst work was reportedby Sager (1979) and then Hartigan (1987) who considered X D R2 with Cbeing the class of closed convex sets in X (They actually considered densitycontour clusters cf appendix A for a denition) Nolan (1991) consideredhigher dimensions with C being the class of ellipsoids Tsybakov (1997)has studied an estimator based on piecewise polynomial approximation ofC(a) and has shown it attains the asymptotically minimax rate for certainclasses of densities p Polonik (1997) has studied the estimation of C(a)by C`(a) He derived asymptotic rates of convergence in terms of variousmeasures of richness of C He considered both VC classes and classes witha log 2 -covering number with bracketing of order O

iexcl2 iexclrcent for r gt 0 More

information on minimum volume estimators can be found in that work andin appendix A

A number of applications have been suggested for these techniques Theyinclude problems in medical diagnosis (Tarassenko Hayton Cerneaz ampBrady 1995) marketing (Ben-David amp Lindenbaum 1997) condition moni-toring of machines (Devroyeamp Wise 1980) estimating manufacturing yields(Stoneking 1999) econometrics and generalized nonlinear principal curves(Tsybakov 1997Korostelev amp Tsybakov 1993) regression and spectral anal-ysis (Polonik 1997) tests for multimodality and clustering (Polonik 1995b)and others (Muller 1992) Most of this work in particular that with a the-oretical slant does not go all the way in devising practical algorithms thatwork on high-dimensional real-world-problems A notable exception to thisis the work of Tarassenko et al (1995)

Polonik (1995a) has shown how one can use estimators of C(a) to con-struct density estimators The point of doing this is that it allows one toencode a range of prior assumptions about the true density p that would beimpossible to do within the traditional density estimation framework Hehas shown asymptotic consistency and rates of convergence for densitiesbelonging to VC-classes or with a known rate of growth of metric entropywith bracketing

Let us conclude this section with a short discussion of how the workpresented here relates to the above This article describes an algorithm thatnds regions close to C(a) Our class C is dened implicitly via a kernel kas the set of half-spaces in an SV feature space We do not try to minimizethe volume of C in input space Instead we minimize an SV-style regular-izer that using a kernel controls the smoothness of the estimated functiondescribing C In terms of multidimensional quantiles our approach can bethought of as employing l(Cw) D kwk2 where Cw D fx fw(x) cedil rg Here

1446 Bernhard Scholkopf et al

(w r ) are a weight vector and an offset parameterizing a hyperplane in thefeature space associated with the kernel

The main contribution of this work is that we propose an algorithm thathas tractable computational complexity even in high-dimensional casesOur theory which uses tools very similar to those used by Polonik givesresults that we expect will be of more use in a nite sample size setting

3 Algorithms

We rst introduce terminology and notation conventions We consider train-ing data

x1 x` 2 X (31)

where ` 2 N is the number of observations and X is some set For simplicitywe think of it as a compact subset of RN Let W be a feature map X Fthat is a map into an inner product space F such that the inner product inthe image of W can be computed by evaluating some simple kernel (BoserGuyon amp Vapnik (1992) Vapnik (1995) Scholkopf Burges et al (1999))

k(x y) D (W (x) cent W (y) ) (32)

such as the gaussian kernel

k(x y) D eiexclkxiexclyk2c (33)

Indices i and j are understood to range over 1 ` (in compact notationi j 2 [ ]) Boldface Greek letters denote -dimensional vectors whose com-ponents are labeled using a normal typeface

In the remainder of this section we develop an algorithm that returnsa function f that takes the value C1 in a ldquosmallrdquo region capturing mostof the data points and iexcl1 elsewhere Our strategy is to map the data intothe feature space corresponding to the kernel and to separate them fromthe origin with maximum margin For a new point x the value f (x) isdetermined by evaluating which side of the hyperplane it falls on in featurespace Via the freedom to use different types of kernel functions this simplegeometric picture corresponds to a variety of nonlinear estimators in inputspace

To separate the data set from the origin we solve the following quadraticprogram

minw2Fraquo2R`r2R

12kwk2 C 1

ordm`

Pi ji iexcl r (34)

subject to (w cent W (xi) ) cedil r iexclji ji cedil 0 (35)

Here ordm 2 (0 1] is a parameter whose meaning will become clear later

Estimating the Support of a High-Dimensional Distribution 1447

Since nonzero slack variables ji are penalized in the objective functionwe can expect that if w and r solve this problem then the decision function

f (x) D sgn((w cent W (x) ) iexcl r ) (36)

will be positive for most examples xi contained in the training set1 whilethe SV type regularization term kwk will still be small The actual trade-offbetween these two goals is controlled by ordm

Using multipliers ai macri cedil 0 we introduce a Lagrangian

L(w raquo r reg macr) D12

kwk2 C1

ordm`

X

iji iexcl r

iexclX

iai ( (w cent W (xi) ) iexcl r C ji) iexcl

X

imacriji (37)

and set the derivatives with respect to the primal variables w raquo r equal tozero yielding

w DX

iaiW (xi) (38)

ai D1

ordm`iexcl macri middot

1ordm`

X

iai D 1 (39)

In equation 38 all patterns fxi i 2 [ ] ai gt 0g are called support vec-tors Together with equation 32 the SV expansion transforms the decisionfunction equation 36 into a kernel expansion

f (x) D sgn

sup3X

iaik(xi x) iexcl r

acute (310)

Substituting equation 38 and equation 39 into L (see equation 37) andusing equation 32 we obtain the dual problem

minreg

12

X

ij

aiajk(xi xj) subject to 0 middot ai middot1

ordm`

X

iai D 1 (311)

One can show that at the optimum the two inequality constraints equa-tion 35 become equalities if ai and macri are nonzero that is if 0 lt ai lt 1(ordm )Therefore we can recover r by exploiting that for any such ai the corre-sponding pattern xi satises

r D (w cent W (xi) ) DX

j

ajk(xj xi) (312)

1 We use the convention that sgn(z) equals 1 for z cedil 0 and iexcl1 otherwise

1448 Bernhard Scholkopf et al

Note that if ordm approaches 0 the upper boundaries on the Lagrange mul-tipliers tend to innity that is the second inequality constraint in equa-tion 311 becomes void The problem then resembles the correspondinghard margin algorithm since the penalization of errors becomes innite ascan be seen from the primal objective function (see equation 34) It is still afeasible problem since we have placed no restriction on the offset r so it canbecome a large negative number in order to satisfy equation 35 If we hadrequired r cedil 0 from the start we would have ended up with the constraintP

i ai cedil 1 instead of the corresponding equality constraint in equation 311and the multipliers ai could have diverged

It is instructive to compare equation 311 to a Parzen windows estimatorTo this end suppose we use a kernel that can be normalized as a density ininput space such as the gaussian (see equation 33) If we useordm D 1 then thetwo constraints only allow the solution a1 D cent cent cent D a` D 1 Thus the kernelexpansion in equation 310 reduces to a Parzen windows estimate of theunderlying density For ordm lt 1 the equality constraint in equation 311 stillensures that the decision function is a thresholded density however in thatcase the density will be represented only by a subset of training examples(the SVs)mdashthose that are important for the decision (see equation 310) tobe taken Section 5 will explain the precise meaning of ordm

To conclude this section we note that one can also use balls to describe thedata in feature space close in spirit to the algorithms of Scholkopf Burgesand Vapnik (1995) with hard boundaries and Tax and Duin (1999) withldquosoft marginsrdquo Again we try to put most of the data into a small ball bysolving for ordm 2 (0 1)

minR2Rraquo2R`c2F

R2 C1

ordm`

X

iji

subject to kW (xi ) iexcl ck2 middot R2 C ji ji cedil 0 for i 2 [ ] (313)

This leads to the dual

minreg

X

ij

aiajk(xi xj) iexclX

iaik(xi xi) (314)

subject to 0 middot ai middot1

ordm`

X

iai D 1 (315)

and the solution

c DX

iaiW (xi) (316)

corresponding to a decision function of the form

f (x) D sgn

0

R2 iexclX

ij

aiajk(xi xj) C 2X

iaik(xi x) iexcl k(x x)

1

A (317)

Estimating the Support of a High-Dimensional Distribution 1449

Similar to the above R2 is computed such that for any xi with 0 lt ai lt1(ordm ) the argument of the sgn is zero

For kernels k(x y) that depend on only x iexcl y k(x x) is constant In thiscase the equality constraint implies that the linear term in the dual targetfunction is constant and the problem 314 and 315 turns out to be equiv-alent to equation 311 It can be shown that the same holds true for thedecision function hence the two algorithms coincide in that case This isgeometrically plausible For constant k(x x) all mapped patterns lie on asphere in feature space Therefore nding the smallest sphere (containingthe points) really amounts to nding the smallest segment of the spherethat the data live on The segment however can be found in a straightfor-ward way by simply intersecting the data sphere with a hyperplane thehyperplane with maximum margin of separation to the origin will cut offthe smallest segment

4 Optimization

Section 3 formulated quadratic programs (QPs) for computing regions thatcapture a certain fraction of the data These constrained optimization prob-lems can be solved using an off-the-shelf QP package to compute the solu-tion They do however possess features that set them apart from genericQPs most notably the simplicity of the constraints In this section we de-scribe an algorithm that takes advantage of these features and empiricallyscales better to large data set sizes than a standard QP solver with time com-plexity of order O

iexcl 3cent(cf Platt 1999) The algorithm is a modied version

of SMO (sequential minimal optimization) an SV training algorithm origi-nally proposed for classication (Platt 1999) and subsequently adapted toregression estimation (Smola amp Scholkopf in press)

The strategy of SMO is to break up the constrained minimization ofequation 311 into the smallest optimization steps possible Due to the con-straint on the sum of the dual variables it is impossible to modify individualvariables separately without possibly violating the constraint We thereforeresort to optimizing over pairs of variables

41 Elementary Optimization Step For instance consider optimizingover a1 and a2 with all other variables xed Using the shorthand Kij Dk(xi xj) equation 311 then reduces to

mina1 a2

12

2X

i jD1

aiajKij C2X

iD1

aiCi C C (41)

1450 Bernhard Scholkopf et al

with Ci DP`

jD3 ajKij and C DP`

i jD3 aiajKij subject to

0 middot a1 a2 middot1

ordm`

2X

iD1

ai D D (42)

where D D 1 iexclP`

iD3 ai We discard C which is independent of a1 and a2 and eliminate a1 to

obtain

mina2

12

(Diexcl a2 )2K11 C (Diexcl a2 )a2K12 C12

a22K22

C (Diexcl a2 )C1 C a2C2 (43)

with the derivative

iexcl (Diexcl a2 ) K11 C (Diexcl 2a2 )K12 C a2K22 iexcl C1 C C2 (44)

Setting this to zero and solving for a2 we get

a2 DD(K11 iexcl K12) C C1 iexcl C2

K11 C K22 iexcl 2K12 (45)

Once a2 is found a1 can be recovered from a1 D Diexcl a2 If the new point(a1 a2) is outside of [0 1 (ordm ) ] the constrained optimum is found by pro-jecting a2 from equation 45 into the region allowed by the constraints andthen recomputing a1

The offset r is recomputed after every such stepAdditional insight can be obtained by rewriting equation 45 in terms of

the outputs of the kernel expansion on the examples x1 and x2 before theoptimization step Let acurren

1 acurren2 denote the values of their Lagrange parameter

before the step Then the corresponding outputs (cf equation 310) read

Oi D K1iacurren1 C K2ia

curren2 C Ci (46)

Using the latter to eliminate the Ci we end up with an update equation fora2 which does not explicitly depend on acurren

1

a2 D acurren2 C

O1 iexcl O2

K11 C K22 iexcl 2K12 (47)

which shows that the update is essentially the fraction of rst and secondderivative of the objective function along the direction of ordm-constraint sat-isfaction

Clearly the same elementary optimization step can be applied to anypair of two variables not just a1 and a2 We next briey describe how to dothe overall optimization

Estimating the Support of a High-Dimensional Distribution 1451

42 Initialization of the Algorithm We start by setting a fraction ordm ofall ai randomly chosen to 1 (ordm ) If ordm` is not an integer then one of theexamples is set to a value in (0 1 (ordm ) ) to ensure that

Pi ai D 1 Moreover

we set the initial r to maxfOi i 2 [ ] ai gt 0g

43 Optimization Algorithm We then select a rst variable for the ele-mentary optimization step in one of the two following ways Here we usethe shorthand SVnb for the indices of variables that are not at bound that isSVnb D fi i 2 [ ] 0 lt ai lt 1(ordm )g At the end these correspond to pointsthat will sit exactly on the hyperplane and that will therefore have a stronginuence on its precise position

We scan over the entire data set2 until we nd a variable violating aKarush-kuhn-Tucker (KKT) condition (Bertsekas 1995) that is a point suchthat (Oi iexclr ) cent ai gt 0 or (r iexclOi) cent (1(ordm ) iexcl ai) gt 0 Once we have found onesay ai we pick aj according to

j D arg maxn2SVnb

|Oi iexcl On | (48)

We repeat that step but the scan is performed only over SVnbIn practice one scan of the rst type is followed by multiple scans of

the second type until there are no KKT violators in SVnb whereupon theoptimization goes back to a single scan of the rst type If it nds no KKTviolators the optimization terminates

In unusual circumstances the choice heuristic equation 48 cannot makepositive progress Therefore a hierarchy of other choice heuristics is appliedto ensure positive progress These other heuristics are the same as in the caseof pattern recognition (cf Platt 1999) and have been found to work well inthe experiments we report below

In our experiments with SMO applied to distribution support estima-tion we have always found it to converge However to ensure convergenceeven in rare pathological conditions the algorithm can be modied slightly(Keerthi Shevade Bhattacharyya amp Murthy 1999)

We end this section by stating a trick that is of importance in practicalimplementations In practice one has to use a nonzero accuracy tolerancewhen checking whether two quantities are equal In particular comparisonsof this type are used in determining whether a point lies on the margin Sincewe want the nal decision function to evaluate to 1 for points that lie on themargin we need to subtract this constant from the offset r at the end

In the next section it will be argued that subtracting something from r

is actually advisable also from a statistical point of view

2 This scan can be accelerated by not checking patterns that are on the correct side ofthe hyperplane by a large margin using the method of Joachims (1999)

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 4: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1446 Bernhard Scholkopf et al

(w r ) are a weight vector and an offset parameterizing a hyperplane in thefeature space associated with the kernel

The main contribution of this work is that we propose an algorithm thathas tractable computational complexity even in high-dimensional casesOur theory which uses tools very similar to those used by Polonik givesresults that we expect will be of more use in a nite sample size setting

3 Algorithms

We rst introduce terminology and notation conventions We consider train-ing data

x1 x` 2 X (31)

where ` 2 N is the number of observations and X is some set For simplicitywe think of it as a compact subset of RN Let W be a feature map X Fthat is a map into an inner product space F such that the inner product inthe image of W can be computed by evaluating some simple kernel (BoserGuyon amp Vapnik (1992) Vapnik (1995) Scholkopf Burges et al (1999))

k(x y) D (W (x) cent W (y) ) (32)

such as the gaussian kernel

k(x y) D eiexclkxiexclyk2c (33)

Indices i and j are understood to range over 1 ` (in compact notationi j 2 [ ]) Boldface Greek letters denote -dimensional vectors whose com-ponents are labeled using a normal typeface

In the remainder of this section we develop an algorithm that returnsa function f that takes the value C1 in a ldquosmallrdquo region capturing mostof the data points and iexcl1 elsewhere Our strategy is to map the data intothe feature space corresponding to the kernel and to separate them fromthe origin with maximum margin For a new point x the value f (x) isdetermined by evaluating which side of the hyperplane it falls on in featurespace Via the freedom to use different types of kernel functions this simplegeometric picture corresponds to a variety of nonlinear estimators in inputspace

To separate the data set from the origin we solve the following quadraticprogram

minw2Fraquo2R`r2R

12kwk2 C 1

ordm`

Pi ji iexcl r (34)

subject to (w cent W (xi) ) cedil r iexclji ji cedil 0 (35)

Here ordm 2 (0 1] is a parameter whose meaning will become clear later

Estimating the Support of a High-Dimensional Distribution 1447

Since nonzero slack variables ji are penalized in the objective functionwe can expect that if w and r solve this problem then the decision function

f (x) D sgn((w cent W (x) ) iexcl r ) (36)

will be positive for most examples xi contained in the training set1 whilethe SV type regularization term kwk will still be small The actual trade-offbetween these two goals is controlled by ordm

Using multipliers ai macri cedil 0 we introduce a Lagrangian

L(w raquo r reg macr) D12

kwk2 C1

ordm`

X

iji iexcl r

iexclX

iai ( (w cent W (xi) ) iexcl r C ji) iexcl

X

imacriji (37)

and set the derivatives with respect to the primal variables w raquo r equal tozero yielding

w DX

iaiW (xi) (38)

ai D1

ordm`iexcl macri middot

1ordm`

X

iai D 1 (39)

In equation 38 all patterns fxi i 2 [ ] ai gt 0g are called support vec-tors Together with equation 32 the SV expansion transforms the decisionfunction equation 36 into a kernel expansion

f (x) D sgn

sup3X

iaik(xi x) iexcl r

acute (310)

Substituting equation 38 and equation 39 into L (see equation 37) andusing equation 32 we obtain the dual problem

minreg

12

X

ij

aiajk(xi xj) subject to 0 middot ai middot1

ordm`

X

iai D 1 (311)

One can show that at the optimum the two inequality constraints equa-tion 35 become equalities if ai and macri are nonzero that is if 0 lt ai lt 1(ordm )Therefore we can recover r by exploiting that for any such ai the corre-sponding pattern xi satises

r D (w cent W (xi) ) DX

j

ajk(xj xi) (312)

1 We use the convention that sgn(z) equals 1 for z cedil 0 and iexcl1 otherwise

1448 Bernhard Scholkopf et al

Note that if ordm approaches 0 the upper boundaries on the Lagrange mul-tipliers tend to innity that is the second inequality constraint in equa-tion 311 becomes void The problem then resembles the correspondinghard margin algorithm since the penalization of errors becomes innite ascan be seen from the primal objective function (see equation 34) It is still afeasible problem since we have placed no restriction on the offset r so it canbecome a large negative number in order to satisfy equation 35 If we hadrequired r cedil 0 from the start we would have ended up with the constraintP

i ai cedil 1 instead of the corresponding equality constraint in equation 311and the multipliers ai could have diverged

It is instructive to compare equation 311 to a Parzen windows estimatorTo this end suppose we use a kernel that can be normalized as a density ininput space such as the gaussian (see equation 33) If we useordm D 1 then thetwo constraints only allow the solution a1 D cent cent cent D a` D 1 Thus the kernelexpansion in equation 310 reduces to a Parzen windows estimate of theunderlying density For ordm lt 1 the equality constraint in equation 311 stillensures that the decision function is a thresholded density however in thatcase the density will be represented only by a subset of training examples(the SVs)mdashthose that are important for the decision (see equation 310) tobe taken Section 5 will explain the precise meaning of ordm

To conclude this section we note that one can also use balls to describe thedata in feature space close in spirit to the algorithms of Scholkopf Burgesand Vapnik (1995) with hard boundaries and Tax and Duin (1999) withldquosoft marginsrdquo Again we try to put most of the data into a small ball bysolving for ordm 2 (0 1)

minR2Rraquo2R`c2F

R2 C1

ordm`

X

iji

subject to kW (xi ) iexcl ck2 middot R2 C ji ji cedil 0 for i 2 [ ] (313)

This leads to the dual

minreg

X

ij

aiajk(xi xj) iexclX

iaik(xi xi) (314)

subject to 0 middot ai middot1

ordm`

X

iai D 1 (315)

and the solution

c DX

iaiW (xi) (316)

corresponding to a decision function of the form

f (x) D sgn

0

R2 iexclX

ij

aiajk(xi xj) C 2X

iaik(xi x) iexcl k(x x)

1

A (317)

Estimating the Support of a High-Dimensional Distribution 1449

Similar to the above R2 is computed such that for any xi with 0 lt ai lt1(ordm ) the argument of the sgn is zero

For kernels k(x y) that depend on only x iexcl y k(x x) is constant In thiscase the equality constraint implies that the linear term in the dual targetfunction is constant and the problem 314 and 315 turns out to be equiv-alent to equation 311 It can be shown that the same holds true for thedecision function hence the two algorithms coincide in that case This isgeometrically plausible For constant k(x x) all mapped patterns lie on asphere in feature space Therefore nding the smallest sphere (containingthe points) really amounts to nding the smallest segment of the spherethat the data live on The segment however can be found in a straightfor-ward way by simply intersecting the data sphere with a hyperplane thehyperplane with maximum margin of separation to the origin will cut offthe smallest segment

4 Optimization

Section 3 formulated quadratic programs (QPs) for computing regions thatcapture a certain fraction of the data These constrained optimization prob-lems can be solved using an off-the-shelf QP package to compute the solu-tion They do however possess features that set them apart from genericQPs most notably the simplicity of the constraints In this section we de-scribe an algorithm that takes advantage of these features and empiricallyscales better to large data set sizes than a standard QP solver with time com-plexity of order O

iexcl 3cent(cf Platt 1999) The algorithm is a modied version

of SMO (sequential minimal optimization) an SV training algorithm origi-nally proposed for classication (Platt 1999) and subsequently adapted toregression estimation (Smola amp Scholkopf in press)

The strategy of SMO is to break up the constrained minimization ofequation 311 into the smallest optimization steps possible Due to the con-straint on the sum of the dual variables it is impossible to modify individualvariables separately without possibly violating the constraint We thereforeresort to optimizing over pairs of variables

41 Elementary Optimization Step For instance consider optimizingover a1 and a2 with all other variables xed Using the shorthand Kij Dk(xi xj) equation 311 then reduces to

mina1 a2

12

2X

i jD1

aiajKij C2X

iD1

aiCi C C (41)

1450 Bernhard Scholkopf et al

with Ci DP`

jD3 ajKij and C DP`

i jD3 aiajKij subject to

0 middot a1 a2 middot1

ordm`

2X

iD1

ai D D (42)

where D D 1 iexclP`

iD3 ai We discard C which is independent of a1 and a2 and eliminate a1 to

obtain

mina2

12

(Diexcl a2 )2K11 C (Diexcl a2 )a2K12 C12

a22K22

C (Diexcl a2 )C1 C a2C2 (43)

with the derivative

iexcl (Diexcl a2 ) K11 C (Diexcl 2a2 )K12 C a2K22 iexcl C1 C C2 (44)

Setting this to zero and solving for a2 we get

a2 DD(K11 iexcl K12) C C1 iexcl C2

K11 C K22 iexcl 2K12 (45)

Once a2 is found a1 can be recovered from a1 D Diexcl a2 If the new point(a1 a2) is outside of [0 1 (ordm ) ] the constrained optimum is found by pro-jecting a2 from equation 45 into the region allowed by the constraints andthen recomputing a1

The offset r is recomputed after every such stepAdditional insight can be obtained by rewriting equation 45 in terms of

the outputs of the kernel expansion on the examples x1 and x2 before theoptimization step Let acurren

1 acurren2 denote the values of their Lagrange parameter

before the step Then the corresponding outputs (cf equation 310) read

Oi D K1iacurren1 C K2ia

curren2 C Ci (46)

Using the latter to eliminate the Ci we end up with an update equation fora2 which does not explicitly depend on acurren

1

a2 D acurren2 C

O1 iexcl O2

K11 C K22 iexcl 2K12 (47)

which shows that the update is essentially the fraction of rst and secondderivative of the objective function along the direction of ordm-constraint sat-isfaction

Clearly the same elementary optimization step can be applied to anypair of two variables not just a1 and a2 We next briey describe how to dothe overall optimization

Estimating the Support of a High-Dimensional Distribution 1451

42 Initialization of the Algorithm We start by setting a fraction ordm ofall ai randomly chosen to 1 (ordm ) If ordm` is not an integer then one of theexamples is set to a value in (0 1 (ordm ) ) to ensure that

Pi ai D 1 Moreover

we set the initial r to maxfOi i 2 [ ] ai gt 0g

43 Optimization Algorithm We then select a rst variable for the ele-mentary optimization step in one of the two following ways Here we usethe shorthand SVnb for the indices of variables that are not at bound that isSVnb D fi i 2 [ ] 0 lt ai lt 1(ordm )g At the end these correspond to pointsthat will sit exactly on the hyperplane and that will therefore have a stronginuence on its precise position

We scan over the entire data set2 until we nd a variable violating aKarush-kuhn-Tucker (KKT) condition (Bertsekas 1995) that is a point suchthat (Oi iexclr ) cent ai gt 0 or (r iexclOi) cent (1(ordm ) iexcl ai) gt 0 Once we have found onesay ai we pick aj according to

j D arg maxn2SVnb

|Oi iexcl On | (48)

We repeat that step but the scan is performed only over SVnbIn practice one scan of the rst type is followed by multiple scans of

the second type until there are no KKT violators in SVnb whereupon theoptimization goes back to a single scan of the rst type If it nds no KKTviolators the optimization terminates

In unusual circumstances the choice heuristic equation 48 cannot makepositive progress Therefore a hierarchy of other choice heuristics is appliedto ensure positive progress These other heuristics are the same as in the caseof pattern recognition (cf Platt 1999) and have been found to work well inthe experiments we report below

In our experiments with SMO applied to distribution support estima-tion we have always found it to converge However to ensure convergenceeven in rare pathological conditions the algorithm can be modied slightly(Keerthi Shevade Bhattacharyya amp Murthy 1999)

We end this section by stating a trick that is of importance in practicalimplementations In practice one has to use a nonzero accuracy tolerancewhen checking whether two quantities are equal In particular comparisonsof this type are used in determining whether a point lies on the margin Sincewe want the nal decision function to evaluate to 1 for points that lie on themargin we need to subtract this constant from the offset r at the end

In the next section it will be argued that subtracting something from r

is actually advisable also from a statistical point of view

2 This scan can be accelerated by not checking patterns that are on the correct side ofthe hyperplane by a large margin using the method of Joachims (1999)

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 5: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1447

Since nonzero slack variables ji are penalized in the objective functionwe can expect that if w and r solve this problem then the decision function

f (x) D sgn((w cent W (x) ) iexcl r ) (36)

will be positive for most examples xi contained in the training set1 whilethe SV type regularization term kwk will still be small The actual trade-offbetween these two goals is controlled by ordm

Using multipliers ai macri cedil 0 we introduce a Lagrangian

L(w raquo r reg macr) D12

kwk2 C1

ordm`

X

iji iexcl r

iexclX

iai ( (w cent W (xi) ) iexcl r C ji) iexcl

X

imacriji (37)

and set the derivatives with respect to the primal variables w raquo r equal tozero yielding

w DX

iaiW (xi) (38)

ai D1

ordm`iexcl macri middot

1ordm`

X

iai D 1 (39)

In equation 38 all patterns fxi i 2 [ ] ai gt 0g are called support vec-tors Together with equation 32 the SV expansion transforms the decisionfunction equation 36 into a kernel expansion

f (x) D sgn

sup3X

iaik(xi x) iexcl r

acute (310)

Substituting equation 38 and equation 39 into L (see equation 37) andusing equation 32 we obtain the dual problem

minreg

12

X

ij

aiajk(xi xj) subject to 0 middot ai middot1

ordm`

X

iai D 1 (311)

One can show that at the optimum the two inequality constraints equa-tion 35 become equalities if ai and macri are nonzero that is if 0 lt ai lt 1(ordm )Therefore we can recover r by exploiting that for any such ai the corre-sponding pattern xi satises

r D (w cent W (xi) ) DX

j

ajk(xj xi) (312)

1 We use the convention that sgn(z) equals 1 for z cedil 0 and iexcl1 otherwise

1448 Bernhard Scholkopf et al

Note that if ordm approaches 0 the upper boundaries on the Lagrange mul-tipliers tend to innity that is the second inequality constraint in equa-tion 311 becomes void The problem then resembles the correspondinghard margin algorithm since the penalization of errors becomes innite ascan be seen from the primal objective function (see equation 34) It is still afeasible problem since we have placed no restriction on the offset r so it canbecome a large negative number in order to satisfy equation 35 If we hadrequired r cedil 0 from the start we would have ended up with the constraintP

i ai cedil 1 instead of the corresponding equality constraint in equation 311and the multipliers ai could have diverged

It is instructive to compare equation 311 to a Parzen windows estimatorTo this end suppose we use a kernel that can be normalized as a density ininput space such as the gaussian (see equation 33) If we useordm D 1 then thetwo constraints only allow the solution a1 D cent cent cent D a` D 1 Thus the kernelexpansion in equation 310 reduces to a Parzen windows estimate of theunderlying density For ordm lt 1 the equality constraint in equation 311 stillensures that the decision function is a thresholded density however in thatcase the density will be represented only by a subset of training examples(the SVs)mdashthose that are important for the decision (see equation 310) tobe taken Section 5 will explain the precise meaning of ordm

To conclude this section we note that one can also use balls to describe thedata in feature space close in spirit to the algorithms of Scholkopf Burgesand Vapnik (1995) with hard boundaries and Tax and Duin (1999) withldquosoft marginsrdquo Again we try to put most of the data into a small ball bysolving for ordm 2 (0 1)

minR2Rraquo2R`c2F

R2 C1

ordm`

X

iji

subject to kW (xi ) iexcl ck2 middot R2 C ji ji cedil 0 for i 2 [ ] (313)

This leads to the dual

minreg

X

ij

aiajk(xi xj) iexclX

iaik(xi xi) (314)

subject to 0 middot ai middot1

ordm`

X

iai D 1 (315)

and the solution

c DX

iaiW (xi) (316)

corresponding to a decision function of the form

f (x) D sgn

0

R2 iexclX

ij

aiajk(xi xj) C 2X

iaik(xi x) iexcl k(x x)

1

A (317)

Estimating the Support of a High-Dimensional Distribution 1449

Similar to the above R2 is computed such that for any xi with 0 lt ai lt1(ordm ) the argument of the sgn is zero

For kernels k(x y) that depend on only x iexcl y k(x x) is constant In thiscase the equality constraint implies that the linear term in the dual targetfunction is constant and the problem 314 and 315 turns out to be equiv-alent to equation 311 It can be shown that the same holds true for thedecision function hence the two algorithms coincide in that case This isgeometrically plausible For constant k(x x) all mapped patterns lie on asphere in feature space Therefore nding the smallest sphere (containingthe points) really amounts to nding the smallest segment of the spherethat the data live on The segment however can be found in a straightfor-ward way by simply intersecting the data sphere with a hyperplane thehyperplane with maximum margin of separation to the origin will cut offthe smallest segment

4 Optimization

Section 3 formulated quadratic programs (QPs) for computing regions thatcapture a certain fraction of the data These constrained optimization prob-lems can be solved using an off-the-shelf QP package to compute the solu-tion They do however possess features that set them apart from genericQPs most notably the simplicity of the constraints In this section we de-scribe an algorithm that takes advantage of these features and empiricallyscales better to large data set sizes than a standard QP solver with time com-plexity of order O

iexcl 3cent(cf Platt 1999) The algorithm is a modied version

of SMO (sequential minimal optimization) an SV training algorithm origi-nally proposed for classication (Platt 1999) and subsequently adapted toregression estimation (Smola amp Scholkopf in press)

The strategy of SMO is to break up the constrained minimization ofequation 311 into the smallest optimization steps possible Due to the con-straint on the sum of the dual variables it is impossible to modify individualvariables separately without possibly violating the constraint We thereforeresort to optimizing over pairs of variables

41 Elementary Optimization Step For instance consider optimizingover a1 and a2 with all other variables xed Using the shorthand Kij Dk(xi xj) equation 311 then reduces to

mina1 a2

12

2X

i jD1

aiajKij C2X

iD1

aiCi C C (41)

1450 Bernhard Scholkopf et al

with Ci DP`

jD3 ajKij and C DP`

i jD3 aiajKij subject to

0 middot a1 a2 middot1

ordm`

2X

iD1

ai D D (42)

where D D 1 iexclP`

iD3 ai We discard C which is independent of a1 and a2 and eliminate a1 to

obtain

mina2

12

(Diexcl a2 )2K11 C (Diexcl a2 )a2K12 C12

a22K22

C (Diexcl a2 )C1 C a2C2 (43)

with the derivative

iexcl (Diexcl a2 ) K11 C (Diexcl 2a2 )K12 C a2K22 iexcl C1 C C2 (44)

Setting this to zero and solving for a2 we get

a2 DD(K11 iexcl K12) C C1 iexcl C2

K11 C K22 iexcl 2K12 (45)

Once a2 is found a1 can be recovered from a1 D Diexcl a2 If the new point(a1 a2) is outside of [0 1 (ordm ) ] the constrained optimum is found by pro-jecting a2 from equation 45 into the region allowed by the constraints andthen recomputing a1

The offset r is recomputed after every such stepAdditional insight can be obtained by rewriting equation 45 in terms of

the outputs of the kernel expansion on the examples x1 and x2 before theoptimization step Let acurren

1 acurren2 denote the values of their Lagrange parameter

before the step Then the corresponding outputs (cf equation 310) read

Oi D K1iacurren1 C K2ia

curren2 C Ci (46)

Using the latter to eliminate the Ci we end up with an update equation fora2 which does not explicitly depend on acurren

1

a2 D acurren2 C

O1 iexcl O2

K11 C K22 iexcl 2K12 (47)

which shows that the update is essentially the fraction of rst and secondderivative of the objective function along the direction of ordm-constraint sat-isfaction

Clearly the same elementary optimization step can be applied to anypair of two variables not just a1 and a2 We next briey describe how to dothe overall optimization

Estimating the Support of a High-Dimensional Distribution 1451

42 Initialization of the Algorithm We start by setting a fraction ordm ofall ai randomly chosen to 1 (ordm ) If ordm` is not an integer then one of theexamples is set to a value in (0 1 (ordm ) ) to ensure that

Pi ai D 1 Moreover

we set the initial r to maxfOi i 2 [ ] ai gt 0g

43 Optimization Algorithm We then select a rst variable for the ele-mentary optimization step in one of the two following ways Here we usethe shorthand SVnb for the indices of variables that are not at bound that isSVnb D fi i 2 [ ] 0 lt ai lt 1(ordm )g At the end these correspond to pointsthat will sit exactly on the hyperplane and that will therefore have a stronginuence on its precise position

We scan over the entire data set2 until we nd a variable violating aKarush-kuhn-Tucker (KKT) condition (Bertsekas 1995) that is a point suchthat (Oi iexclr ) cent ai gt 0 or (r iexclOi) cent (1(ordm ) iexcl ai) gt 0 Once we have found onesay ai we pick aj according to

j D arg maxn2SVnb

|Oi iexcl On | (48)

We repeat that step but the scan is performed only over SVnbIn practice one scan of the rst type is followed by multiple scans of

the second type until there are no KKT violators in SVnb whereupon theoptimization goes back to a single scan of the rst type If it nds no KKTviolators the optimization terminates

In unusual circumstances the choice heuristic equation 48 cannot makepositive progress Therefore a hierarchy of other choice heuristics is appliedto ensure positive progress These other heuristics are the same as in the caseof pattern recognition (cf Platt 1999) and have been found to work well inthe experiments we report below

In our experiments with SMO applied to distribution support estima-tion we have always found it to converge However to ensure convergenceeven in rare pathological conditions the algorithm can be modied slightly(Keerthi Shevade Bhattacharyya amp Murthy 1999)

We end this section by stating a trick that is of importance in practicalimplementations In practice one has to use a nonzero accuracy tolerancewhen checking whether two quantities are equal In particular comparisonsof this type are used in determining whether a point lies on the margin Sincewe want the nal decision function to evaluate to 1 for points that lie on themargin we need to subtract this constant from the offset r at the end

In the next section it will be argued that subtracting something from r

is actually advisable also from a statistical point of view

2 This scan can be accelerated by not checking patterns that are on the correct side ofthe hyperplane by a large margin using the method of Joachims (1999)

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 6: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1448 Bernhard Scholkopf et al

Note that if ordm approaches 0 the upper boundaries on the Lagrange mul-tipliers tend to innity that is the second inequality constraint in equa-tion 311 becomes void The problem then resembles the correspondinghard margin algorithm since the penalization of errors becomes innite ascan be seen from the primal objective function (see equation 34) It is still afeasible problem since we have placed no restriction on the offset r so it canbecome a large negative number in order to satisfy equation 35 If we hadrequired r cedil 0 from the start we would have ended up with the constraintP

i ai cedil 1 instead of the corresponding equality constraint in equation 311and the multipliers ai could have diverged

It is instructive to compare equation 311 to a Parzen windows estimatorTo this end suppose we use a kernel that can be normalized as a density ininput space such as the gaussian (see equation 33) If we useordm D 1 then thetwo constraints only allow the solution a1 D cent cent cent D a` D 1 Thus the kernelexpansion in equation 310 reduces to a Parzen windows estimate of theunderlying density For ordm lt 1 the equality constraint in equation 311 stillensures that the decision function is a thresholded density however in thatcase the density will be represented only by a subset of training examples(the SVs)mdashthose that are important for the decision (see equation 310) tobe taken Section 5 will explain the precise meaning of ordm

To conclude this section we note that one can also use balls to describe thedata in feature space close in spirit to the algorithms of Scholkopf Burgesand Vapnik (1995) with hard boundaries and Tax and Duin (1999) withldquosoft marginsrdquo Again we try to put most of the data into a small ball bysolving for ordm 2 (0 1)

minR2Rraquo2R`c2F

R2 C1

ordm`

X

iji

subject to kW (xi ) iexcl ck2 middot R2 C ji ji cedil 0 for i 2 [ ] (313)

This leads to the dual

minreg

X

ij

aiajk(xi xj) iexclX

iaik(xi xi) (314)

subject to 0 middot ai middot1

ordm`

X

iai D 1 (315)

and the solution

c DX

iaiW (xi) (316)

corresponding to a decision function of the form

f (x) D sgn

0

R2 iexclX

ij

aiajk(xi xj) C 2X

iaik(xi x) iexcl k(x x)

1

A (317)

Estimating the Support of a High-Dimensional Distribution 1449

Similar to the above R2 is computed such that for any xi with 0 lt ai lt1(ordm ) the argument of the sgn is zero

For kernels k(x y) that depend on only x iexcl y k(x x) is constant In thiscase the equality constraint implies that the linear term in the dual targetfunction is constant and the problem 314 and 315 turns out to be equiv-alent to equation 311 It can be shown that the same holds true for thedecision function hence the two algorithms coincide in that case This isgeometrically plausible For constant k(x x) all mapped patterns lie on asphere in feature space Therefore nding the smallest sphere (containingthe points) really amounts to nding the smallest segment of the spherethat the data live on The segment however can be found in a straightfor-ward way by simply intersecting the data sphere with a hyperplane thehyperplane with maximum margin of separation to the origin will cut offthe smallest segment

4 Optimization

Section 3 formulated quadratic programs (QPs) for computing regions thatcapture a certain fraction of the data These constrained optimization prob-lems can be solved using an off-the-shelf QP package to compute the solu-tion They do however possess features that set them apart from genericQPs most notably the simplicity of the constraints In this section we de-scribe an algorithm that takes advantage of these features and empiricallyscales better to large data set sizes than a standard QP solver with time com-plexity of order O

iexcl 3cent(cf Platt 1999) The algorithm is a modied version

of SMO (sequential minimal optimization) an SV training algorithm origi-nally proposed for classication (Platt 1999) and subsequently adapted toregression estimation (Smola amp Scholkopf in press)

The strategy of SMO is to break up the constrained minimization ofequation 311 into the smallest optimization steps possible Due to the con-straint on the sum of the dual variables it is impossible to modify individualvariables separately without possibly violating the constraint We thereforeresort to optimizing over pairs of variables

41 Elementary Optimization Step For instance consider optimizingover a1 and a2 with all other variables xed Using the shorthand Kij Dk(xi xj) equation 311 then reduces to

mina1 a2

12

2X

i jD1

aiajKij C2X

iD1

aiCi C C (41)

1450 Bernhard Scholkopf et al

with Ci DP`

jD3 ajKij and C DP`

i jD3 aiajKij subject to

0 middot a1 a2 middot1

ordm`

2X

iD1

ai D D (42)

where D D 1 iexclP`

iD3 ai We discard C which is independent of a1 and a2 and eliminate a1 to

obtain

mina2

12

(Diexcl a2 )2K11 C (Diexcl a2 )a2K12 C12

a22K22

C (Diexcl a2 )C1 C a2C2 (43)

with the derivative

iexcl (Diexcl a2 ) K11 C (Diexcl 2a2 )K12 C a2K22 iexcl C1 C C2 (44)

Setting this to zero and solving for a2 we get

a2 DD(K11 iexcl K12) C C1 iexcl C2

K11 C K22 iexcl 2K12 (45)

Once a2 is found a1 can be recovered from a1 D Diexcl a2 If the new point(a1 a2) is outside of [0 1 (ordm ) ] the constrained optimum is found by pro-jecting a2 from equation 45 into the region allowed by the constraints andthen recomputing a1

The offset r is recomputed after every such stepAdditional insight can be obtained by rewriting equation 45 in terms of

the outputs of the kernel expansion on the examples x1 and x2 before theoptimization step Let acurren

1 acurren2 denote the values of their Lagrange parameter

before the step Then the corresponding outputs (cf equation 310) read

Oi D K1iacurren1 C K2ia

curren2 C Ci (46)

Using the latter to eliminate the Ci we end up with an update equation fora2 which does not explicitly depend on acurren

1

a2 D acurren2 C

O1 iexcl O2

K11 C K22 iexcl 2K12 (47)

which shows that the update is essentially the fraction of rst and secondderivative of the objective function along the direction of ordm-constraint sat-isfaction

Clearly the same elementary optimization step can be applied to anypair of two variables not just a1 and a2 We next briey describe how to dothe overall optimization

Estimating the Support of a High-Dimensional Distribution 1451

42 Initialization of the Algorithm We start by setting a fraction ordm ofall ai randomly chosen to 1 (ordm ) If ordm` is not an integer then one of theexamples is set to a value in (0 1 (ordm ) ) to ensure that

Pi ai D 1 Moreover

we set the initial r to maxfOi i 2 [ ] ai gt 0g

43 Optimization Algorithm We then select a rst variable for the ele-mentary optimization step in one of the two following ways Here we usethe shorthand SVnb for the indices of variables that are not at bound that isSVnb D fi i 2 [ ] 0 lt ai lt 1(ordm )g At the end these correspond to pointsthat will sit exactly on the hyperplane and that will therefore have a stronginuence on its precise position

We scan over the entire data set2 until we nd a variable violating aKarush-kuhn-Tucker (KKT) condition (Bertsekas 1995) that is a point suchthat (Oi iexclr ) cent ai gt 0 or (r iexclOi) cent (1(ordm ) iexcl ai) gt 0 Once we have found onesay ai we pick aj according to

j D arg maxn2SVnb

|Oi iexcl On | (48)

We repeat that step but the scan is performed only over SVnbIn practice one scan of the rst type is followed by multiple scans of

the second type until there are no KKT violators in SVnb whereupon theoptimization goes back to a single scan of the rst type If it nds no KKTviolators the optimization terminates

In unusual circumstances the choice heuristic equation 48 cannot makepositive progress Therefore a hierarchy of other choice heuristics is appliedto ensure positive progress These other heuristics are the same as in the caseof pattern recognition (cf Platt 1999) and have been found to work well inthe experiments we report below

In our experiments with SMO applied to distribution support estima-tion we have always found it to converge However to ensure convergenceeven in rare pathological conditions the algorithm can be modied slightly(Keerthi Shevade Bhattacharyya amp Murthy 1999)

We end this section by stating a trick that is of importance in practicalimplementations In practice one has to use a nonzero accuracy tolerancewhen checking whether two quantities are equal In particular comparisonsof this type are used in determining whether a point lies on the margin Sincewe want the nal decision function to evaluate to 1 for points that lie on themargin we need to subtract this constant from the offset r at the end

In the next section it will be argued that subtracting something from r

is actually advisable also from a statistical point of view

2 This scan can be accelerated by not checking patterns that are on the correct side ofthe hyperplane by a large margin using the method of Joachims (1999)

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 7: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1449

Similar to the above R2 is computed such that for any xi with 0 lt ai lt1(ordm ) the argument of the sgn is zero

For kernels k(x y) that depend on only x iexcl y k(x x) is constant In thiscase the equality constraint implies that the linear term in the dual targetfunction is constant and the problem 314 and 315 turns out to be equiv-alent to equation 311 It can be shown that the same holds true for thedecision function hence the two algorithms coincide in that case This isgeometrically plausible For constant k(x x) all mapped patterns lie on asphere in feature space Therefore nding the smallest sphere (containingthe points) really amounts to nding the smallest segment of the spherethat the data live on The segment however can be found in a straightfor-ward way by simply intersecting the data sphere with a hyperplane thehyperplane with maximum margin of separation to the origin will cut offthe smallest segment

4 Optimization

Section 3 formulated quadratic programs (QPs) for computing regions thatcapture a certain fraction of the data These constrained optimization prob-lems can be solved using an off-the-shelf QP package to compute the solu-tion They do however possess features that set them apart from genericQPs most notably the simplicity of the constraints In this section we de-scribe an algorithm that takes advantage of these features and empiricallyscales better to large data set sizes than a standard QP solver with time com-plexity of order O

iexcl 3cent(cf Platt 1999) The algorithm is a modied version

of SMO (sequential minimal optimization) an SV training algorithm origi-nally proposed for classication (Platt 1999) and subsequently adapted toregression estimation (Smola amp Scholkopf in press)

The strategy of SMO is to break up the constrained minimization ofequation 311 into the smallest optimization steps possible Due to the con-straint on the sum of the dual variables it is impossible to modify individualvariables separately without possibly violating the constraint We thereforeresort to optimizing over pairs of variables

41 Elementary Optimization Step For instance consider optimizingover a1 and a2 with all other variables xed Using the shorthand Kij Dk(xi xj) equation 311 then reduces to

mina1 a2

12

2X

i jD1

aiajKij C2X

iD1

aiCi C C (41)

1450 Bernhard Scholkopf et al

with Ci DP`

jD3 ajKij and C DP`

i jD3 aiajKij subject to

0 middot a1 a2 middot1

ordm`

2X

iD1

ai D D (42)

where D D 1 iexclP`

iD3 ai We discard C which is independent of a1 and a2 and eliminate a1 to

obtain

mina2

12

(Diexcl a2 )2K11 C (Diexcl a2 )a2K12 C12

a22K22

C (Diexcl a2 )C1 C a2C2 (43)

with the derivative

iexcl (Diexcl a2 ) K11 C (Diexcl 2a2 )K12 C a2K22 iexcl C1 C C2 (44)

Setting this to zero and solving for a2 we get

a2 DD(K11 iexcl K12) C C1 iexcl C2

K11 C K22 iexcl 2K12 (45)

Once a2 is found a1 can be recovered from a1 D Diexcl a2 If the new point(a1 a2) is outside of [0 1 (ordm ) ] the constrained optimum is found by pro-jecting a2 from equation 45 into the region allowed by the constraints andthen recomputing a1

The offset r is recomputed after every such stepAdditional insight can be obtained by rewriting equation 45 in terms of

the outputs of the kernel expansion on the examples x1 and x2 before theoptimization step Let acurren

1 acurren2 denote the values of their Lagrange parameter

before the step Then the corresponding outputs (cf equation 310) read

Oi D K1iacurren1 C K2ia

curren2 C Ci (46)

Using the latter to eliminate the Ci we end up with an update equation fora2 which does not explicitly depend on acurren

1

a2 D acurren2 C

O1 iexcl O2

K11 C K22 iexcl 2K12 (47)

which shows that the update is essentially the fraction of rst and secondderivative of the objective function along the direction of ordm-constraint sat-isfaction

Clearly the same elementary optimization step can be applied to anypair of two variables not just a1 and a2 We next briey describe how to dothe overall optimization

Estimating the Support of a High-Dimensional Distribution 1451

42 Initialization of the Algorithm We start by setting a fraction ordm ofall ai randomly chosen to 1 (ordm ) If ordm` is not an integer then one of theexamples is set to a value in (0 1 (ordm ) ) to ensure that

Pi ai D 1 Moreover

we set the initial r to maxfOi i 2 [ ] ai gt 0g

43 Optimization Algorithm We then select a rst variable for the ele-mentary optimization step in one of the two following ways Here we usethe shorthand SVnb for the indices of variables that are not at bound that isSVnb D fi i 2 [ ] 0 lt ai lt 1(ordm )g At the end these correspond to pointsthat will sit exactly on the hyperplane and that will therefore have a stronginuence on its precise position

We scan over the entire data set2 until we nd a variable violating aKarush-kuhn-Tucker (KKT) condition (Bertsekas 1995) that is a point suchthat (Oi iexclr ) cent ai gt 0 or (r iexclOi) cent (1(ordm ) iexcl ai) gt 0 Once we have found onesay ai we pick aj according to

j D arg maxn2SVnb

|Oi iexcl On | (48)

We repeat that step but the scan is performed only over SVnbIn practice one scan of the rst type is followed by multiple scans of

the second type until there are no KKT violators in SVnb whereupon theoptimization goes back to a single scan of the rst type If it nds no KKTviolators the optimization terminates

In unusual circumstances the choice heuristic equation 48 cannot makepositive progress Therefore a hierarchy of other choice heuristics is appliedto ensure positive progress These other heuristics are the same as in the caseof pattern recognition (cf Platt 1999) and have been found to work well inthe experiments we report below

In our experiments with SMO applied to distribution support estima-tion we have always found it to converge However to ensure convergenceeven in rare pathological conditions the algorithm can be modied slightly(Keerthi Shevade Bhattacharyya amp Murthy 1999)

We end this section by stating a trick that is of importance in practicalimplementations In practice one has to use a nonzero accuracy tolerancewhen checking whether two quantities are equal In particular comparisonsof this type are used in determining whether a point lies on the margin Sincewe want the nal decision function to evaluate to 1 for points that lie on themargin we need to subtract this constant from the offset r at the end

In the next section it will be argued that subtracting something from r

is actually advisable also from a statistical point of view

2 This scan can be accelerated by not checking patterns that are on the correct side ofthe hyperplane by a large margin using the method of Joachims (1999)

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 8: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1450 Bernhard Scholkopf et al

with Ci DP`

jD3 ajKij and C DP`

i jD3 aiajKij subject to

0 middot a1 a2 middot1

ordm`

2X

iD1

ai D D (42)

where D D 1 iexclP`

iD3 ai We discard C which is independent of a1 and a2 and eliminate a1 to

obtain

mina2

12

(Diexcl a2 )2K11 C (Diexcl a2 )a2K12 C12

a22K22

C (Diexcl a2 )C1 C a2C2 (43)

with the derivative

iexcl (Diexcl a2 ) K11 C (Diexcl 2a2 )K12 C a2K22 iexcl C1 C C2 (44)

Setting this to zero and solving for a2 we get

a2 DD(K11 iexcl K12) C C1 iexcl C2

K11 C K22 iexcl 2K12 (45)

Once a2 is found a1 can be recovered from a1 D Diexcl a2 If the new point(a1 a2) is outside of [0 1 (ordm ) ] the constrained optimum is found by pro-jecting a2 from equation 45 into the region allowed by the constraints andthen recomputing a1

The offset r is recomputed after every such stepAdditional insight can be obtained by rewriting equation 45 in terms of

the outputs of the kernel expansion on the examples x1 and x2 before theoptimization step Let acurren

1 acurren2 denote the values of their Lagrange parameter

before the step Then the corresponding outputs (cf equation 310) read

Oi D K1iacurren1 C K2ia

curren2 C Ci (46)

Using the latter to eliminate the Ci we end up with an update equation fora2 which does not explicitly depend on acurren

1

a2 D acurren2 C

O1 iexcl O2

K11 C K22 iexcl 2K12 (47)

which shows that the update is essentially the fraction of rst and secondderivative of the objective function along the direction of ordm-constraint sat-isfaction

Clearly the same elementary optimization step can be applied to anypair of two variables not just a1 and a2 We next briey describe how to dothe overall optimization

Estimating the Support of a High-Dimensional Distribution 1451

42 Initialization of the Algorithm We start by setting a fraction ordm ofall ai randomly chosen to 1 (ordm ) If ordm` is not an integer then one of theexamples is set to a value in (0 1 (ordm ) ) to ensure that

Pi ai D 1 Moreover

we set the initial r to maxfOi i 2 [ ] ai gt 0g

43 Optimization Algorithm We then select a rst variable for the ele-mentary optimization step in one of the two following ways Here we usethe shorthand SVnb for the indices of variables that are not at bound that isSVnb D fi i 2 [ ] 0 lt ai lt 1(ordm )g At the end these correspond to pointsthat will sit exactly on the hyperplane and that will therefore have a stronginuence on its precise position

We scan over the entire data set2 until we nd a variable violating aKarush-kuhn-Tucker (KKT) condition (Bertsekas 1995) that is a point suchthat (Oi iexclr ) cent ai gt 0 or (r iexclOi) cent (1(ordm ) iexcl ai) gt 0 Once we have found onesay ai we pick aj according to

j D arg maxn2SVnb

|Oi iexcl On | (48)

We repeat that step but the scan is performed only over SVnbIn practice one scan of the rst type is followed by multiple scans of

the second type until there are no KKT violators in SVnb whereupon theoptimization goes back to a single scan of the rst type If it nds no KKTviolators the optimization terminates

In unusual circumstances the choice heuristic equation 48 cannot makepositive progress Therefore a hierarchy of other choice heuristics is appliedto ensure positive progress These other heuristics are the same as in the caseof pattern recognition (cf Platt 1999) and have been found to work well inthe experiments we report below

In our experiments with SMO applied to distribution support estima-tion we have always found it to converge However to ensure convergenceeven in rare pathological conditions the algorithm can be modied slightly(Keerthi Shevade Bhattacharyya amp Murthy 1999)

We end this section by stating a trick that is of importance in practicalimplementations In practice one has to use a nonzero accuracy tolerancewhen checking whether two quantities are equal In particular comparisonsof this type are used in determining whether a point lies on the margin Sincewe want the nal decision function to evaluate to 1 for points that lie on themargin we need to subtract this constant from the offset r at the end

In the next section it will be argued that subtracting something from r

is actually advisable also from a statistical point of view

2 This scan can be accelerated by not checking patterns that are on the correct side ofthe hyperplane by a large margin using the method of Joachims (1999)

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 9: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1451

42 Initialization of the Algorithm We start by setting a fraction ordm ofall ai randomly chosen to 1 (ordm ) If ordm` is not an integer then one of theexamples is set to a value in (0 1 (ordm ) ) to ensure that

Pi ai D 1 Moreover

we set the initial r to maxfOi i 2 [ ] ai gt 0g

43 Optimization Algorithm We then select a rst variable for the ele-mentary optimization step in one of the two following ways Here we usethe shorthand SVnb for the indices of variables that are not at bound that isSVnb D fi i 2 [ ] 0 lt ai lt 1(ordm )g At the end these correspond to pointsthat will sit exactly on the hyperplane and that will therefore have a stronginuence on its precise position

We scan over the entire data set2 until we nd a variable violating aKarush-kuhn-Tucker (KKT) condition (Bertsekas 1995) that is a point suchthat (Oi iexclr ) cent ai gt 0 or (r iexclOi) cent (1(ordm ) iexcl ai) gt 0 Once we have found onesay ai we pick aj according to

j D arg maxn2SVnb

|Oi iexcl On | (48)

We repeat that step but the scan is performed only over SVnbIn practice one scan of the rst type is followed by multiple scans of

the second type until there are no KKT violators in SVnb whereupon theoptimization goes back to a single scan of the rst type If it nds no KKTviolators the optimization terminates

In unusual circumstances the choice heuristic equation 48 cannot makepositive progress Therefore a hierarchy of other choice heuristics is appliedto ensure positive progress These other heuristics are the same as in the caseof pattern recognition (cf Platt 1999) and have been found to work well inthe experiments we report below

In our experiments with SMO applied to distribution support estima-tion we have always found it to converge However to ensure convergenceeven in rare pathological conditions the algorithm can be modied slightly(Keerthi Shevade Bhattacharyya amp Murthy 1999)

We end this section by stating a trick that is of importance in practicalimplementations In practice one has to use a nonzero accuracy tolerancewhen checking whether two quantities are equal In particular comparisonsof this type are used in determining whether a point lies on the margin Sincewe want the nal decision function to evaluate to 1 for points that lie on themargin we need to subtract this constant from the offset r at the end

In the next section it will be argued that subtracting something from r

is actually advisable also from a statistical point of view

2 This scan can be accelerated by not checking patterns that are on the correct side ofthe hyperplane by a large margin using the method of Joachims (1999)

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 10: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1452 Bernhard Scholkopf et al

5 Theory

We now analyze the algorithm theoretically starting with the uniqueness ofthe hyperplane (proposition 1) We then describe the connection to patternrecognition (proposition 2) and show that the parameter ordm characterizesthe fractions of SVs and outliers (proposition 3) Following that we give arobustness result for the soft margin (proposition 4) and nally we present atheoretical result on the generalization error (theorem 1) Some of the proofsare given in appendix B

In this section we use italic letters to denote the feature space images ofthe corresponding patterns in input space that is

xi D W (xi) (51)

Denition 1 A data set

x1 x` (52)

is called separable if there exists some w 2 F such that (w cent xi ) gt 0 for i 2 [ ]

If we use a gaussian kernel (see equation 33) then any data set x1 x`

is separable after it is mapped into feature space To see this note thatk(xi xj) gt 0 for all i j thus all inner products between mapped patterns arepositive implying that all patterns lie inside the same orthant Moreoversince k(xi xi) D 1 forall i they all have unit length Hence they are separablefrom the origin

Proposition 1 (supporting hyperplane) If the data set equation 52 is sepa-rable then there exists a unique supporting hyperplane with the properties that (1)it separates all data from the origin and (2) its distance to the origin is maximalamong all such hyperplanes For any r gt 0 it is given by

minw2F

12

kwk2 subject to (w cent xi) cedil r i 2 [ ] (53)

The following result elucidates the relationship between single-class clas-sication and binary classication

Proposition 2 (connection to pattern recognition) (i) Suppose (w r ) param-eterizes the supporting hyperplane for the data in equation 52 Then (w 0) param-eterizes the optimal separating hyperplane for the labeled data set

f(x1 1) (x` 1) (iexclx1 iexcl1) (iexclx iexcl1)g (54)

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 11: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1453

(ii) Suppose (w 0) parameterizes the optimal separating hyperplane passingthrough the origin for a labeled data set

copyiexclx1 y1

cent

iexclx` y`

centordf

iexclyi 2 fsect1g for i 2 [ ]

cent (55)

aligned such that (w cent xi) is positive for yi D 1 Suppose moreover that r kwk isthe margin of the optimal hyperplane Then (w r ) parameterizes the supportinghyperplane for the unlabeled data set

copyy1x1 y`x`

ordf (56)

Note that the relationship is similar for nonseparable problems In thatcase margin errors in binary classication (points that are either on thewrong side of the separating hyperplane or fall inside the margin) translateinto outliers in single-class classication (points that fall on the wrong sideof the hyperplane)Proposition 2 thenholds cum grano salis for the trainingsets with margin errors and outliers respectively removed

The utility ofproposition2 lies in the fact that it allows us to recycle certainresults proven for binary classication (Scholkopf Smola Williamson ampBartlett 2000) for use in the single-class scenario The following explainingthe signicance of the parameter ordm is such a case

Proposition 3 ordm-property Assume the solution of equation 34 and 35 satisesr 6D 0 The following statements hold

i ordm is an upper bound on the fraction of outliers that is training points outsidethe estimated region

ii ordm is a lower bound on the fraction of SVs

iii Suppose the data (see equation 52) were generated independently from adistribution P(x) which does not contain discrete components Supposemoreover that the kernel is analytic and nonconstant With probability 1asymptotically ordm equals both the fraction of SVs and the fraction of outliers

Note that this result also applies to the soft margin ball algorithm of Taxand Duin (1999) provided that it is stated in the ordm-parameterization givenin section 3

Proposition 4 (resistance) Local movements of outliers parallel to w do notchange the hyperplane

Note that although the hyperplane does not change its parameterization inw and r does

We now move on to the subject of generalization Our goal is to bound theprobability that a novel point drawn from the same underlying distribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 12: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1454 Bernhard Scholkopf et al

lies outside the estimated region We present a ldquomarginalizedrdquo analysis thatin fact provides a bound on the probability that a novel point lies outsidethe region slightly larger than the estimated one

Denition 2 Let f be a real-valued function on a space X Fix h 2 R Forx 2 X let d(x f h ) D maxf0 h iexcl f (x)g Similarly for a training sequence X D(x1 x`) we dene

D (X f h ) DX

x2X

d(x f h )

In the following log denotes logarithms to base 2 and ln denotes naturallogarithms

Theorem 1 (generalization error bound) Suppose we are given a set of ` ex-amples X 2 X ` generated iid from an unknown distribution P which does notcontain discrete components Suppose moreover that we solve the optimizationproblem equations 34 and 35 (or equivalently equation 311) and obtain a solu-tion fw given explicitly by equation (310) Let Rwr D fx fw(x) cedil rg denote theinduced decision region With probability 1 iexcld over the draw of the random sampleX 2 X ` for any c gt 0

Pcopyx0 x0 62 Rwr iexclc

ordfmiddot

2`

sup3k C log

2

2d

acute (57)

where

k Dc1 log

iexclc2 Oc 2`

cent

Oc 2C

2DOc

logsup3

esup3

(2`iexcl 1) Oc2D

C 1acuteacute

C 2 (58)

c1 D 16c2 c2 D ln(2) iexcl4c2cent

c D 103 Oc D c kwk D D D(X fw0 r ) DD(X fwr 0) and r is given by equation (312)

The training sample X denes (via the algorithm) the decision regionRwr We expect that new points generated according to P will lie in Rwr The theorem gives a probabilistic guarantee that new points lie in the largerregion Rwr iexclc

The parameter ordm can be adjusted when running the algorithm to tradeoff incorporating outliers versus minimizing the ldquosizerdquo of Rwr Adjustingordm will change the value of D Note that since D is measured with respect tor while the bound applies to r iexclc any point that is outside the region thatthe bound applies to will make a contribution to D that is bounded awayfrom 0 Therefore equation 57 does not imply that asymptotically we willalways estimate the complete support

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 13: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1455

The parameter c allows one to trade off the condence with which onewishes the assertion of the theorem to hold against the size of the predictiveregion Rwriexclc one can see from equation 58 that k and hence the right-handside of equation 57 scales inversely with c In fact it scales inversely withOc that is it increases with w This justies measuring the complexity of theestimated region by the size of w and minimizing kwk2 in order to nd aregion that will generalize well In addition the theorem suggests not to usethe offset r returned by the algorithm which would correspond to c D 0but a smaller value r iexcl c (with c gt 0)

We do not claim that using theorem 1 directly is a practical means todetermine the parameters ordm and c explicitly It is loose in several ways Wesuspect c is too large by a factor of more than 50 Furthermore no accountis taken of the smoothness of the kernel used If that were done (by usingrened bounds on the covering numbers of the induced class of functionsas in Williamson Smola and Scholkopf (1998)) then the rst term in equa-tion 58 would increase much more slowly when decreasing c The fact thatthe second term would not change indicates a different trade-off point Nev-ertheless the theorem gives one some condence that ordm and c are suitableparameters to adjust

6 Experiments

We apply the method to articial and real-world data Figure 1 displaystwo-dimensional (2D) toy examples and shows how the parameter settingsinuence the solution Figure 2 shows a comparison to a Parzen windowsestimator on a 2D problem along with a family of estimators that lie ldquoinbetweenrdquo the present one and the Parzen one

Figure 3 shows a plot of the outputs (w centW (x) ) on training and test sets ofthe US Postal Service (USPS) database of handwritten digits The databasecontains 9298 digit images of size 16 pound 16 D 256 the last 2007 constitute thetest set We used a gaussian kernel (see equation 33) that has the advan-tage that the data are always separable from the origin in feature space (cfthe comment following denition 1) For the kernel parameter c we used05 cent256 This value was chosen a priori it is a common value for SVM classi-ers on that data set (cf Scholkopf et al 1995)3 We fed our algorithm withthe training instances of digit 0 only Testing was done on both digit 0 andall other digits We present results for two values of ordm one large one smallfor values in between the results are qualitatively similar In the rst exper-iment we used ordm D 50 thus aiming for a description of ldquo0-nessrdquo which

3 Hayton Scholkopf Tarassenko and Anuzis (in press) use the following procedureto determine a value of c For small c all training points will become SVs the algorithmjust memorizes the data and will not generalize well As c increases the number of SVsinitially drops As a simple heuristic one can thus start with a small value of c and increaseit until the number of SVs does not decrease any further

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 14: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1456 Bernhard Scholkopf et al

ordm width c 05 05 05 05 01 05 05 01frac SVsOLs 054 043 059 047 024 003 065 038margin r kwk 084 070 062 048

Figure 1 (First two pictures) A single-class SVM applied to two toy problemsordm D c D 05 domain [iexcl1 1]2 In both cases at least a fraction of ordmof all examplesis in the estimated region (cf Table 1) The large value of ordm causes the additionaldata points in the upper left corner to have almost no inuence on the decisionfunction For smaller values of ordm such as 01 (third picture) the points cannotbe ignored anymore Alternatively one can force the algorithm to take theseoutliers (OLs) into account by changing the kernel width (see equation 33) Inthe fourth picture using c D 01 ordm D 05 the data are effectively analyzed ona different length scale which leads the algorithm to consider the outliers asmeaningful points

Figure 2 A single-class SVM applied to a toy problem c D 05 domain [iexcl1 1]2 for various settings of the offset r As discussed in section 3 ordm D 1 yields aParzen windows expansion However to get a Parzen windows estimator ofthe distributionrsquos support we must in that case not use the offset returnedby the algorithm (which would allow all points to lie outside the estimatedregion) Therefore in this experiment we adjusted the offset such that a fractionordm0 D 01 of patterns would lie outside From left to right we show the results forordm 2 f01 02 04 1g The right-most picture corresponds to the Parzen estimatorthat uses all kernels the other estimators use roughly a fraction of ordm kernelsNote that as a result of the averaging over all kernels the Parzen windowsestimate does not model the shape of the distribution very well for the chosenparameters

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 15: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1457

test train other offset

test train other offset

Figure 3 Experiments on the US Postal Service OCR data set Recognizer fordigit 0 output histogram for the exemplars of 0 in the trainingtest set and ontest exemplars of other digits The x-axis gives the output values that is theargument of the sgn function in equation 310 For ordm D 50 (top) we get 50SVs and 49 outliers (consistent with proposition 3 ) 44 true positive testexamples and zero false positives from the ldquoother rdquo class For ordm D 5 (bottom)we get 6 and 4 for SVs and outliers respectively In that case the true positiverate is improved to 91 while the false-positive rate increases to 7 The offsetr is marked in the graphs Note nally that the plots show a Parzen windowsdensity estimate of the output histograms In reality many examples sit exactlyat the threshold value (the nonbound SVs) Since this peak is smoothed out bythe estimator the fractions of outliers in the training set appear slightly largerthan it should be

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 16: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1458 Bernhard Scholkopf et al

6 9 2 8 1 8 8 6 5 3

2 3 8 7 0 3 0 8 2 7

Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels

captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7

Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers

In the last experiment we tested the run-time scaling behavior of theproposed SMO solver which is used for training the learning machine (seeFigure 6) It was found to depend on the value of ordm used For the smallvalues of ordm that are typically used in outlier detection tasks the algorithmscales very well to larger data sets with a dependency of training times onthe sample size which is at most quadratic

In addition to the experiments reported above the algorithm has since

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 17: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1459

9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162

3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48

Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled

Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007

Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)

1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76

10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349

Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 18: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1460 Bernhard Scholkopf et al

Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient

been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)

7 Discussion

One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 19: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1461

than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects

Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel

Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly

An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize

X

i j

aiajkiexclxi xj

centD

X

i j

aiajiexclk (xi cent) cent dxj (cent)

cent

DX

i j

aiajiexclk (xi cent) cent

iexclPcurrenPk

cent iexclxj cent

centcent

DX

i j

aiajiexcliexcl

Pkcent

(xi cent) centiexclPk

cent iexclxj cent

centcent

D

0

sup3

PX

iaik

acute(xi cent) cent

0

PX

j

ajk

1

A iexclxj cent

cent1

A

D kPfk2

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 20: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1462 Bernhard Scholkopf et al

using f (x) DP

i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2

on the function spaceInterestingly as the minimization of the dual objective function also cor-

responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin

The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days

From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space

The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers

We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 21: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1463

possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research

Appendix A Supplementary Material for Section 2

A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply

OC` D[

iD1

B(xi 2 `) (A1)

where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D

copyx Op`(x) gt 0

ordfwhere

Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed

OCplug-inb D

copyx Op`(x) gt b`

ordf where (b`)` is an appropriately chosen se-

quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator

The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)

A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)

Dene the excess mass over C at level a as

E C (a) D sup fHa(C) C 2 C g

where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that

E C (a) D Ha (C C (a) )

is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is

C C (a) D arg max f(P` iexcl al)(C) C 2 C g

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 22: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1464 Bernhard Scholkopf et al

where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a

cp (a) D fx p(x) cedil ag

Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes

Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl

RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set

A( )2 with respect to p by

A( )2 D

copy(x1 x`) 2 S` | iexcl 1

`log p(x1 x`) iexcl h(X) | middot 2

ordf

where p(x1 x`) DQ`

iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`

D b` means lim`11` log a`

b`D 0 Cover and Thomas (1991) show that

for all 2 d lt 12 then

vol A( )2

D vol C`(1 iexcld ) D 2 h

They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)

Appendix B Proofs of Section 5

Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)

Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53

Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 23: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1465

optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)

Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)

Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r

(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]

Proof Sketch (Proposition 3) When changing r the termP

i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)

Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0

o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0

o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0

i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0

o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution

We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R

Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)

Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 24: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1466 Bernhard Scholkopf et al

The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F

k f klX1D max

x2X| f (x) | (B2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )

We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)

Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small

Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0

Praquo

x f (x) lt mini2[ ]

f (xi) iexcl 2c

frac14middot 2 ( k d) D 2

`(k C log `

d)

where k D dlog N (c F 2 )e

The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument

Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here

We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 25: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1467

Let X be an inner product space The following denition introduces anew inner product space based on X

Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by

f cent g DX

x2supp( f )

f (x)g(x)

The 1-norm on L( X ) is dened by k fk1 DP

x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby

dx(y) Draquo

1 if y D xI0 otherwise

For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by

gf (y) D gXhf (y) D

X

x2X

d(x f h )dx(y)

Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh

f 2LB ( X ) (ie

Px2X d(x f h ) middot B)

Pcopyx f (x) lt h iexcl 2c

ordfmiddot 2

`(k C log `

d) (B3)

where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )

uml

The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)

The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions

The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 26: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1468 Bernhard Scholkopf et al

log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples

In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article

Acknowledgments

This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions

References

Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182

Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for

optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press

Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997

Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488

Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 27: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1469

Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46

Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210

Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480

Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270

Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13

Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press

Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore

Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag

Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg

Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371

Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press

Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)

Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81

Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881

Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24

Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339

Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press

Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 28: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

1470 Bernhard Scholkopf et al

Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87

Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research

Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress

Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245

Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany

Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag

Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted

Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag

Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing

Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649

Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning

Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press

Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76

Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge

Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto

Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969

Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag

Vapnik V (1998) Statistical learning theory New York Wiley

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000

Page 29: Estimating the Support of a High-Dimensional Distributionusers.cecs.anu.edu.au/~williams/papers/P132.pdf · 2001-08-21 · ARTICLE CommunicatedbyVladimirVapnik EstimatingtheSupportofaHigh-DimensionalDistribution

Estimating the Support of a High-Dimensional Distribution 1471

Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press

Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)

Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24

Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)

Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman

Received November 19 1999 accepted September 22 2000