ARTICLE Communicated by Vladimir Vapnik
Estimating the Support of a High-Dimensional Distribution
Bernhard Schölkopf, Microsoft Research Ltd., Cambridge CB2 3NH, U.K.
John C. Platt, Microsoft Research, Redmond, WA 98052, U.S.A.
John Shawe-Taylor, Royal Holloway, University of London, Egham, Surrey TW20 0EX, U.K.
Alex J. Smola and Robert C. Williamson, Department of Engineering, Australian National University, Canberra 0200, Australia
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.

We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a theoretical analysis of the statistical performance of our algorithm.

The algorithm is a natural extension of the support vector algorithm to the case of unlabeled data.
1 Introduction
During recent years, a new set of kernel techniques for supervised learning has been developed (Vapnik, 1995; Schölkopf, Burges, & Smola, 1999). Specifically, support vector (SV) algorithms for pattern recognition, regression estimation, and solution of inverse problems have received considerable attention.
There have been a few attempts to transfer the idea of using kernels to compute inner products in feature spaces to the domain of unsupervised learning. The problems in that domain are, however, less precisely specified. Generally, they can be characterized as estimating functions of the data which reveal something interesting about the underlying distributions. For instance, kernel principal component analysis (PCA) can be characterized as computing functions that on the training data produce unit variance outputs while having minimum norm in feature space (Schölkopf, Smola, & Müller, 1999). Another kernel-based unsupervised learning technique, regularized principal manifolds (Smola, Mika, Schölkopf, & Williamson, in press), computes functions that give a mapping onto a lower-dimensional manifold, minimizing a regularized quantization error. Clustering algorithms are further examples of unsupervised learning techniques that can be kernelized (Schölkopf, Smola, & Müller, 1999).

Neural Computation 13, 1443–1471 (2001)  © 2001 Massachusetts Institute of Technology
An extreme point of view is that unsupervised learning is about estimating densities. Clearly, knowledge of the density of P would then allow us to solve whatever problem can be solved on the basis of the data.
The work presented here addresses an easier problem: it proposes an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero (Schölkopf, Williamson, Smola, & Shawe-Taylor, 1999). In doing so, it is in line with Vapnik's principle never to solve a problem that is more general than the one we actually need to solve. Moreover, it is also applicable in cases where the density of the data's distribution is not even well defined, for example, if there are singular components.
After a review of some previous work in section 2, we propose SV algorithms for the considered problem. Section 4 gives details on the implementation of the optimization procedure, followed by theoretical results characterizing the present approach. In section 6, we apply the algorithm to artificial as well as real-world data. We conclude with a discussion.
2 Previous Work
In order to describe some previous work, it is convenient to introduce the following definition of a (multidimensional) quantile function, introduced by Einmahl and Mason (1992). Let x_1, …, x_ℓ be independently and identically distributed (i.i.d.) random variables in a set X with distribution P. Let C be a class of measurable subsets of X and let λ be a real-valued function defined on C. The quantile function with respect to (P, λ, C) is
U(\alpha) = \inf\{\lambda(C) : P(C) \ge \alpha,\ C \in \mathcal{C}\}, \qquad 0 < \alpha \le 1.

In the special case where P is the empirical distribution (P_\ell(C) = \frac{1}{\ell}\sum_{i=1}^{\ell} 1_C(x_i)), we obtain the empirical quantile function. We denote by C(α) and C_ℓ(α) the (not necessarily unique) C ∈ C that attains the infimum (when it is achievable). The most common choice of λ is Lebesgue measure, in which case C(α) is the minimum volume C ∈ C that contains at least a fraction α of the probability mass. Estimators of the form C_ℓ(α) are called minimum volume estimators.
Observe that for C being all Borel measurable sets, C(1) is the support of the density p corresponding to P, assuming it exists. (Note that C(1) is well defined even when p does not exist.) For smaller classes C, C(1) is the minimum volume C ∈ C containing the support of p.
Turning to the case where α < 1, it seems the first work was reported by Sager (1979) and then Hartigan (1987), who considered X = R² with C being the class of closed convex sets in X. (They actually considered density contour clusters; cf. appendix A for a definition.) Nolan (1991) considered higher dimensions with C being the class of ellipsoids. Tsybakov (1997) has studied an estimator based on piecewise polynomial approximation of C(α) and has shown it attains the asymptotically minimax rate for certain classes of densities p. Polonik (1997) has studied the estimation of C(α) by C_ℓ(α). He derived asymptotic rates of convergence in terms of various measures of richness of C. He considered both VC classes and classes with a log ε-covering number with bracketing of order O(\epsilon^{-r}) for r > 0. More information on minimum volume estimators can be found in that work and in appendix A.
A number of applications have been suggested for these techniques. They include problems in medical diagnosis (Tarassenko, Hayton, Cerneaz, & Brady, 1995), marketing (Ben-David & Lindenbaum, 1997), condition monitoring of machines (Devroye & Wise, 1980), estimating manufacturing yields (Stoneking, 1999), econometrics and generalized nonlinear principal curves (Tsybakov, 1997; Korostelev & Tsybakov, 1993), regression and spectral analysis (Polonik, 1997), tests for multimodality and clustering (Polonik, 1995b), and others (Müller, 1992). Most of this work, in particular that with a theoretical slant, does not go all the way in devising practical algorithms that work on high-dimensional real-world problems. A notable exception to this is the work of Tarassenko et al. (1995).
Polonik (1995a) has shown how one can use estimators of C(α) to construct density estimators. The point of doing this is that it allows one to encode a range of prior assumptions about the true density p that would be impossible to do within the traditional density estimation framework. He has shown asymptotic consistency and rates of convergence for densities belonging to VC classes or with a known rate of growth of metric entropy with bracketing.
Let us conclude this section with a short discussion of how the work presented here relates to the above. This article describes an algorithm that finds regions close to C(α). Our class C is defined implicitly via a kernel k as the set of half-spaces in an SV feature space. We do not try to minimize the volume of C in input space. Instead, we minimize an SV-style regularizer that, using a kernel, controls the smoothness of the estimated function describing C. In terms of multidimensional quantiles, our approach can be thought of as employing λ(C_w) = ‖w‖², where C_w = {x : f_w(x) ≥ ρ}. Here, (w, ρ) are a weight vector and an offset parameterizing a hyperplane in the feature space associated with the kernel.
The main contribution of this work is that we propose an algorithm that has tractable computational complexity, even in high-dimensional cases. Our theory, which uses tools very similar to those used by Polonik, gives results that we expect will be of more use in a finite sample size setting.
3 Algorithms
We first introduce terminology and notation conventions. We consider training data

x_1, \dots, x_\ell \in X, \qquad (3.1)
where ℓ ∈ ℕ is the number of observations and X is some set. For simplicity, we think of it as a compact subset of R^N. Let Φ be a feature map X → F, that is, a map into an inner product space F such that the inner product in the image of Φ can be computed by evaluating some simple kernel (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995; Schölkopf, Burges, et al., 1999),
k(x, y) = (\Phi(x) \cdot \Phi(y)), \qquad (3.2)
such as the gaussian kernel
k(x, y) = e^{-\|x - y\|^2 / c}. \qquad (3.3)
Indices i and j are understood to range over 1, …, ℓ (in compact notation: i, j ∈ [ℓ]). Boldface Greek letters denote ℓ-dimensional vectors whose components are labeled using a normal typeface.
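As a concrete illustration (not part of the original article), the gaussian kernel of equation 3.3 can be sketched in a few lines of Python; the function name and the list-of-floats pattern representation are our own choices:

```python
import math

def gaussian_kernel(x, y, c):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / c); cf. equation 3.3.
    Patterns x, y are sequences of floats; c > 0 is the kernel width."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / c)
```

Note that k(x, x) = 1 for every x and k(x, y) > 0 for every pair, two properties used in section 5 to show that the mapped data are separable from the origin.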
In the remainder of this section, we develop an algorithm that returns a function f that takes the value +1 in a "small" region capturing most of the data points and −1 elsewhere. Our strategy is to map the data into the feature space corresponding to the kernel and to separate them from the origin with maximum margin. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on, in feature space. Via the freedom to use different types of kernel functions, this simple geometric picture corresponds to a variety of nonlinear estimators in input space.
To separate the data set from the origin, we solve the following quadratic program:

\min_{w \in F,\ \xi \in \mathbb{R}^\ell,\ \rho \in \mathbb{R}} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu\ell}\sum_i \xi_i - \rho \qquad (3.4)

\text{subject to} \quad (w \cdot \Phi(x_i)) \ge \rho - \xi_i, \quad \xi_i \ge 0. \qquad (3.5)
Here, ν ∈ (0, 1] is a parameter whose meaning will become clear later.
Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function

f(x) = \operatorname{sgn}((w \cdot \Phi(x)) - \rho) \qquad (3.6)

will be positive for most examples x_i contained in the training set,¹ while the SV-type regularization term ‖w‖ will still be small. The actual trade-off between these two goals is controlled by ν.
Using multipliers α_i, β_i ≥ 0, we introduce a Lagrangian

L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu\ell}\sum_i \xi_i - \rho - \sum_i \alpha_i\left((w \cdot \Phi(x_i)) - \rho + \xi_i\right) - \sum_i \beta_i \xi_i \qquad (3.7)
and set the derivatives with respect to the primal variables w, ξ, ρ equal to zero, yielding

w = \sum_i \alpha_i \Phi(x_i), \qquad (3.8)

\alpha_i = \frac{1}{\nu\ell} - \beta_i \le \frac{1}{\nu\ell}, \qquad \sum_i \alpha_i = 1. \qquad (3.9)
In equation 3.8, all patterns {x_i : i ∈ [ℓ], α_i > 0} are called support vectors. Together with equation 3.2, the SV expansion transforms the decision function, equation 3.6, into a kernel expansion:
f(x) = \operatorname{sgn}\left(\sum_i \alpha_i k(x_i, x) - \rho\right). \qquad (3.10)
Substituting equation 3.8 and equation 3.9 into L (see equation 3.7) and using equation 3.2, we obtain the dual problem:

\min_{\alpha} \frac{1}{2} \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu\ell}, \quad \sum_i \alpha_i = 1. \qquad (3.11)
One can show that at the optimum, the two inequality constraints, equation 3.5, become equalities if α_i and β_i are nonzero, that is, if 0 < α_i < 1/(νℓ). Therefore, we can recover ρ by exploiting that for any such α_i, the corresponding pattern x_i satisfies
\rho = (w \cdot \Phi(x_i)) = \sum_j \alpha_j k(x_j, x_i). \qquad (3.12)
¹ We use the convention that sgn(z) equals 1 for z ≥ 0 and −1 otherwise.
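To make the recovery of ρ and the resulting decision function concrete, here is a minimal Python sketch of equations 3.10 and 3.12. It assumes the dual variables α_i have already been obtained by some solver; all function names are ours, not the authors':

```python
import math

def gaussian_kernel(x, y, c=0.5):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

def recover_rho(alphas, X, nu, kernel=gaussian_kernel):
    """Equation 3.12: rho = sum_j alpha_j k(x_j, x_i) for any pattern x_i
    whose multiplier lies strictly between the bounds 0 and 1/(nu*l)."""
    l = len(X)
    upper = 1.0 / (nu * l)
    eps = 1e-12
    for i, a in enumerate(alphas):
        if eps < a < upper - eps:  # non-bound support vector
            return sum(alphas[j] * kernel(X[j], X[i]) for j in range(l))
    raise ValueError("no non-bound support vector found")

def decision(x, alphas, X, rho, kernel=gaussian_kernel):
    """Equation 3.10, with the convention sgn(z) = 1 for z >= 0, else -1."""
    s = sum(a * kernel(xi, x) for a, xi in zip(alphas, X)) - rho
    return 1 if s >= 0 else -1
```

With this convention, a non-bound support vector itself sits exactly on the decision boundary, as the text around equation 3.12 requires.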
Note that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity, that is, the second inequality constraint in equation 3.11 becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (see equation 3.4). It is still a feasible problem, since we have placed no restriction on the offset ρ, so it can become a large negative number in order to satisfy equation 3.5. If we had required ρ ≥ 0 from the start, we would have ended up with the constraint \sum_i \alpha_i \ge 1 instead of the corresponding equality constraint in equation 3.11, and the multipliers α_i could have diverged.
It is instructive to compare equation 3.11 to a Parzen windows estimator. To this end, suppose we use a kernel that can be normalized as a density in input space, such as the gaussian (see equation 3.3). If we use ν = 1, then the two constraints only allow the solution α_1 = ⋯ = α_ℓ = 1/ℓ. Thus, the kernel expansion in equation 3.10 reduces to a Parzen windows estimate of the underlying density. For ν < 1, the equality constraint in equation 3.11 still ensures that the decision function is a thresholded density; however, in that case, the density will be represented only by a subset of training examples (the SVs) — those that are important for the decision (see equation 3.10) to be taken. Section 5 will explain the precise meaning of ν.
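To see the reduction concretely: with ν = 1 the box and sum constraints force every α_i = 1/ℓ, so the argument of the sgn in equation 3.10 becomes a Parzen windows density estimate thresholded at ρ. A small sketch of our own (the threshold value here is arbitrary, chosen only for illustration):

```python
import math

def gaussian_kernel(x, y, c=0.5):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

def parzen_decision(x, X, rho, c=0.5):
    """nu = 1 case: alpha_i = 1/l for all i, so equation 3.10 thresholds
    the Parzen windows estimate (1/l) sum_i k(x_i, x) at rho."""
    density = sum(gaussian_kernel(xi, x, c) for xi in X) / len(X)
    return 1 if density >= rho else -1
```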
To conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of Schölkopf, Burges, and Vapnik (1995), with hard boundaries, and Tax and Duin (1999), with "soft margins." Again, we try to put most of the data into a small ball by solving, for ν ∈ (0, 1):
\min_{R \in \mathbb{R},\ \xi \in \mathbb{R}^\ell,\ c \in F} \quad R^2 + \frac{1}{\nu\ell} \sum_i \xi_i

\text{subject to} \quad \|\Phi(x_i) - c\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0 \ \text{for} \ i \in [\ell]. \qquad (3.13)
This leads to the dual

\min_{\alpha} \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i) \qquad (3.14)

\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu\ell}, \quad \sum_i \alpha_i = 1, \qquad (3.15)
and the solution

c = \sum_i \alpha_i \Phi(x_i), \qquad (3.16)
corresponding to a decision function of the form

f(x) = \operatorname{sgn}\left(R^2 - \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) + 2\sum_i \alpha_i k(x_i, x) - k(x, x)\right). \qquad (3.17)
Similar to the above, R² is computed such that for any x_i with 0 < α_i < 1/(νℓ), the argument of the sgn is zero.
For kernels k(x, y) that depend only on x − y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem, 3.14 and 3.15, turns out to be equivalent to equation 3.11. It can be shown that the same holds true for the decision function; hence, the two algorithms coincide in that case. This is geometrically plausible. For constant k(x, x), all mapped patterns lie on a sphere in feature space. Therefore, finding the smallest sphere (containing the points) really amounts to finding the smallest segment of the sphere that the data live on. The segment, however, can be found in a straightforward way by simply intersecting the data sphere with a hyperplane; the hyperplane with maximum margin of separation to the origin will cut off the smallest segment.
4 Optimization
Section 3 formulated quadratic programs (QPs) for computing regions that capture a certain fraction of the data. These constrained optimization problems can be solved using an off-the-shelf QP package to compute the solution. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In this section, we describe an algorithm that takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (sequential minimal optimization), an SV training algorithm originally proposed for classification (Platt, 1999) and subsequently adapted to regression estimation (Smola & Schölkopf, in press).
The strategy of SMO is to break up the constrained minimization of equation 3.11 into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. We therefore resort to optimizing over pairs of variables.
4.1 Elementary Optimization Step. For instance, consider optimizing over α_1 and α_2 with all other variables fixed. Using the shorthand K_{ij} = k(x_i, x_j), equation 3.11 then reduces to
\min_{\alpha_1, \alpha_2} \frac{1}{2} \sum_{i,j=1}^{2} \alpha_i \alpha_j K_{ij} + \sum_{i=1}^{2} \alpha_i C_i + C, \qquad (4.1)

with C_i = \sum_{j=3}^{\ell} \alpha_j K_{ij} and C = \sum_{i,j=3}^{\ell} \alpha_i \alpha_j K_{ij}, subject to

0 \le \alpha_1, \alpha_2 \le \frac{1}{\nu\ell}, \qquad \sum_{i=1}^{2} \alpha_i = \Delta, \qquad (4.2)
where \Delta = 1 - \sum_{i=3}^{\ell} \alpha_i. We discard C, which is independent of α_1 and α_2, and eliminate α_1 to obtain
\min_{\alpha_2} \frac{1}{2}(\Delta - \alpha_2)^2 K_{11} + (\Delta - \alpha_2)\alpha_2 K_{12} + \frac{1}{2}\alpha_2^2 K_{22} + (\Delta - \alpha_2) C_1 + \alpha_2 C_2, \qquad (4.3)
with the derivative

-(\Delta - \alpha_2) K_{11} + (\Delta - 2\alpha_2) K_{12} + \alpha_2 K_{22} - C_1 + C_2. \qquad (4.4)
Setting this to zero and solving for α_2, we get

\alpha_2 = \frac{\Delta(K_{11} - K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2K_{12}}. \qquad (4.5)
Once α_2 is found, α_1 can be recovered from α_1 = Δ − α_2. If the new point (α_1, α_2) is outside of [0, 1/(νℓ)], the constrained optimum is found by projecting α_2 from equation 4.5 into the region allowed by the constraints and then recomputing α_1.
The offset ρ is recomputed after every such step.

Additional insight can be obtained by rewriting equation 4.5 in terms of the outputs of the kernel expansion on the examples x_1 and x_2 before the optimization step. Let α_1*, α_2* denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. equation 3.10) read

O_i = K_{1i}\alpha_1^* + K_{2i}\alpha_2^* + C_i. \qquad (4.6)

Using the latter to eliminate the C_i, we end up with an update equation for α_2 which does not explicitly depend on α_1*:
\alpha_2 = \alpha_2^* + \frac{O_1 - O_2}{K_{11} + K_{22} - 2K_{12}}, \qquad (4.7)

which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of ν-constraint satisfaction.
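The elementary step of equations 4.5 through 4.7, including the projection back into the box constraints, can be sketched as follows (a simplified illustration of ours; the handling of a degenerate second derivative is our own choice):

```python
def smo_pair_step(a1, a2, O1, O2, K11, K22, K12, upper):
    """One SMO step on the pair (a1, a2); 'upper' is the bound 1/(nu*l),
    and O1, O2 are the kernel expansion outputs before the step."""
    delta = a1 + a2                     # preserved by the equality constraint
    eta = K11 + K22 - 2.0 * K12         # second derivative along the step
    if eta <= 0.0:
        return a1, a2                   # degenerate direction: skip this pair
    a2_new = a2 + (O1 - O2) / eta       # unconstrained optimum, equation 4.7
    # project into [0, upper] so that a1 = delta - a2 also stays in the box
    a2_new = min(max(a2_new, max(0.0, delta - upper)), min(upper, delta))
    return delta - a2_new, a2_new
```

The clipping interval guarantees that both updated multipliers remain feasible while their sum, fixed by the equality constraint, is unchanged.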
Clearly, the same elementary optimization step can be applied to any pair of two variables, not just α_1 and α_2. We next briefly describe how to do the overall optimization.
4.2 Initialization of the Algorithm. We start by setting a fraction ν of all α_i, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that \sum_i \alpha_i = 1. Moreover, we set the initial ρ to max{O_i : i ∈ [ℓ], α_i > 0}.
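The initialization just described can be sketched as follows (our own code, not the authors'; the random choice is seeded only for reproducibility):

```python
import random

def initialize_alphas(l, nu, seed=0):
    """Set a fraction nu of the alphas (randomly chosen) to 1/(nu*l);
    if nu*l is not an integer, one further example receives the leftover
    mass so that the alphas sum to 1 (cf. section 4.2)."""
    rng = random.Random(seed)
    upper = 1.0 / (nu * l)
    n_full = int(nu * l)                     # multipliers set to the bound
    fractional = (nu * l) - n_full > 1e-12   # does nu*l have a remainder?
    alphas = [0.0] * l
    chosen = rng.sample(range(l), n_full + (1 if fractional else 0))
    for idx in chosen[:n_full]:
        alphas[idx] = upper
    if fractional:
        alphas[chosen[-1]] = 1.0 - n_full * upper
    return alphas
```

Since the leftover mass 1 − ⌊νℓ⌋/(νℓ) is strictly less than the bound 1/(νℓ), the extra example indeed lands in the open interval (0, 1/(νℓ)).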
4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SV_nb for the indices of variables that are not at bound, that is, SV_nb = {i : i ∈ [ℓ], 0 < α_i < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position.
We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that (O_i − ρ) · α_i > 0 or (ρ − O_i) · (1/(νℓ) − α_i) > 0. Once we have found one, say α_i, we pick α_j according to

j = \arg\max_{n \in SV_{nb}} |O_i - O_n|. \qquad (4.8)
We repeat that step, but the scan is performed only over SV_nb.

In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SV_nb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates.
In unusual circumstances, the choice heuristic, equation 4.8, cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999) and have been found to work well in the experiments we report below.
In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
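The two selection heuristics of section 4.3 amount to a KKT violation test and an argmax over the non-bound set; a schematic rendering (our own names and a simplified tolerance, not the authors' implementation):

```python
def kkt_violated(O_i, a_i, rho, upper, tol=1e-3):
    """A point violates the KKT conditions if the expansion output exceeds
    rho while a_i > 0, or falls below rho while a_i < upper (section 4.3)."""
    return (O_i - rho) * a_i > tol or (rho - O_i) * (upper - a_i) > tol

def select_second(i, O, sv_nb):
    """Second-variable choice of equation 4.8: maximize |O_i - O_n| over
    the indices of the non-bound multipliers."""
    return max(sv_nb, key=lambda n: abs(O[i] - O[n]))
```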
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset ρ at the end.
In the next section, it will be argued that subtracting something from ρ is actually advisable also from a statistical point of view.
² This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2) and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4), and finally we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.
In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

x_i = \Phi(x_i). \qquad (5.1)
Definition 1. A data set

x_1, \dots, x_\ell \qquad (5.2)

is called separable if there exists some w ∈ F such that (w · x_i) > 0 for i ∈ [ℓ].
If we use a gaussian kernel (see equation 3.3), then any data set x_1, …, x_ℓ is separable after it is mapped into feature space. To see this, note that k(x_i, x_j) > 0 for all i, j; thus, all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they all have unit length. Hence, they are separable from the origin.
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

\min_{w \in F} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad (w \cdot x_i) \ge \rho, \quad i \in [\ell]. \qquad (5.3)
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set

\{(x_1, 1), \dots, (x_\ell, 1), (-x_1, -1), \dots, (-x_\ell, -1)\}. \qquad (5.4)
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set

\{(x_1, y_1), \dots, (x_\ell, y_\ell)\} \quad (y_i \in \{\pm 1\} \ \text{for} \ i \in [\ell]), \qquad (5.5)

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set

\{y_1 x_1, \dots, y_\ell x_\ell\}. \qquad (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x) which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
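Proposition 3 is easy to check empirically on the output of any solver: count outliers and SVs and compare the fractions with ν. A hypothetical helper of ours (the tolerance and the names are our own choices):

```python
def nu_property_holds(alphas, outputs, rho, nu, tol=1e-9):
    """Check proposition 3 empirically: the fraction of outliers
    (outputs strictly below rho) should not exceed nu, and the fraction
    of SVs (nonzero alphas) should be at least nu."""
    l = len(alphas)
    frac_outliers = sum(1 for O in outputs if O < rho - tol) / l
    frac_svs = sum(1 for a in alphas if a > tol) / l
    return frac_outliers <= nu <= frac_svs
```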
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ ℝ. For x ∈ X, let d(x, f, θ) = max{0, θ − f(x)}. Similarly, for a training sequence X = (x_1, …, x_ℓ), we define

D(X, f, \theta) = \sum_{x \in X} d(x, f, \theta).
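Definition 2 translates directly into code; given the values f(x) on the training sequence, the quantities d and D are one-liners (an illustration of ours, not part of the article):

```python
def d(fx, theta):
    """Pointwise slack of definition 2: d(x, f, theta) = max(0, theta - f(x))."""
    return max(0.0, theta - fx)

def D(outputs, theta):
    """Total slack D(X, f, theta) over a training sequence, given the
    precomputed values f(x) for each x in X."""
    return sum(d(fx, theta) for fx in outputs)
```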
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ X^ℓ, generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution f_w given explicitly by equation 3.10. Let R_{w,ρ} = {x : f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

P\{x' : x' \notin R_{w,\rho-\gamma}\} \le \frac{2}{\ell}\left(k + \log \frac{\ell^2}{2\delta}\right), \qquad (5.7)
where

k = \frac{c_1 \log(c_2 \hat{\gamma}^2 \ell)}{\hat{\gamma}^2} + \frac{2D}{\hat{\gamma}} \log\left(e\left(\frac{(2\ell - 1)\hat{\gamma}}{2D} + 1\right)\right) + 2, \qquad (5.8)

c_1 = 16c^2, c_2 = \ln(2)/(4c^2), c = 103, \hat{\gamma} = \gamma/\|w\|, D = D(X, f_{w,0}, \rho) = D(X, f_{w,\rho}, 0), and ρ is given by equation 3.12.
The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}.
The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂; that is, it increases with w. This justifies measuring the complexity of the estimated region by the size of w, and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which
³ Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
[Figure 1 panel annotations: ν, width c = 0.5, 0.5 | 0.5, 0.5 | 0.1, 0.5 | 0.5, 0.1; frac. SVs/OLs = 0.54, 0.43 | 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38; margin ρ/‖w‖ = 0.84 | 0.70 | 0.62 | 0.48]
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fraction of outliers in the training set appears slightly larger than it should be.
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels (reading the two rows of images left to right: 6 9 2 8 1 8 8 6 5 3 / 2 3 8 7 0 3 0 8 2 7).
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
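The effect of ν on the recognition rate can be reproduced in miniature on scikit-learn's small 8×8 digits data, standing in for USPS; the nu and gamma values below are assumptions chosen for illustration, so the percentages will not match the paper's.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import OneClassSVM

digits = load_digits()
X0 = digits.data[digits.target == 0] / 16.0    # train only on the zeros

results = {}
for nu in (0.5, 0.05):
    clf = OneClassSVM(kernel="rbf", nu=nu, gamma=0.5).fit(X0)
    # fraction of training zeros accepted by the estimated region
    results[nu] = float(np.mean(clf.predict(X0) == 1))

# Smaller nu -> fewer patterns treated as outliers, hence a higher
# recognition rate on the target class, mirroring the 50% vs. 5% runs.
print(results)
```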
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3 to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
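The label-augmentation trick is easy to state in code: append a one-hot encoding of each pattern's alleged label to its feature vector before training, then rank patterns by the SVM output. Everything below (data, scaling, parameters) is a synthetic illustration of the mechanism, not the USPS setup.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
n, k = 200, 10
X = rng.randn(n, 16)                    # synthetic "patterns"
y = rng.randint(0, k, size=n)           # their (alleged) labels

def augment(X, y, k=10, scale=1.0):
    # 10 extra dimensions carrying a one-hot encoding of the label
    return np.hstack([X, scale * np.eye(k)[y]])

Xa = augment(X, y)
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.05).fit(Xa)

# Rank patterns by SVM output; the lowest-scoring ones are the candidate
# outliers (unusual patterns, or usual patterns with unusual labels).
scores = clf.decision_function(Xa)
worst = np.argsort(scores)[:20]
print(worst[:5])
```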
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
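A measurement of this kind can be set up as follows: fit on nested subsets and read the scaling exponent off a log-log fit, as in the Figure 6 plots. Absolute numbers depend on hardware and solver; this is only the measurement recipe, on assumed synthetic data.

```python
import time
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(2000, 10)                 # synthetic stand-in for USPS

sizes = [250, 500, 1000, 2000]
times = []
for m in sizes:
    t0 = time.perf_counter()
    OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(X[:m])
    times.append(time.perf_counter() - t0)

# Slope of log2(time) against log2(size) estimates the scaling exponent;
# the paper reads these exponents off exactly such log-log plots.
slope = float(np.polyfit(np.log2(sizes), np.log2(times), 1)[0])
print(list(zip(sizes, times)), slope)
```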
In addition to the experiments reported above, the algorithm has since
9−513 1−507 0−458 1−377 7−282 2−216 3−200 9−186 5−179 0−162
3−153 6−143 6−128 0−123 7−117 5−93 0−78 7−58 6−52 3−48
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set, Size ℓ = 2007.

   ν [%]   Fraction of OLs [%]   Fraction of SVs [%]   Training Time (CPU sec)
     1            0.0                  10.0                     36
     2            0.0                  10.0                     39
     3            0.1                  10.0                     31
     4            0.6                  10.1                     40
     5*           1.4                  10.6                     36
     6            1.8                  11.2                     33
     7            2.6                  11.5                     42
     8            4.1                  12.0                     53
     9            5.4                  12.9                     76
    10            6.2                  13.7                     65
    20           16.9                  22.6                    193
    30           27.5                  31.8                    269
    40           37.1                  41.7                    685
    50           47.4                  51.2                   1284
    60           58.2                  61.0                   1150
    70           68.3                  70.7                   1512
    80           78.5                  80.5                   2206
    90           89.4                  90.1                   2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments (ν = 5) is marked with an asterisk.
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing: for small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Scholkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
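The invariance of p(x) · dx can be checked numerically with a hypothetical transformation, here y = x² applied to a uniform density (chosen purely for illustration): the density values change, but the probability mass of corresponding regions does not.

```python
import numpy as np

# X ~ Uniform[0, 1]; under y = x^2, the image has density p_Y(y) = 1/(2*sqrt(y)).
a, b = 0.2, 0.4                        # a region in x-space
mass_x = b - a                         # P_X([a, b]) for the uniform density

# The corresponding region in y-space is [a^2, b^2]; integrate p_Y over it
# with the trapezoid rule.
ys = np.linspace(a**2, b**2, 100001)
p_y = 1.0 / (2.0 * np.sqrt(ys))
mass_y = float(np.sum(0.5 * (p_y[1:] + p_y[:-1]) * np.diff(ys)))

# The density values differ, but p(x) dx is preserved: both masses are 0.2.
print(mass_x, mass_y)
```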
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Scholkopf, & Muller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
$$
\begin{aligned}
\sum_{i,j} \alpha_i \alpha_j \, k(x_i, x_j)
&= \sum_{i,j} \alpha_i \alpha_j \, \bigl( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \bigr) \\
&= \sum_{i,j} \alpha_i \alpha_j \, \bigl( k(x_i, \cdot) \cdot (P^{*}P\,k)(x_j, \cdot) \bigr) \\
&= \sum_{i,j} \alpha_i \alpha_j \, \bigl( (Pk)(x_i, \cdot) \cdot (Pk)(x_j, \cdot) \bigr) \\
&= \Bigl( \Bigl( P \sum_i \alpha_i k(x_i, \cdot) \Bigr) \cdot \Bigl( P \sum_j \alpha_j k(x_j, \cdot) \Bigr) \Bigr) \\
&= \| P f \|^2,
\end{aligned}
$$
using f(x) = Σᵢ αᵢ k(xᵢ, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∼ e^{−‖Pf‖²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Scholkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined and the one of Vapnik and Chapelle (2000) will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Scholkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply
$$
\hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), \tag{A.1}
$$
where B(x, ε) is the l₂(X) ball of radius ε centered at x, and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), Ĉ_plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ_plug-in, they analyzed Ĉ^β_plug-in = {x : p̂_ℓ(x) > β_ℓ}, where (β_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, a is related to β, but this connection cannot be readily exploited by this type of estimator.
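A minimal sketch of the classical estimator (A.1) — membership in the union of balls around the sample points — with ε_ℓ fixed by hand rather than by a theoretical schedule:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(500, 2))   # sample supported on the unit square

def in_support_estimate(y, X, eps):
    """Membership in the union of balls B(x_i, eps), i = 1, ..., l."""
    return bool(np.min(np.linalg.norm(X - y, axis=1)) <= eps)

eps = 0.1                              # eps_l, chosen by hand here
print(in_support_estimate(np.array([0.5, 0.5]), X, eps))   # a point in the square
print(in_support_estimate(np.array([3.0, 3.0]), X, eps))   # a point far outside
```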
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (a ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Muller, 1992) to construct an estimator of "generalized a-clusters" related to C(a).
Define the excess mass over 𝒞 at level a as

$$ E_{\mathcal{C}}(a) = \sup \{ H_a(C) : C \in \mathcal{C} \}, $$

where H_a(·) = (P − aλ)(·), and again λ denotes Lebesgue measure. Any set C_𝒞(a) ∈ 𝒞 such that

$$ E_{\mathcal{C}}(a) = H_a(C_{\mathcal{C}}(a)) $$

is called a generalized a-cluster in 𝒞. Replace P by P_ℓ in these definitions to obtain their empirical counterparts Ê_𝒞(a) and Ĉ_𝒞(a). In other words, his estimator is

$$ \hat{C}_{\mathcal{C}}(a) = \arg\max \, \{ (P_\ell - a\lambda)(C) : C \in \mathcal{C} \}, $$
where the max is not necessarily unique. Now, while Ĉ_𝒞(a) is clearly different from Ĉ_ℓ(a), it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, Ĉ_𝒞(a) is more closely related to the determination of density contour clusters at level a,

$$ c_p(a) = \{ x : p(x) \ge a \}. $$
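A brute-force toy version of the empirical excess-mass estimator, restricted to the class 𝒞 of intervals on the real line (the data distribution and level a are assumptions for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.sort(rng.randn(300))            # sample; P_l is its empirical measure

def excess_mass_interval(x, a):
    """Maximize (P_l - a*lambda)(C) over intervals C = [x[i], x[j]]."""
    n = len(x)
    best, best_iv = -np.inf, None
    for i in range(n):
        for j in range(i, n):
            # empirical mass of [x[i], x[j]] minus a times its length
            val = (j - i + 1) / n - a * (x[j] - x[i])
            if val > best:
                best, best_iv = val, (x[i], x[j])
    return best, best_iv

best, (lo, hi) = excess_mass_interval(x, a=0.1)
print(best, lo, hi)
```

For a unimodal density such as the standard normal used here, the maximizing interval approximates the density contour cluster {x : p(x) ≥ a}.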
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(a). They too made use of VC classes, but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy, in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ ℕ, define the typical set A_ε^{(ℓ)} with respect to p by

$$ A_\epsilon^{(\ell)} = \Bigl\{ (x_1, \ldots, x_\ell) \in S^\ell : \bigl| -\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X) \bigr| \le \epsilon \Bigr\}, $$

where p(x₁, …, x_ℓ) = ∏_{i=1}^{ℓ} p(xᵢ). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

$$ \operatorname{vol} A_\epsilon^{(\ell)} \doteq \operatorname{vol} C_\ell(1 - \delta) \doteq 2^{\ell h}. $$

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
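As a sanity check on the quantities involved, the sketch below computes the differential entropy (in bits) of a standard normal both analytically and by Monte Carlo estimation of −E[log₂ p(X)]; the sample size is an arbitrary choice.

```python
import numpy as np

# Analytic differential entropy of N(0, 1), in bits: 0.5 * log2(2*pi*e).
h_analytic = 0.5 * np.log2(2 * np.pi * np.e)

# Monte Carlo estimate of -E[log2 p(X)] for the same density.
rng = np.random.RandomState(0)
x = rng.randn(200000)
log2_p = -0.5 * np.log2(2 * np.pi) - (x**2 / 2) / np.log(2)
h_mc = float(-np.mean(log2_p))

print(h_analytic, h_mc)
```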
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · xᵢ) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore, the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints; that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ and a separating hyperplane (w, 0).

Ad (ii). By assumption, w is the shortest vector satisfying yᵢ(w · xᵢ) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · yᵢxᵢ) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σᵢ ξᵢ in equation 3.4 will change proportionally to the number of points that have a nonzero ξᵢ (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξᵢ, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Scholkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x′_o = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ′_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α′ = α is still feasible, as is the primal solution (w′, ξ′, ρ′). Here, we use ξ′ᵢ = (1 + δ · α_o)ξᵢ for i ≠ o, w′ = (1 + δ · α_o)w, and ρ′ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α′_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.

Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
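For intuition, here is a tiny greedy construction of an ε-cover for a finite set on the real line; greedy covering gives an upper bound on the covering number N(ε, A, d), not its exact value, and the set A is an arbitrary example.

```python
import numpy as np

def greedy_cover(A, eps):
    """Greedily build an eps-cover U of A; |U| upper-bounds N(eps, A, d)."""
    U = []
    for a in A:
        # add a as a new center only if no existing center covers it
        if not any(abs(a - u) <= eps for u in U):
            U.append(a)
    return U

A = np.linspace(0, 1, 101)             # the finite set to cover
U = greedy_cover(A, eps=0.1)
print(len(U))
```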
1466 Bernhard Scholkopf et al
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l_∞ norm over a finite sample X = (x₁, …, x_ℓ) for the pseudonorm on F,

$$ \| f \|_{l_\infty^X} = \max_{x \in X} | f(x) |. \tag{B.2} $$

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose 𝒳 is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ 𝒳^ℓ} N(ε, F, l_∞^X) (the maximum exists by definition of the covering number and compactness of 𝒳).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x₁, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

$$ P \Bigl\{ x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma \Bigr\} \le \epsilon(\ell, k, \delta) = \frac{2}{\ell} \Bigl( k + \log \frac{\ell}{\delta} \Bigr), $$

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Scholkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
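To get a feel for the theorem 2 bound ε(ℓ, k, δ) = (2/ℓ)(k + log(ℓ/δ)), one can simply plug in numbers; the values of ℓ, k, and δ below are arbitrary illustrations, not quantities computed for any particular kernel class.

```python
import math

def bound(l, k, delta):
    """The right-hand side of theorem 2: (2/l) * (k + log(l/delta))."""
    return (2.0 / l) * (k + math.log(l / delta))

# e.g. l = 1000 samples, log covering number k = 50, confidence delta = 0.05
print(bound(1000, 50, 0.05))
```

As expected of a capacity bound, it tightens as the sample size ℓ grows and loosens as the covering number (and hence k) grows.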
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i∈[ℓ]} f(xᵢ). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(xᵢ) : xᵢ ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξᵢ in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.

Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with support supp(f) countable; that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

$$ f \cdot g = \sum_{x \in \operatorname{supp}(f)} f(x) g(x). $$

The 1-norm on L(X) is defined by ‖f‖₁ = Σ_{x∈supp(f)} f(x), and let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ: X → X × L(X), τ: x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

$$ \delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases} $$

For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by

$$ g_f(y) = g_f^{X,\theta}(y) = \sum_{x \in X} d(x, f, \theta) \, \delta_x(y). $$
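Definition 4's space L(X) is just "nonnegative functions with countable support," which a dictionary models directly. This toy sketch (all names hypothetical) shows the inner product summed over supp(f), the 1-norm, and the delta functions δ_x:

```python
def inner(f, g):
    """Inner product on L(X): sum over supp(f) of f(x) * g(x)."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    """The 1-norm, summing f over its support."""
    return sum(f.values())

def delta(x):
    """The function delta_x: value 1 at x, 0 elsewhere."""
    return {x: 1.0}

f = {"a": 2.0, "b": 1.0}     # supp(f) = {a, b}
g = {"b": 3.0, "c": 5.0}     # supp(g) = {b, c}
print(inner(f, g), one_norm(f), inner(f, delta("a")))
```

Note that inner(f, delta(x)) evaluates f at x, which is exactly the role δ_x plays in the embedding τ.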
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g_f^{X,θ} ∈ L_B(X) (i.e., Σ_{x∈X} d(x, f, θ) ≤ B),

$$ P \{ x : f(x) < \theta - 2\gamma \} \le \frac{2}{\ell} \Bigl( k + \log \frac{\ell}{\delta} \Bigr), \tag{B.3} $$

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.

The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability. The proof is given in Scholkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997); their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Scholkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Scholkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnorr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Scholkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Muller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Scholkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Scholkopf, B., Smola, A., & Muller, K.-R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Scholkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Scholkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Scholkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Scholkopf, B., & Muller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Scholkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Scholkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Nauka, Moscow. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Scholkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Scholkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufman.
Received November 19, 1999; accepted September 22, 2000.
1444 Bernhard Scholkopf et al
fied. Generally, they can be characterized as estimating functions of the data which reveal something interesting about the underlying distributions. For instance, kernel principal component analysis (PCA) can be characterized as computing functions that on the training data produce unit variance outputs while having minimum norm in feature space (Schölkopf, Smola, & Müller, 1999). Another kernel-based unsupervised learning technique, regularized principal manifolds (Smola, Mika, Schölkopf, & Williamson, in press), computes functions that give a mapping onto a lower-dimensional manifold minimizing a regularized quantization error. Clustering algorithms are further examples of unsupervised learning techniques that can be kernelized (Schölkopf, Smola, & Müller, 1999).
An extreme point of view is that unsupervised learning is about estimating densities. Clearly, knowledge of the density of P would then allow us to solve whatever problem can be solved on the basis of the data.
The work presented here addresses an easier problem: it proposes an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero (Schölkopf, Williamson, Smola, & Shawe-Taylor, 1999). In doing so, it is in line with Vapnik's principle never to solve a problem that is more general than the one we actually need to solve. Moreover, it is also applicable in cases where the density of the data's distribution is not even well defined, for example, if there are singular components.
After a review of some previous work in section 2, we propose SV algorithms for the considered problem; section 4 gives details on the implementation of the optimization procedure, followed by theoretical results characterizing the present approach. In section 6, we apply the algorithm to artificial as well as real-world data. We conclude with a discussion.
2 Previous Work
In order to describe some previous work, it is convenient to introduce the following definition of a (multidimensional) quantile function, introduced by Einmal and Mason (1992). Let x_1, ..., x_ℓ be independently and identically distributed (i.i.d.) random variables in a set X with distribution P. Let C be a class of measurable subsets of X, and let λ be a real-valued function defined on C. The quantile function with respect to (P, λ, C) is

U(α) = inf{λ(C) : P(C) ≥ α, C ∈ C},   0 < α ≤ 1.

In the special case where P is the empirical distribution (P_ℓ(C) = (1/ℓ) Σ_{i=1}^{ℓ} 1_C(x_i)), we obtain the empirical quantile function. We denote by C(α) and C_ℓ(α) the (not necessarily unique) C ∈ C that attains the infimum (when it is achievable). The most common choice of λ is Lebesgue measure, in which case C(α) is the minimum volume C ∈ C that contains at least a fraction α
of the probability mass. Estimators of the form C_ℓ(α) are called minimum volume estimators.
Observe that for C being all Borel measurable sets, C(1) is the support of the density p corresponding to P, assuming it exists. (Note that C(1) is well defined even when p does not exist.) For smaller classes C, C(1) is the minimum volume C ∈ C containing the support of p.
Turning to the case where α < 1, it seems the first work was reported by Sager (1979) and then Hartigan (1987), who considered X = R² with C being the class of closed convex sets in X. (They actually considered density contour clusters; cf. appendix A for a definition.) Nolan (1991) considered higher dimensions with C being the class of ellipsoids. Tsybakov (1997) has studied an estimator based on piecewise polynomial approximation of C(α) and has shown it attains the asymptotically minimax rate for certain classes of densities p. Polonik (1997) has studied the estimation of C(α) by C_ℓ(α). He derived asymptotic rates of convergence in terms of various measures of richness of C. He considered both VC classes and classes with a log ε-covering number with bracketing of order O(ε^(−r)) for r > 0. More information on minimum volume estimators can be found in that work and in appendix A.
A number of applications have been suggested for these techniques. They include problems in medical diagnosis (Tarassenko, Hayton, Cerneaz, & Brady, 1995), marketing (Ben-David & Lindenbaum, 1997), condition monitoring of machines (Devroye & Wise, 1980), estimating manufacturing yields (Stoneking, 1999), econometrics and generalized nonlinear principal curves (Tsybakov, 1997; Korostelev & Tsybakov, 1993), regression and spectral analysis (Polonik, 1997), tests for multimodality and clustering (Polonik, 1995b), and others (Müller, 1992). Most of this work, in particular that with a theoretical slant, does not go all the way in devising practical algorithms that work on high-dimensional real-world problems. A notable exception to this is the work of Tarassenko et al. (1995).
Polonik (1995a) has shown how one can use estimators of C(α) to construct density estimators. The point of doing this is that it allows one to encode a range of prior assumptions about the true density p that would be impossible to do within the traditional density estimation framework. He has shown asymptotic consistency and rates of convergence for densities belonging to VC-classes or with a known rate of growth of metric entropy with bracketing.
Let us conclude this section with a short discussion of how the work presented here relates to the above. This article describes an algorithm that finds regions close to C(α). Our class C is defined implicitly via a kernel k as the set of half-spaces in an SV feature space. We do not try to minimize the volume of C in input space. Instead, we minimize an SV-style regularizer that, using a kernel, controls the smoothness of the estimated function describing C. In terms of multidimensional quantiles, our approach can be thought of as employing λ(C_w) = ||w||², where C_w = {x : f_w(x) ≥ ρ}. Here,
(w, ρ) are a weight vector and an offset parameterizing a hyperplane in the feature space associated with the kernel.
The main contribution of this work is that we propose an algorithm that has tractable computational complexity, even in high-dimensional cases. Our theory, which uses tools very similar to those used by Polonik, gives results that we expect will be of more use in a finite sample size setting.
3 Algorithms
We first introduce terminology and notation conventions. We consider training data

x_1, ..., x_ℓ ∈ X, (3.1)

where ℓ ∈ N is the number of observations and X is some set. For simplicity, we think of it as a compact subset of R^N. Let Φ be a feature map X → F, that is, a map into an inner product space F such that the inner product in the image of Φ can be computed by evaluating some simple kernel (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995; Schölkopf, Burges, et al., 1999)

k(x, y) = (Φ(x) · Φ(y)), (3.2)

such as the gaussian kernel

k(x, y) = e^(−||x − y||² / c). (3.3)

Indices i and j are understood to range over 1, ..., ℓ (in compact notation, i, j ∈ [ℓ]). Boldface Greek letters denote ℓ-dimensional vectors whose components are labeled using a normal typeface.
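For concreteness, the gaussian kernel of equation 3.3 and the Gram matrix K_ij = k(x_i, x_j) used throughout the following sections can be written in a few lines of NumPy (a sketch; the function names are ours):

```python
import numpy as np

def gaussian_kernel(x, y, c=0.5):
    """k(x, y) = exp(-||x - y||^2 / c), the kernel of equation 3.3."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.exp(-np.sum((x - y) ** 2) / c))

def gram_matrix(X, c=0.5):
    """K[i, j] = k(x_i, x_j) for all training pairs, as used in section 4."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / c)
```

Since k(x, x) = 1 for every x, the mapped patterns all have unit length in feature space, a fact exploited after definition 1 in section 5.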
In the remainder of this section, we develop an algorithm that returns a function f that takes the value +1 in a "small" region capturing most of the data points and −1 elsewhere. Our strategy is to map the data into the feature space corresponding to the kernel and to separate them from the origin with maximum margin. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on, in feature space. Via the freedom to use different types of kernel functions, this simple geometric picture corresponds to a variety of nonlinear estimators in input space.
To separate the data set from the origin, we solve the following quadratic program:

min_{w∈F, ξ∈R^ℓ, ρ∈R}  (1/2)||w||² + (1/(νℓ)) Σ_i ξ_i − ρ (3.4)

subject to  (w · Φ(x_i)) ≥ ρ − ξ_i,   ξ_i ≥ 0. (3.5)

Here, ν ∈ (0, 1] is a parameter whose meaning will become clear later.
Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function

f(x) = sgn((w · Φ(x)) − ρ) (3.6)

will be positive for most examples x_i contained in the training set,¹ while the SV-type regularization term ||w|| will still be small. The actual trade-off between these two goals is controlled by ν.
Using multipliers α_i, β_i ≥ 0, we introduce a Lagrangian

L(w, ξ, ρ, α, β) = (1/2)||w||² + (1/(νℓ)) Σ_i ξ_i − ρ − Σ_i α_i ((w · Φ(x_i)) − ρ + ξ_i) − Σ_i β_i ξ_i, (3.7)

and set the derivatives with respect to the primal variables w, ξ, ρ equal to zero, yielding

w = Σ_i α_i Φ(x_i), (3.8)

α_i = 1/(νℓ) − β_i ≤ 1/(νℓ),   Σ_i α_i = 1. (3.9)
In equation 3.8, all patterns {x_i : i ∈ [ℓ], α_i > 0} are called support vectors. Together with equation 3.2, the SV expansion transforms the decision function, equation 3.6, into a kernel expansion:

f(x) = sgn(Σ_i α_i k(x_i, x) − ρ). (3.10)
Substituting equations 3.8 and 3.9 into L (see equation 3.7) and using equation 3.2, we obtain the dual problem:

min_α  (1/2) Σ_{ij} α_i α_j k(x_i, x_j)   subject to   0 ≤ α_i ≤ 1/(νℓ),   Σ_i α_i = 1. (3.11)
One can show that at the optimum, the two inequality constraints, equation 3.5, become equalities if α_i and β_i are nonzero, that is, if 0 < α_i < 1/(νℓ). Therefore, we can recover ρ by exploiting that for any such α_i, the corresponding pattern x_i satisfies

ρ = (w · Φ(x_i)) = Σ_j α_j k(x_j, x_i). (3.12)
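Given a dual solution α from any QP solver, equations 3.12 and 3.10 translate directly into code. The sketch below assumes a precomputed Gram matrix K and a kernel function k; averaging ρ over all multipliers strictly between the bounds is our choice, made for numerical stability:

```python
import numpy as np

def recover_rho(alpha, K, nu, tol=1e-8):
    """Equation 3.12: rho equals the expansion output at any pattern whose
    multiplier satisfies 0 < alpha_i < 1/(nu*l); we average over all such
    "free" support vectors."""
    l = len(alpha)
    upper = 1.0 / (nu * l)
    free = (alpha > tol) & (alpha < upper - tol)
    return float(np.mean((K @ alpha)[free]))

def decision_function(x, X, alpha, rho, k):
    """Equation 3.10: f(x) = sgn(sum_i alpha_i k(x_i, x) - rho), with the
    convention sgn(0) = +1 (footnote 1)."""
    g = sum(a * k(xi, x) for a, xi in zip(alpha, X) if a > 0)
    return 1 if g - rho >= 0 else -1
```

Only the support vectors (α_i > 0) contribute to the sum, which is what makes the expansion potentially sparse.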
¹ We use the convention that sgn(z) equals 1 for z ≥ 0 and −1 otherwise.
Note that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity, that is, the second inequality constraint in equation 3.11 becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (see equation 3.4). It is still a feasible problem, since we have placed no restriction on the offset ρ, so it can become a large negative number in order to satisfy equation 3.5. If we had required ρ ≥ 0 from the start, we would have ended up with the constraint Σ_i α_i ≥ 1 instead of the corresponding equality constraint in equation 3.11, and the multipliers α_i could have diverged.
It is instructive to compare equation 3.11 to a Parzen windows estimator. To this end, suppose we use a kernel that can be normalized as a density in input space, such as the gaussian (see equation 3.3). If we use ν = 1, then the two constraints only allow the solution α_1 = ··· = α_ℓ = 1/ℓ. Thus, the kernel expansion in equation 3.10 reduces to a Parzen windows estimate of the underlying density. For ν < 1, the equality constraint in equation 3.11 still ensures that the decision function is a thresholded density; however, in that case, the density will be represented only by a subset of training examples (the SVs), namely those that are important for the decision (see equation 3.10) to be taken. Section 5 will explain the precise meaning of ν.
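This correspondence is easy to check numerically. A sketch on a gaussian toy sample: with ν = 1, every α_i = 1/ℓ, and, as in the experiment of Figure 2, the offset is chosen so that a fraction ν′ = 0.1 of the training points falls outside (the quantile-based threshold is our simplification):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy training sample, l = 100

def parzen_output(x, X, c=0.5):
    # alpha_i = 1/l for all i when nu = 1, so the kernel expansion of
    # equation 3.10 averages the gaussian windows centered on the data
    # (a Parzen windows estimate, up to the window normalization).
    return float(np.mean(np.exp(-np.sum((X - x) ** 2, axis=1) / c)))

# Choose rho so that a fraction nu' = 0.1 of the training points lies
# outside the estimated region, as done for Figure 2.
outputs = np.array([parzen_output(x, X) for x in X])
rho = np.quantile(outputs, 0.1)
f = lambda x: 1 if parzen_output(x, X) >= rho else -1
```

Points near the bulk of the sample then receive +1, while points far from all data receive −1.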
To conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of Schölkopf, Burges, and Vapnik (1995), with hard boundaries, and Tax and Duin (1999), with "soft margins." Again, we try to put most of the data into a small ball by solving, for ν ∈ (0, 1),

min_{R∈R, ξ∈R^ℓ, c∈F}  R² + (1/(νℓ)) Σ_i ξ_i

subject to  ||Φ(x_i) − c||² ≤ R² + ξ_i,   ξ_i ≥ 0 for i ∈ [ℓ]. (3.13)
This leads to the dual

min_α  Σ_{ij} α_i α_j k(x_i, x_j) − Σ_i α_i k(x_i, x_i) (3.14)

subject to  0 ≤ α_i ≤ 1/(νℓ),   Σ_i α_i = 1, (3.15)

and the solution

c = Σ_i α_i Φ(x_i), (3.16)

corresponding to a decision function of the form

f(x) = sgn(R² − Σ_{ij} α_i α_j k(x_i, x_j) + 2 Σ_i α_i k(x_i, x) − k(x, x)). (3.17)
Similar to the above, R² is computed such that for any x_i with 0 < α_i < 1/(νℓ), the argument of the sgn is zero.

For kernels k(x, y) that depend only on x − y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem, 3.14 and 3.15, turns out to be equivalent to equation 3.11. It can be shown that the same holds true for the decision function; hence, the two algorithms coincide in that case. This is geometrically plausible: for constant k(x, x), all mapped patterns lie on a sphere in feature space. Therefore, finding the smallest sphere (containing the points) really amounts to finding the smallest segment of the sphere that the data live on. The segment, however, can be found in a straightforward way by simply intersecting the data sphere with a hyperplane: the hyperplane with maximum margin of separation to the origin will cut off the smallest segment.
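The equivalence of the two duals for constant k(x, x) can also be verified directly: with a unit diagonal, the target of equations 3.14 and 3.15 equals twice the target of equation 3.11 minus 1 on the whole feasible set, so the minimizers agree. A quick numerical check of this identity (the toy sample is our construction):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 0.5)                  # gaussian kernel: k(x, x) = 1

def hyperplane_obj(a):                 # target of equation 3.11
    return 0.5 * a @ K @ a

def ball_obj(a):                       # target of equation 3.14
    return a @ K @ a - a @ np.diag(K)

# On any feasible alpha (sum = 1), sum_i alpha_i k(x_i, x_i) = 1 when the
# diagonal is constant 1, so ball_obj = 2 * hyperplane_obj - 1.
a = rng.random(8)
a /= a.sum()
assert np.isclose(ball_obj(a), 2 * hyperplane_obj(a) - 1.0)
```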
4 Optimization
Section 3 formulated quadratic programs (QPs) for computing regions that capture a certain fraction of the data. These constrained optimization problems can be solved using an off-the-shelf QP package to compute the solution. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In this section, we describe an algorithm that takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (sequential minimal optimization), an SV training algorithm originally proposed for classification (Platt, 1999) and subsequently adapted to regression estimation (Smola & Schölkopf, in press).
The strategy of SMO is to break up the constrained minimization of equation 3.11 into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. We therefore resort to optimizing over pairs of variables.
4.1 Elementary Optimization Step. For instance, consider optimizing over α_1 and α_2 with all other variables fixed. Using the shorthand K_ij := k(x_i, x_j), equation 3.11 then reduces to

min_{α_1, α_2}  (1/2) Σ_{i,j=1}^{2} α_i α_j K_ij + Σ_{i=1}^{2} α_i C_i + C, (4.1)

with C_i := Σ_{j=3}^{ℓ} α_j K_ij and C := Σ_{i,j=3}^{ℓ} α_i α_j K_ij, subject to

0 ≤ α_1, α_2 ≤ 1/(νℓ),   Σ_{i=1}^{2} α_i = Δ, (4.2)

where Δ := 1 − Σ_{i=3}^{ℓ} α_i.
We discard C, which is independent of α_1 and α_2, and eliminate α_1 to obtain

min_{α_2}  (1/2)(Δ − α_2)² K_11 + (Δ − α_2) α_2 K_12 + (1/2) α_2² K_22 + (Δ − α_2) C_1 + α_2 C_2, (4.3)

with the derivative

−(Δ − α_2) K_11 + (Δ − 2α_2) K_12 + α_2 K_22 − C_1 + C_2. (4.4)

Setting this to zero and solving for α_2, we get

α_2 = (Δ(K_11 − K_12) + C_1 − C_2) / (K_11 + K_22 − 2K_12). (4.5)

Once α_2 is found, α_1 can be recovered from α_1 = Δ − α_2. If the new point (α_1, α_2) is outside of [0, 1/(νℓ)], the constrained optimum is found by projecting α_2 from equation 4.5 into the region allowed by the constraints and then recomputing α_1.
The offset ρ is recomputed after every such step.
Additional insight can be obtained by rewriting equation 4.5 in terms of the outputs of the kernel expansion on the examples x_1 and x_2 before the optimization step. Let α_1*, α_2* denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. equation 3.10) read

O_i := K_1i α_1* + K_2i α_2* + C_i. (4.6)

Using the latter to eliminate the C_i, we end up with an update equation for α_2, which does not explicitly depend on α_1*:

α_2 = α_2* + (O_1 − O_2) / (K_11 + K_22 − 2K_12), (4.7)

which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of ν-constraint satisfaction.
Clearly, the same elementary optimization step can be applied to any pair of two variables, not just α_1 and α_2. We next briefly describe how to do the overall optimization.
4.2 Initialization of the Algorithm. We start by setting a fraction ν of all α_i, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that Σ_i α_i = 1. Moreover, we set the initial ρ to max{O_i : i ∈ [ℓ], α_i > 0}.
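In code, this initialization could look as follows (a sketch; the random choice via a permutation and the tie-breaking are ours):

```python
import numpy as np

def initialize(l, nu, rng):
    """Initialization of section 4.2: a fraction nu of the alphas, randomly
    chosen, start at the upper bound 1/(nu*l); if nu*l is not an integer,
    one further alpha takes the slack so that sum(alpha) = 1."""
    upper = 1.0 / (nu * l)
    alpha = np.zeros(l)
    idx = rng.permutation(l)
    n_bound = int(nu * l)                # multipliers set to the bound
    alpha[idx[:n_bound]] = upper
    slack = 1.0 - n_bound * upper        # in (0, upper) when nu*l is fractional
    if slack > 1e-12 and n_bound < l:
        alpha[idx[n_bound]] = slack
    return alpha
```

The initial ρ = max{O_i : α_i > 0} can then be obtained from the Gram matrix as `(K @ alpha)[alpha > 0].max()`.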
4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SVnb for the indices of variables that are not at bound, that is, SVnb := {i : i ∈ [ℓ], 0 < α_i < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position.

We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that (O_i − ρ) · α_i > 0 or (ρ − O_i) · (1/(νℓ) − α_i) > 0. Once we have found one, say α_i, we pick α_j according to

j = arg max_{n ∈ SVnb} |O_i − O_n|. (4.8)

We repeat that step, but the scan is performed only over SVnb.

In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SVnb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates.
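The first-type scan amounts to evaluating the two KKT conditions above on every pattern; a vectorized sketch (the tolerance handling is ours):

```python
import numpy as np

def kkt_violators(alpha, O, rho, upper, tol=1e-6):
    """Indices i violating the KKT conditions of the scan: either
    (O_i - rho) * alpha_i > 0  (output above rho, yet alpha_i > 0), or
    (rho - O_i) * (upper - alpha_i) > 0  (outlier not at the upper bound),
    where O holds the kernel-expansion outputs and upper = 1/(nu*l)."""
    bad = ((O - rho) * alpha > tol) | ((rho - O) * (upper - alpha) > tol)
    return np.flatnonzero(bad)
```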
In unusual circumstances, the choice heuristic, equation 4.8, cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999) and have been found to work well in the experiments we report below.
In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset ρ at the end.

In the next section, it will be argued that subtracting something from ρ is actually advisable also from a statistical point of view.
² This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2) and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4), and finally we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.
In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

x_i := Φ(x_i). (5.1)
Definition 1. A data set

x_1, ..., x_ℓ (5.2)

is called separable if there exists some w ∈ F such that (w · x_i) > 0 for i ∈ [ℓ].
If we use a gaussian kernel (see equation 3.3), then any data set x_1, ..., x_ℓ is separable after it is mapped into feature space. To see this, note that k(x_i, x_j) > 0 for all i, j; thus, all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they all have unit length. Hence they are separable from the origin.
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

min_{w∈F}  (1/2)||w||²   subject to   (w · x_i) ≥ ρ,   i ∈ [ℓ]. (5.3)
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set

{(x_1, 1), ..., (x_ℓ, 1), (−x_1, −1), ..., (−x_ℓ, −1)}. (5.4)
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set

{(x_1, y_1), ..., (x_ℓ, y_ℓ)}   (y_i ∈ {±1} for i ∈ [ℓ]), (5.5)

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/||w|| is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set

{y_1 x_1, ..., y_ℓ x_ℓ}. (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x), which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution
lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ R. For x ∈ X, let d(x, f, θ) := max{0, θ − f(x)}. Similarly, for a training sequence X := (x_1, ..., x_ℓ), we define

D(X, f, θ) := Σ_{x∈X} d(x, f, θ).
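Both quantities of definition 2 are one-liners; a sketch (f is any real-valued output function, e.g., x ↦ (w · Φ(x))):

```python
def d(x, f, theta):
    """d(x, f, theta) = max{0, theta - f(x)}: how far f(x) falls short of theta."""
    return max(0.0, theta - f(x))

def D(X, f, theta):
    """D(X, f, theta): the total shortfall over the training sequence X."""
    return sum(d(x, f, theta) for x in X)
```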
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ X^ℓ, generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution f_w given explicitly by equation 3.10. Let R_{w,ρ} := {x : f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

P{x′ : x′ ∉ R_{w,ρ−γ}} ≤ (2/ℓ) (k + log(ℓ²/(2δ))), (5.7)

where

k = (c₁ log(c₂ γ̂² ℓ)) / γ̂² + (2D/γ̂) log(e((2ℓ − 1)γ̂/(2D) + 1)) + 2, (5.8)

c₁ = 16c², c₂ = ln(2)/(4c²), c = 103, γ̂ = γ/||w||, D = D(X, f_{w,0}, ρ) = D(X, f_{w,ρ}, 0), and ρ is given by equation 3.12.
The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}.

The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂, that is, it increases with w. This justifies measuring the complexity of the estimated region by the size of w and minimizing ||w||² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3) that has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness," which
³ Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
[Figure 1 panel annotations:]
ν, width c:      0.5, 0.5    0.5, 0.5    0.1, 0.5    0.5, 0.1
frac. SVs/OLs:   0.54, 0.43  0.59, 0.47  0.24, 0.03  0.65, 0.38
margin ρ/||w||:  0.84        0.70        0.62        0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction of ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator; that is, it uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
[Figure 3 histogram legends (both panels): test, train, other, offset.]
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
[Figure 4 image labels:]
6 9 2 8 1 8 8 6 5 3
2 3 8 7 0 3 0 8 2 7
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels.
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such) while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers; with the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3 to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
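The pairwise optimization underlying the solver can be sketched compactly. Below is a minimal, standard-library-only illustration of SMO-style pairwise coordinate descent on the one-class dual of section 4, run on made-up 2-D gaussian data (the sample, kernel width, and fixed number of sweeps are our own choices, not the paper's experimental setup); it also checks the ν-property of Proposition 3 numerically.

```python
import math, random

random.seed(0)

# Hypothetical 2-D sample; the paper trains on USPS digits instead.
ell, nu, c = 30, 0.21, 2.0          # sample size, outlier control nu, kernel width
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(ell)]

def k(a, b):                        # gaussian kernel (cf. equation 3.3)
    return math.exp(-((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) / c)

K = [[k(X[i], X[j]) for j in range(ell)] for i in range(ell)]
cap = 1.0 / (nu * ell)              # box constraint 0 <= alpha_i <= 1/(nu*ell)
alpha = [1.0 / ell] * ell           # feasible start: sum(alpha) = 1

# Pairwise coordinate descent on the dual:
#   minimize (1/2) sum_ij alpha_i alpha_j K_ij
#   subject to 0 <= alpha_i <= cap and sum_i alpha_i = 1.
for _ in range(60):
    for i in range(ell):
        for j in range(i + 1, ell):
            s = alpha[i] + alpha[j]
            eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
            if eta < 1e-12:
                continue
            gi = sum(alpha[m] * K[i][m] for m in range(ell))
            gj = sum(alpha[m] * K[j][m] for m in range(ell))
            # unconstrained optimum along the line alpha_i + alpha_j = s,
            # then clip back into the box
            ai = alpha[i] + (gj - gi) / eta
            ai = min(max(ai, max(0.0, s - cap)), min(cap, s))
            alpha[i], alpha[j] = ai, s - ai

# decision function f(x_i) = sum_j alpha_j k(x_j, x_i) - rho, with rho
# recovered from non-bound support vectors (cf. equation 3.12)
g = [sum(alpha[j] * K[i][j] for j in range(ell)) for i in range(ell)]
inside = [i for i in range(ell) if 1e-6 < alpha[i] < cap - 1e-6]
if not inside:                      # fallback, in case no alpha is strictly inside
    inside = [i for i in range(ell) if alpha[i] > 1e-6]
rho = sum(g[i] for i in inside) / len(inside)

frac_sv = sum(1 for a in alpha if a > 1e-6) / ell                   # SVs
frac_out = sum(1 for i in range(ell) if g[i] - rho < -1e-4) / ell   # outliers
print(frac_out, frac_sv)
```

At the (approximate) optimum, the fraction of outliers stays below ν and the fraction of support vectors above it, up to the slack one expects on a small finite sample.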
In addition to the experiments reported above the algorithm has since
Estimating the Support of a High-Dimensional Distribution 1459
[Figure 5 images; class label / output (in units of 10^-5): 9/513, 1/507, 0/458, 1/377, 7/282, 2/216, 3/200, 9/186, 5/179, 0/162, 3/153, 6/143, 6/128, 0/123, 7/117, 5/93, 0/78, 7/58, 6/52, 3/48]
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10^-5) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set (Size ℓ = 2007).

ν (%)   Fraction of OLs (%)   Fraction of SVs (%)   Training Time (CPU sec)
  1            0.0                  10.0                      36
  2            0.0                  10.0                      39
  3            0.1                  10.0                      31
  4            0.6                  10.1                      40
  5            1.4                  10.6                      36
  6            1.8                  11.2                      33
  7            2.6                  11.5                      42
  8            4.1                  12.0                      53
  9            5.4                  12.9                      76
 10            6.2                  13.7                      65
 20           16.9                  22.6                     193
 30           27.5                  31.8                     269
 40           37.1                  41.7                     685
 50           47.4                  51.2                    1284
 60           58.2                  61.0                    1150
 70           68.3                  70.7                    1512
 80           78.5                  80.5                    2206
 90           89.4                  90.1                    2349
Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. Proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is the row ν = 5 (shown in boldface in the printed table).
1460 Bernhard Scholkopf et al
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Scholkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically speaking, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
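The invariance of p(x) · dx can be seen in a short numerical check: under a change of variables y = e^x, the density picks up the Jacobian factor 1/y, but the mass of a small interval is unchanged (standard library only; the transformation and evaluation point are arbitrary choices for illustration).

```python
import math

def normal_pdf(x):
    # density of a standard normal variable X
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def lognormal_pdf(y):
    # density of Y = exp(X): p_Y(y) = p_X(log y) * |dx/dy| = p_X(log y) / y
    return normal_pdf(math.log(y)) / y

# p(x) dx is preserved: compare the mass of a small interval before and
# after the transformation (dy = (dy/dx) dx = exp(x) dx)
x0, dx = 0.7, 1e-6
y0, dy = math.exp(x0), math.exp(x0) * dx
mass_x = normal_pdf(x0) * dx
mass_y = lognormal_pdf(y0) * dy
print(abs(mass_x - mass_y) < 1e-18)
```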
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Scholkopf, & Muller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
∑_{i,j} α_i α_j k(x_i, x_j) = ∑_{i,j} α_i α_j (k(x_i, ·) · δ_{x_j}(·))
                            = ∑_{i,j} α_i α_j (k(x_i, ·) · (P*P k)(x_j, ·))
                            = ∑_{i,j} α_i α_j ((P k)(x_i, ·) · (P k)(x_j, ·))
                            = ((P ∑_i α_i k(x_i, ·)) · (P ∑_j α_j k(x_j, ·)))
                            = ‖Pf‖²,
using f(x) = ∑_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^(−‖Pf‖²)
on the function space.
Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
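Since k is a Mercer kernel, the dual objective ∑_{ij} α_i α_j k(x_i, x_j) is a nonnegative quadratic form, consistent with its reading as a squared norm ‖Pf‖². A quick numerical sanity check (the toy 1-D points and kernel width are our own choices):

```python
import math, random

random.seed(1)

def k(a, b, c=2.0):
    # gaussian kernel (cf. equation 3.3), here on the real line
    return math.exp(-(a - b) ** 2 / c)

xs = [random.uniform(-3, 3) for _ in range(30)]
K = [[k(xi, xj) for xj in xs] for xi in xs]

# the quadratic form alpha^T K alpha (= ||Pf||^2) is nonnegative for
# arbitrary expansion coefficients, by positive semidefiniteness of k
for _ in range(100):
    a = [random.uniform(-1, 1) for _ in xs]
    quad = sum(a[i] * a[j] * K[i][j] for i in range(30) for j in range(30))
    assert quad >= -1e-9
```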
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Scholkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
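The separability claim for the gaussian kernel can be checked directly: since k(x, x) = 1 and k(x, y) > 0, all mapped points lie on the unit sphere within a single orthant of feature space, so, for instance, the mean of the mapped points defines a hyperplane through the origin with all data strictly on one side. A small check on arbitrary toy data (standard library only):

```python
import math, random

random.seed(2)

def k(a, b, c=0.5):
    # gaussian kernel on R^3
    return math.exp(-sum((u - v) ** 2 for u, v in zip(a, b)) / c)

X = [tuple(random.uniform(-5, 5) for _ in range(3)) for _ in range(50)]
K = [[k(x, y) for y in X] for x in X]

# all images lie on the unit sphere (k(x, x) = 1) with positive pairwise
# inner products (k(x, y) > 0)
assert all(abs(K[i][i] - 1.0) < 1e-12 for i in range(50))
assert all(K[i][j] > 0 for i in range(50) for j in range(50))

# with w = mean of the mapped points, (w . Phi(x_i)) = mean_j k(x_j, x_i) > 0
# for every i, so the hyperplane through the origin separates the data from 0
margins = [sum(K[i]) / 50 for i in range(50)]
assert min(margins) > 0
```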
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined and the one of Vapnik and Chapelle (2000) will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Scholkopf, Platt, & Smola, 2000), is subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply
Ĉ_ℓ = ∪_{i=1}^{ℓ} B(x_i, ε_ℓ),   (A.1)
where B(x, ε) is the l₂(X) ball of radius ε centered at x and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of equation A.1 with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), Ĉ^plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ^plug-in, they analyzed Ĉ^plug-in_β = {x : p̂_ℓ(x) > β_ℓ}, where (β_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to β, but this connection cannot be readily exploited by this type of estimator.
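Both estimators are easy to state in code. Below is a standard-library sketch of the union-of-balls estimator of equation A.1 and of the thresholded kernel-density plug-in; the 1-D sample, radius, bandwidth, and threshold are illustrative choices of ours.

```python
import math

sample = [0.1, 0.4, 0.5, 0.9, 1.0, 1.3]   # hypothetical 1-D training points

def in_ball_estimate(x, eps=0.2):
    # union-of-balls estimator: x belongs to the estimate iff it lies
    # within eps of some training point
    return any(abs(x - xi) <= eps for xi in sample)

def kde(x, h=0.3):
    # gaussian kernel density estimator
    ell = len(sample)
    return sum(math.exp(-(x - xi) ** 2 / (2 * h * h)) for xi in sample) / (
        ell * h * math.sqrt(2 * math.pi))

def in_plugin_estimate(x, beta=0.05):
    # thresholded plug-in estimator {x : kde(x) > beta}
    return kde(x) > beta

print(in_ball_estimate(0.55), in_ball_estimate(3.0))  # near vs. far point
```

A point close to the sample falls inside both estimates, while a distant point falls outside; shrinking eps (or raising beta) shrinks the estimated support.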
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Muller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).
Define the excess mass over C at level α as

E_C(α) = sup {H_α(C) : C ∈ C},

where H_α(·) = (P − αλ)(·) and, again, λ denotes Lebesgue measure. Any set C_C(α) ∈ C such that

E_C(α) = H_α(C_C(α))

is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E_C^ℓ(α) and C_C^ℓ(α). In other words, his estimator is

C_C^ℓ(α) = arg max {(P_ℓ − αλ)(C) : C ∈ C},

where the max is not necessarily unique. Now while C_C^ℓ(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, C_C^ℓ(α) is more closely related to the determination of density contour clusters at level α:

c_p(α) = {x : p(x) ≥ α}.
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A_ε^(ℓ) with respect to p by

A_ε^(ℓ) = {(x_1, ..., x_ℓ) ∈ S^ℓ : |−(1/ℓ) log p(x_1, ..., x_ℓ) − h(X)| ≤ ε},

where p(x_1, ..., x_ℓ) = ∏_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that, for all ε, δ < 1/2,

vol A_ε^(ℓ) ≐ vol C_ℓ(1 − δ) ≐ 2^(ℓh).
They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^(ℓh). This is an ℓ-dimensional volume, so the corresponding side length is (2^(ℓh))^(1/ℓ) = 2^h. This provides an interpretation of differential entropy" (p. 227).
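For a uniform density, this interpretation is exact: the differential entropy (in bits) of U[0, a] is log₂ a, so the side length 2^h recovers the support width a. A quick standard-library check, with the distribution and width chosen for illustration:

```python
import math

a = 3.0                               # width of the uniform support [0, a]
dx = a / 1000                         # Riemann-sum step

# differential entropy in bits: h = -∫ (1/a) log2(1/a) dx over [0, a] = log2(a)
h = -sum((1.0 / a) * math.log2(1.0 / a) * dx for _ in range(1000))

side_length = 2 ** h                  # 2^h, the "typical" side length
print(abs(side_length - a) < 1e-9)
```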
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).
Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for, if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).
Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term ∑_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Scholkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x'_o = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ'_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α' = α is still feasible, as is the primal solution (w', ξ', ρ'). Here, we use ξ'_i = (1 + δ · α_o)ξ_i for i ≠ o, w' = (1 + δ · α_o)w, and ρ' as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α'_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
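In one dimension (with d(a, u) = |a − u|), a minimal ε-cover of a finite set can be built greedily by sweeping left to right. A standard-library sketch, with an arbitrary example point set:

```python
def covering_number_1d(points, eps):
    # greedy sweep: each chosen center u covers the interval [u - eps, u + eps];
    # for a finite subset of the line this yields a minimal eps-cover
    # (using d(a, u) <= eps, as in definition 3)
    pts = sorted(points)
    cover = []
    i = 0
    while i < len(pts):
        u = pts[i] + eps          # center covering pts[i] and all points <= u + eps
        cover.append(u)
        while i < len(pts) and pts[i] <= u + eps:
            i += 1
    return len(cover)

grid = list(range(0, 101))        # the set A = {0, 1, ..., 100}
print(covering_number_1d(grid, 10))  # 5 balls of radius 10 suffice
```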
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l_∞ norm over a finite sample X = (x_1, ..., x_ℓ) for the pseudonorm on F:

‖f‖_{l_∞}^X = max_{x ∈ X} |f(x)|.   (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose X is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ X^ℓ} N(ε, F, l_∞^X) (the maximum exists by definition of the covering number and compactness of X).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, ..., x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P{x : f(x) < min_{i ∈ [ℓ]} f(x_i) − 2γ} ≤ ε(k, ℓ, δ) = (2/ℓ)(k + log(ℓ/δ)),

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Scholkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i ∈ [ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.
Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with support supp(f) countable; that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

f · g = ∑_{x ∈ supp(f)} f(x) g(x).

The 1-norm on L(X) is defined by ‖f‖₁ = ∑_{x ∈ supp(f)} f(x), and let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

δ_x(y) = 1 if y = x; 0 otherwise.

For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by

g_f(y) = g_f^{X,θ}(y) = ∑_{x ∈ X} d(x, f, θ) δ_x(y).
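Functions with countable support can be represented as sparse dictionaries, which makes the inner product and 1-norm of definition 4 one-liners. A standard-library sketch (the sample functions are made up for illustration):

```python
def inner(f, g):
    # f . g = sum over supp(f) of f(x) g(x); only shared support contributes
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    # ||f||_1 = sum over supp(f) of f(x)
    return sum(f.values())

f = {"a": 1.0, "b": 2.0}     # nonzero on {a, b} only
g = {"b": 3.0, "c": 4.0}     # nonzero on {b, c} only

print(inner(f, g))           # 2.0 * 3.0 = 6.0
print(one_norm(f))           # 1.0 + 2.0 = 3.0
print(one_norm(f) <= 4.0)    # f lies in L_B(X) for B = 4
```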
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g_f^{X,θ} ∈ L_B(X) (i.e., ∑_{x ∈ X} d(x, f, θ) ≤ B),

P{x : f(x) < θ − 2γ} ≤ (2/ℓ)(k + log(ℓ/δ)),   (B.3)

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.
(The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability.) The proof is given in Scholkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Scholkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Scholkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnorr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Scholkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Muller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beitrage zur Statistik, Universitat Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Scholkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Scholkopf, B., Smola, A., & Muller, K.-R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Scholkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Scholkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Scholkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Scholkopf, B., & Muller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Scholkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Scholkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Scholkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Scholkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufman.
Received November 19, 1999; accepted September 22, 2000.
of the probability mass. Estimators of the form C_ℓ(α) are called minimum volume estimators.
Observe that for C being all Borel measurable sets, C(1) is the support of the density p corresponding to P, assuming it exists. (Note that C(1) is well defined even when p does not exist.) For smaller classes C, C(1) is the minimum volume C ∈ C containing the support of p.
Turning to the case where α < 1, it seems the first work was reported by Sager (1979) and then Hartigan (1987), who considered X = R² with C being the class of closed convex sets in X. (They actually considered density contour clusters; cf. appendix A for a definition.) Nolan (1991) considered higher dimensions, with C being the class of ellipsoids. Tsybakov (1997) has studied an estimator based on piecewise polynomial approximation of C(α) and has shown it attains the asymptotically minimax rate for certain classes of densities p. Polonik (1997) has studied the estimation of C(α) by C_ℓ(α). He derived asymptotic rates of convergence in terms of various measures of richness of C. He considered both VC classes and classes with a log ε-covering number with bracketing of order O(ε^(−r)) for r > 0. More information on minimum volume estimators can be found in that work and in appendix A.
A number of applications have been suggested for these techniques. They include problems in medical diagnosis (Tarassenko, Hayton, Cerneaz, & Brady, 1995), marketing (Ben-David & Lindenbaum, 1997), condition monitoring of machines (Devroye & Wise, 1980), estimating manufacturing yields (Stoneking, 1999), econometrics and generalized nonlinear principal curves (Tsybakov, 1997; Korostelev & Tsybakov, 1993), regression and spectral analysis (Polonik, 1997), tests for multimodality and clustering (Polonik, 1995b), and others (Müller, 1992). Most of this work, in particular that with a theoretical slant, does not go all the way in devising practical algorithms that work on high-dimensional real-world problems. A notable exception to this is the work of Tarassenko et al. (1995).
Polonik (1995a) has shown how one can use estimators of C(α) to construct density estimators. The point of doing this is that it allows one to encode a range of prior assumptions about the true density p that would be impossible to do within the traditional density estimation framework. He has shown asymptotic consistency and rates of convergence for densities belonging to VC-classes or with a known rate of growth of metric entropy with bracketing.
Let us conclude this section with a short discussion of how the work presented here relates to the above. This article describes an algorithm that finds regions close to C(α). Our class C is defined implicitly via a kernel k as the set of half-spaces in an SV feature space. We do not try to minimize the volume of C in input space. Instead, we minimize an SV-style regularizer that, using a kernel, controls the smoothness of the estimated function describing C. In terms of multidimensional quantiles, our approach can be thought of as employing λ(C_w) = ‖w‖², where C_w = {x: f_w(x) ≥ ρ}. Here,
1446 Bernhard Scholkopf et al
(w, ρ) are a weight vector and an offset parameterizing a hyperplane in the feature space associated with the kernel.
The main contribution of this work is that we propose an algorithm that has tractable computational complexity, even in high-dimensional cases. Our theory, which uses tools very similar to those used by Polonik, gives results that we expect will be of more use in a finite sample size setting.
3 Algorithms
We first introduce terminology and notation conventions. We consider training data

x_1, . . . , x_ℓ ∈ X,  (3.1)

where ℓ ∈ ℕ is the number of observations and X is some set. For simplicity, we think of it as a compact subset of ℝ^N. Let Φ be a feature map X → F, that is, a map into an inner product space F such that the inner product in the image of Φ can be computed by evaluating some simple kernel (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995; Schölkopf, Burges, et al., 1999),

k(x, y) = (Φ(x) · Φ(y)),  (3.2)

such as the gaussian kernel

k(x, y) = e^{−‖x−y‖²/c}.  (3.3)

Indices i and j are understood to range over 1, . . . , ℓ (in compact notation: i, j ∈ [ℓ]). Boldface Greek letters denote ℓ-dimensional vectors whose components are labeled using a normal typeface.
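As a small concrete illustration of equations 3.2 and 3.3, the gaussian kernel can be evaluated directly in input space; a minimal sketch, assuming NumPy (the function name is ours):

```python
import numpy as np

def gaussian_kernel(x, y, c):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / c), cf. equation 3.3."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.exp(-np.sum((x - y) ** 2) / c)

# k(x, x) = 1 for every x, so all mapped patterns have unit length,
# and k(x, y) > 0, so they all lie in the same orthant of feature space.
print(gaussian_kernel([0.0, 0.0], [0.0, 0.0], c=0.5))  # 1.0
```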
In the remainder of this section, we develop an algorithm that returns a function f that takes the value +1 in a "small" region capturing most of the data points and −1 elsewhere. Our strategy is to map the data into the feature space corresponding to the kernel and to separate them from the origin with maximum margin. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on, in feature space. Via the freedom to use different types of kernel functions, this simple geometric picture corresponds to a variety of nonlinear estimators in input space.
To separate the data set from the origin, we solve the following quadratic program:

min_{w∈F, ξ∈ℝ^ℓ, ρ∈ℝ}  (1/2)‖w‖² + (1/(νℓ)) Σ_i ξ_i − ρ  (3.4)

subject to  (w · Φ(x_i)) ≥ ρ − ξ_i,  ξ_i ≥ 0.  (3.5)

Here, ν ∈ (0, 1] is a parameter whose meaning will become clear later.
Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function,

f(x) = sgn((w · Φ(x)) − ρ),  (3.6)

will be positive for most examples x_i contained in the training set,¹ while the SV-type regularization term ‖w‖ will still be small. The actual trade-off between these two goals is controlled by ν.
Using multipliers α_i, β_i ≥ 0, we introduce a Lagrangian,

L(w, ξ, ρ, α, β) = (1/2)‖w‖² + (1/(νℓ)) Σ_i ξ_i − ρ − Σ_i α_i ((w · Φ(x_i)) − ρ + ξ_i) − Σ_i β_i ξ_i,  (3.7)
and set the derivatives with respect to the primal variables w, ξ, ρ equal to zero, yielding

w = Σ_i α_i Φ(x_i),  (3.8)

α_i = 1/(νℓ) − β_i ≤ 1/(νℓ),  Σ_i α_i = 1.  (3.9)
In equation 3.8, all patterns {x_i: i ∈ [ℓ], α_i > 0} are called support vectors. Together with equation 3.2, the SV expansion transforms the decision function, equation 3.6, into a kernel expansion:
f(x) = sgn(Σ_i α_i k(x_i, x) − ρ).  (3.10)
Substituting equation 3.8 and equation 3.9 into L (see equation 3.7) and using equation 3.2, we obtain the dual problem:

min_α  (1/2) Σ_{i,j} α_i α_j k(x_i, x_j)  subject to  0 ≤ α_i ≤ 1/(νℓ),  Σ_i α_i = 1.  (3.11)
One can show that at the optimum, the two inequality constraints, equation 3.5, become equalities if α_i and β_i are nonzero, that is, if 0 < α_i < 1/(νℓ). Therefore, we can recover ρ by exploiting that for any such α_i, the corresponding pattern x_i satisfies

ρ = (w · Φ(x_i)) = Σ_j α_j k(x_j, x_i).  (3.12)
¹ We use the convention that sgn(z) equals 1 for z ≥ 0 and −1 otherwise.
Note that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity; that is, the second inequality constraint in equation 3.11 becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (see equation 3.4). It is still a feasible problem, since we have placed no restriction on the offset ρ, so it can become a large negative number in order to satisfy equation 3.5. If we had required ρ ≥ 0 from the start, we would have ended up with the constraint Σ_i α_i ≥ 1 instead of the corresponding equality constraint in equation 3.11, and the multipliers α_i could have diverged.
It is instructive to compare equation 3.11 to a Parzen windows estimator. To this end, suppose we use a kernel that can be normalized as a density in input space, such as the gaussian (see equation 3.3). If we use ν = 1, then the two constraints only allow the solution α_1 = · · · = α_ℓ = 1/ℓ. Thus the kernel expansion in equation 3.10 reduces to a Parzen windows estimate of the underlying density. For ν < 1, the equality constraint in equation 3.11 still ensures that the decision function is a thresholded density; however, in that case, the density will be represented only by a subset of training examples (the SVs): those that are important for the decision (see equation 3.10) to be taken. Section 5 will explain the precise meaning of ν.
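The ν = 1 case can be made concrete in a few lines. The sketch below (hypothetical data, assuming NumPy) evaluates the kernel expansion of equation 3.10 with all α_i forced to a common value by the constraints, which is exactly an (unnormalized) Parzen windows estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # hypothetical training sample
c = 0.5

def kernel_expansion(x, X, alpha, c):
    """Sum_i alpha_i k(x_i, x) with the gaussian kernel of equation 3.3."""
    sq = np.sum((X - x) ** 2, axis=1)
    return alpha @ np.exp(-sq / c)

alpha = np.full(len(X), 1.0 / len(X))  # the only feasible point for nu = 1
f_mean = kernel_expansion(X.mean(axis=0), X, alpha, c)
f_far = kernel_expansion(np.array([10.0, 10.0]), X, alpha, c)
# The (unnormalized) Parzen estimate is large near the data and tiny far away;
# thresholding it at some rho yields the decision function of equation 3.10.
print(f_mean > f_far)  # True
```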
To conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of Schölkopf, Burges, and Vapnik (1995), with hard boundaries, and Tax and Duin (1999), with "soft margins." Again, we try to put most of the data into a small ball by solving, for ν ∈ (0, 1),

min_{R∈ℝ, ξ∈ℝ^ℓ, c∈F}  R² + (1/(νℓ)) Σ_i ξ_i

subject to  ‖Φ(x_i) − c‖² ≤ R² + ξ_i,  ξ_i ≥ 0  for i ∈ [ℓ].  (3.13)
This leads to the dual,

min_α  Σ_{i,j} α_i α_j k(x_i, x_j) − Σ_i α_i k(x_i, x_i)  (3.14)

subject to  0 ≤ α_i ≤ 1/(νℓ),  Σ_i α_i = 1,  (3.15)
and the solution,

c = Σ_i α_i Φ(x_i),  (3.16)
corresponding to a decision function of the form

f(x) = sgn(R² − Σ_{i,j} α_i α_j k(x_i, x_j) + 2 Σ_i α_i k(x_i, x) − k(x, x)).  (3.17)
Similar to the above, R² is computed such that for any x_i with 0 < α_i < 1/(νℓ), the argument of the sgn is zero.
For kernels k(x, y) that depend on only x − y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem, 3.14 and 3.15, turns out to be equivalent to equation 3.11. It can be shown that the same holds true for the decision function; hence, the two algorithms coincide in that case. This is geometrically plausible. For constant k(x, x), all mapped patterns lie on a sphere in feature space. Therefore, finding the smallest sphere (containing the points) really amounts to finding the smallest segment of the sphere that the data live on. The segment, however, can be found in a straightforward way by simply intersecting the data sphere with a hyperplane; the hyperplane with maximum margin of separation to the origin will cut off the smallest segment.
4 Optimization
Section 3 formulated quadratic programs (QPs) for computing regions that capture a certain fraction of the data. These constrained optimization problems can be solved using an off-the-shelf QP package to compute the solution. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In this section, we describe an algorithm that takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (sequential minimal optimization), an SV training algorithm originally proposed for classification (Platt, 1999) and subsequently adapted to regression estimation (Smola & Schölkopf, in press).
The strategy of SMO is to break up the constrained minimization of equation 3.11 into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. We therefore resort to optimizing over pairs of variables.
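For small ℓ, the dual, equation 3.11, can indeed be handed to a generic constrained optimizer, which is a useful cross-check for a pairwise solver; a sketch using SciPy's SLSQP on hypothetical one-dimensional data (all names and parameter values illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(size=8)                                 # hypothetical 1D sample
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)      # gaussian Gram matrix
ell, nu = len(x), 0.5
ub = 1.0 / (nu * ell)                                  # box bound 1/(nu*ell)

res = minimize(
    lambda a: 0.5 * a @ K @ a,                         # dual objective of eq. 3.11
    x0=np.full(ell, 1.0 / ell),                        # feasible start: sum = 1
    jac=lambda a: K @ a,
    bounds=[(0.0, ub)] * ell,
    constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
    method="SLSQP",
)
alpha = res.x
# Recover rho from a non-bound alpha_i via equation 3.12, if one exists.
nb = np.where((alpha > 1e-6) & (alpha < ub - 1e-6))[0]
rho = (K[nb[0]] @ alpha) if len(nb) else None
print(abs(alpha.sum() - 1.0) < 1e-6)  # True: the equality constraint of 3.11 holds
```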
4.1 Elementary Optimization Step. For instance, consider optimizing over α_1 and α_2 with all other variables fixed. Using the shorthand K_ij := k(x_i, x_j), equation 3.11 then reduces to

min_{α_1, α_2}  (1/2) Σ_{i,j=1}^{2} α_i α_j K_ij + Σ_{i=1}^{2} α_i C_i + C,  (4.1)
with C_i := Σ_{j=3}^{ℓ} α_j K_ij and C := Σ_{i,j=3}^{ℓ} α_i α_j K_ij, subject to

0 ≤ α_1, α_2 ≤ 1/(νℓ),  Σ_{i=1}^{2} α_i = Δ,  (4.2)

where Δ := 1 − Σ_{i=3}^{ℓ} α_i.
We discard C, which is independent of α_1 and α_2, and eliminate α_1 to obtain

min_{α_2}  (1/2)(Δ − α_2)² K_11 + (Δ − α_2) α_2 K_12 + (1/2) α_2² K_22 + (Δ − α_2) C_1 + α_2 C_2,  (4.3)

with the derivative

−(Δ − α_2) K_11 + (Δ − 2α_2) K_12 + α_2 K_22 − C_1 + C_2.  (4.4)

Setting this to zero and solving for α_2, we get

α_2 = (Δ (K_11 − K_12) + C_1 − C_2) / (K_11 + K_22 − 2K_12).  (4.5)
Once α_2 is found, α_1 can be recovered from α_1 = Δ − α_2. If the new point (α_1, α_2) is outside of [0, 1/(νℓ)], the constrained optimum is found by projecting α_2 from equation 4.5 into the region allowed by the constraints and then recomputing α_1.
The offset ρ is recomputed after every such step.
Additional insight can be obtained by rewriting equation 4.5 in terms of the outputs of the kernel expansion on the examples x_1 and x_2 before the optimization step. Let α*_1, α*_2 denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. equation 3.10) read

O_i := K_1i α*_1 + K_2i α*_2 + C_i.  (4.6)

Using the latter to eliminate the C_i, we end up with an update equation for α_2 which does not explicitly depend on α*_1,

α_2 = α*_2 + (O_1 − O_2) / (K_11 + K_22 − 2K_12),  (4.7)

which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of ν-constraint satisfaction.
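The elementary step, including the clipping described above, can be sketched as follows (assuming NumPy; K is a precomputed Gram matrix and ub stands for the box bound 1/(νℓ); this is an illustration, not the full SMO solver):

```python
import numpy as np

def smo_pair_step(K, alpha, i, j, ub):
    """One analytic step on the pair (alpha_i, alpha_j), all other variables fixed."""
    delta = alpha[i] + alpha[j]                 # preserved by the sum constraint
    O_i, O_j = K[i] @ alpha, K[j] @ alpha       # outputs before the step, eq. 4.6
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]     # second derivative along the step
    if eta <= 0.0:
        return alpha                            # degenerate pair; a full solver picks another
    a_j = alpha[j] + (O_i - O_j) / eta          # unconstrained optimum, eq. 4.7
    # Project onto the box [0, ub]^2 intersected with alpha_i + alpha_j = delta.
    a_j = min(max(a_j, max(0.0, delta - ub)), min(ub, delta))
    out = alpha.copy()
    out[j], out[i] = a_j, delta - a_j
    return out

# Tiny check: the step never increases the dual objective 0.5 * a K a.
x = np.array([0.0, 0.3, 2.0, 2.2])
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)
alpha = np.full(4, 0.25)
new = smo_pair_step(K, alpha, 0, 1, ub=0.5)
print(0.5 * new @ K @ new <= 0.5 * alpha @ K @ alpha + 1e-12)  # True
```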
Clearly, the same elementary optimization step can be applied to any pair of variables, not just α_1 and α_2. We next briefly describe how to do the overall optimization.
4.2 Initialization of the Algorithm. We start by setting a fraction ν of all α_i, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that Σ_i α_i = 1. Moreover, we set the initial ρ to max{O_i: i ∈ [ℓ], α_i > 0}.
4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SV_nb for the indices of variables that are not at bound, that is, SV_nb := {i: i ∈ [ℓ], 0 < α_i < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position.
We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that (O_i − ρ) · α_i > 0 or (ρ − O_i) · (1/(νℓ) − α_i) > 0. Once we have found one, say α_i, we pick α_j according to

j = arg max_{n ∈ SV_nb} |O_i − O_n|.  (4.8)
We repeat that step, but the scan is performed only over SV_nb.
In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SV_nb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates.
In unusual circumstances, the choice heuristic, equation 4.8, cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999) and have been found to work well in the experiments we report below.
In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset ρ at the end.
In the next section, it will be argued that subtracting something from ρ is actually advisable also from a statistical point of view.
² This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2) and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4) and, finally, we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.
In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

x_i := Φ(x_i).  (5.1)
Definition 1. A data set,

x_1, . . . , x_ℓ,  (5.2)

is called separable if there exists some w ∈ F such that (w · x_i) > 0 for i ∈ [ℓ].
If we use a gaussian kernel (see equation 3.3), then any data set x_1, . . . , x_ℓ is separable after it is mapped into feature space. To see this, note that k(x_i, x_j) > 0 for all i, j; thus, all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they all have unit length. Hence they are separable from the origin.
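This argument is easy to check numerically: a gaussian Gram matrix has unit diagonal and strictly positive entries, so w = Σ_i Φ(x_i) already separates the mapped data from the origin. A small sketch with hypothetical data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))                            # hypothetical data
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)    # pairwise ||x_i - x_j||^2
K = np.exp(-sq / 0.5)                                   # gaussian Gram matrix

print(np.allclose(np.diag(K), 1.0))   # True: unit length in feature space
print(np.all(K > 0))                  # True: all in the same orthant
# Hence w = sum_i Phi(x_i) gives (w . Phi(x_j)) = sum_i K[i, j] > 0 for all j:
print(np.all(K.sum(axis=0) > 0))      # True: separable from the origin
```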
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

min_{w∈F}  (1/2)‖w‖²  subject to  (w · x_i) ≥ ρ,  i ∈ [ℓ].  (5.3)
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set,

{(x_1, 1), . . . , (x_ℓ, 1), (−x_1, −1), . . . , (−x_ℓ, −1)}.  (5.4)
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set,

{(x_1, y_1), . . . , (x_ℓ, y_ℓ)}  (y_i ∈ {±1} for i ∈ [ℓ]),  (5.5)

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set,

{y_1 x_1, . . . , y_ℓ x_ℓ}.  (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x), which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
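Proposition 3 can also be observed empirically. The sketch below assumes scikit-learn, whose OneClassSVM implements the ν-parameterization of section 3, and uses hypothetical gaussian data; up to small finite-sample and numerical effects, the fraction of outliers stays at or below ν and the fraction of SVs at or above it:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))        # hypothetical training sample
nu = 0.2

clf = OneClassSVM(nu=nu, kernel="rbf", gamma=1.0).fit(X)
frac_outliers = np.mean(clf.predict(X) == -1)   # points outside the region
frac_svs = len(clf.support_) / len(X)           # expansion patterns with alpha_i > 0

# Up to numerical tolerance: frac_outliers <= nu <= frac_svs (proposition 3).
print(frac_outliers <= nu + 0.03, frac_svs >= nu - 0.03)
```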
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.
Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution
lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ ℝ. For x ∈ X, let d(x, f, θ) := max{0, θ − f(x)}. Similarly, for a training sequence X := (x_1, . . . , x_ℓ), we define

D(X, f, θ) := Σ_{x∈X} d(x, f, θ).
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ X^ℓ, generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution f_w given explicitly by equation 3.10. Let R_{w,ρ} := {x: f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

P{x′: x′ ∉ R_{w,ρ−γ}} ≤ (2/ℓ) (k + log(ℓ²/(2δ))),  (5.7)

where

k = (c_1 log(c_2 γ̂² ℓ)) / γ̂² + (2D/γ̂) log(e ((2ℓ − 1) γ̂ / (2D) + 1)) + 2,  (5.8)

c_1 = 16c², c_2 = ln(2)/(4c²), c = 103, γ̂ = γ/‖w‖, D = D(X, f_{w,0}, ρ) = D(X, f_{w,ρ}, 0), and ρ is given by equation 3.12.
The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}.
The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂; that is, it increases with w. This justifies measuring the complexity of the estimated region by the size of w, and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which
³ Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
ν, width c:     0.5, 0.5 | 0.5, 0.5 | 0.1, 0.5 | 0.5, 0.1
frac. SVs/OLs:  0.54, 0.43 | 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38
margin ρ/‖w‖:   0.84 | 0.70 | 0.62 | 0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that, as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
6 9 2 8 1 8 8 6 5 3
2 3 8 7 0 3 0 8 2 7
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels.
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such) while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above, and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3 to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
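A scaled-down analogue of the USPS single-digit experiment can be run on the 8 × 8 digits set that ships with scikit-learn; the sketch below illustrates the protocol rather than reproducing the numbers reported above (the split sizes and parameter values are ours):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import OneClassSVM

digits = load_digits()
zeros = digits.data[digits.target == 0] / 16.0   # scale pixel values to [0, 1]
others = digits.data[digits.target != 0] / 16.0

train, test = zeros[:120], zeros[120:]           # hold out some zeros for testing
clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(train)

tp_rate = np.mean(clf.predict(test) == 1)        # held-out zeros accepted
fp_rate = np.mean(clf.predict(others) == 1)      # non-zeros wrongly accepted
# As in Figure 3, smaller nu trades a higher acceptance of the target class
# against more false positives; zeros should score far better than non-zeros.
print(tp_rate > fp_rate)
```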
In addition to the experiments reported above, the algorithm has since
(Class label / output · 10⁻⁵:)  9/513  1/507  0/458  1/377  7/282  2/216  3/200  9/186  5/179  0/162
3/153  6/143  6/128  0/123  7/117  5/93  0/78  7/58  6/52  3/48
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν. USPS Test Set, Size ℓ = 2007.

ν [%]   Fraction of OLs [%]   Fraction of SVs [%]   Training Time (CPU sec)
  1          0.0                  10.0                     36
  2          0.0                  10.0                     39
  3          0.1                  10.0                     31
  4          0.6                  10.1                     40
  5 *        1.4                  10.6                     36
  6          1.8                  11.2                     33
  7          2.6                  11.5                     42
  8          4.1                  12.0                     53
  9          5.4                  12.9                     76
 10          6.2                  13.7                     65
 20         16.9                  22.6                    193
 30         27.5                  31.8                    269
 40         37.1                  41.7                    685
 50         47.4                  51.2                   1284
 60         58.2                  61.0                   1150
 70         68.3                  70.7                   1512
 80         78.5                  80.5                   2206
 90         89.4                  90.1                   2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments (ν = 5%) is marked with an asterisk.
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz; training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
$$\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) = \sum_{i,j} \alpha_i \alpha_j \left( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \right)$$
$$= \sum_{i,j} \alpha_i \alpha_j \left( k(x_i, \cdot) \cdot (P^* P k)(x_j, \cdot) \right) = \sum_{i,j} \alpha_i \alpha_j \left( (P k)(x_i, \cdot) \cdot (P k)(x_j, \cdot) \right)$$
$$= \left( \left( P \sum_i \alpha_i k(x_i, \cdot) \right) \cdot \left( P \sum_j \alpha_j k(x_j, \cdot) \right) \right) = \| P f \|^2,$$
using f(x) = Σ_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∼ e^{−‖Pf‖²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
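The identity between the dual objective and a squared norm can be checked numerically for a kernel whose feature map is known explicitly. A sketch (our own illustration, not from the paper) using the polynomial kernel k(x, y) = (x · y)² on R², whose feature map Φ(x) = (x₁², √2 x₁x₂, x₂²) is standard: the quadratic form αᵀKα coincides with ‖w‖² = ‖Σ_i α_i Φ(x_i)‖², the squared norm the regularizer controls.

```python
import numpy as np

def phi(x):
    # explicit feature map of the polynomial kernel k(x, y) = (x . y)^2 on R^2
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))
alpha = rng.random(5)

K = (X @ X.T) ** 2                       # kernel (Gram) matrix
dual = alpha @ K @ alpha                 # dual objective, up to the factor 1/2
w = sum(a * phi(x) for a, x in zip(alpha, X))   # w = sum_i alpha_i Phi(x_i)
```

The two quantities agree to machine precision, and the quadratic form is nonnegative because K is positive semidefinite.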
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe; a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined, and the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply

$$\hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), \tag{A.1}$$

where B(x, ε) is the l₂(X) ball of radius ε centered at x and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of equation A.1 with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), Ĉ^plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ^plug-in, they analyzed Ĉ^plug-in_β = {x : p̂_ℓ(x) > β_ℓ}, where (β_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to β, but this connection cannot be readily exploited by this type of estimator.
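The estimator of equation A.1 is straightforward to sketch in code: a query point belongs to Ĉ_ℓ exactly when it falls inside one of the ℓ balls around the training points (the function name, tolerance, and toy data below are ours):

```python
import numpy as np

def in_estimated_support(x, X_train, eps):
    # x lies in C_hat_l iff it falls in some ball B(x_i, eps), cf. equation A.1
    return bool((np.linalg.norm(X_train - x, axis=1) <= eps).any())

X_train = np.array([[0.0, 0.0], [1.0, 0.0]])
inside = in_estimated_support(np.array([0.1, 0.0]), X_train, eps=0.2)
outside = in_estimated_support(np.array([5.0, 5.0]), X_train, eps=0.2)
```

Note that ε_ℓ must shrink with ℓ at an appropriate rate for the consistency results cited above to apply; a fixed ε only illustrates the membership test.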
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).
Define the excess mass over C at level α as

$$E_{\mathcal{C}}(\alpha) = \sup \{ H_\alpha(C) : C \in \mathcal{C} \},$$

where H_α(·) = (P − αλ)(·) and again λ denotes Lebesgue measure. Any set Γ_C(α) ∈ C such that

$$E_{\mathcal{C}}(\alpha) = H_\alpha(\Gamma_{\mathcal{C}}(\alpha))$$

is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts Ê_C(α) and Γ̂_C(α). In other words, his estimator is

$$\hat{\Gamma}_{\mathcal{C}}(\alpha) = \arg\max \{ (P_\ell - \alpha \lambda)(C) : C \in \mathcal{C} \},$$

where the max is not necessarily unique. Now while Γ̂_C(α) is clearly different from Ĉ_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, Γ̂_C(α) is more closely related to the determination of density contour clusters at level α:

$$\Gamma_p(\alpha) = \{ x : p(x) \geq \alpha \}.$$
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating Γ_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A_ε^(ℓ) with respect to p by

$$A_\epsilon^{(\ell)} = \left\{ (x_1, \ldots, x_\ell) \in S^\ell : \left| -\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X) \right| \leq \epsilon \right\},$$

where p(x₁, …, x_ℓ) = ∏_{i=1}^ℓ p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log (a_ℓ / b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

$$\mathrm{vol}\, A_\epsilon^{(\ell)} \doteq \mathrm{vol}\, C_\ell(1 - \delta) \doteq 2^{\ell h}.$$

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
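The asymptotic equipartition property underlying this volume result can be illustrated numerically: for iid samples from a known density, −(1/ℓ) log p(x₁, …, x_ℓ) concentrates at h(X). A sketch for the standard gaussian (using natural logs rather than the base-2 logs of Cover and Thomas; the sample size and tolerance are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
l = 2000
x = rng.standard_normal(l)

# per-sample log density of N(0, 1); -1/l log p(x_1..x_l) is minus its mean
log_p = -0.5 * np.log(2 * np.pi) - 0.5 * x ** 2
empirical = -log_p.mean()

# differential entropy of N(0, 1) in nats: h = 0.5 * ln(2 * pi * e)
h = 0.5 * np.log(2 * np.pi * np.e)
```

By the law of large numbers, `empirical` approaches h as ℓ grows, which is exactly the statement that the sample is, with high probability, typical.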
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for, if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i (w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument, showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x′_o = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ′_o > 0; hence we still have α_o = 1/(νℓ). Therefore, α′ = α is still feasible, as is the primal solution (w′, ξ′, ρ′). Here, we use ξ′_i = (1 + δ · α_o)ξ_i for i ≠ o, w′ = (1 + δ · α_o)w, and ρ′ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α′_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, let A be a subset of X, and let ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).

Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
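A greedy construction gives a simple upper bound on the ε-covering number of a finite set: repeatedly declare any uncovered point a new center. This is only an illustration of definition 3 (the minimal cover can be smaller in general, and here the cover is drawn from A itself, which the definition permits since U ⊆ X; the function name is ours):

```python
def greedy_cover(points, eps, d=lambda a, b: abs(a - b)):
    """Greedy epsilon-cover of a finite set: its size upper-bounds N(eps, A, d)."""
    centers = []
    for a in points:
        # a point already within eps of some center is covered (note the <=,
        # matching the "less than or equal to" convention of definition 3)
        if not any(d(a, u) <= eps for u in centers):
            centers.append(a)
    return centers

n_wide = len(greedy_cover([0, 1, 2, 3], eps=1.0))   # each center covers a neighbor
n_tight = len(greedy_cover([0, 1, 2, 3], eps=0.5))  # each center covers only itself
```

With spacing 1 between the points, ε = 1 yields a cover of size 2 and ε = 0.5 forces a cover of size 4, showing how the covering number grows as the scale shrinks.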
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l_∞ norm over a finite sample X = (x₁, …, x_ℓ) for the pseudonorm on F:

$$\| f \|_{l_\infty^X} = \max_{x \in X} | f(x) |. \tag{B.2}$$

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose X is compact. Let N(ε, F, ℓ) = max_{X ∈ X^ℓ} N(ε, F, l_∞^X) (the maximum exists by definition of the covering number and compactness of X).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).

Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x₁, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

$$P\left\{ x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma \right\} \leq \epsilon(k, \ell, \delta) = \frac{2}{\ell}\left( k + \log \frac{\ell}{\delta} \right),$$

where k = ⌈log N(γ, F, 2ℓ)⌉.
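The bound itself is cheap to evaluate once the covering number term k is known. A small helper (our own naming; k is treated as a fixed input for illustration, although it itself depends on ℓ through N(γ, F, 2ℓ)):

```python
from math import log

def error_bound(k, l, delta):
    # epsilon(k, l, delta) = (2/l) * (k + log(l/delta)), as in theorem 2
    return (2.0 / l) * (k + log(l / delta))

# with k held fixed, the bound decays roughly like 1/l in the sample size
b_1000 = error_bound(k=50, l=1000, delta=0.01)
b_2000 = error_bound(k=50, l=2000, delta=0.01)
```

This makes the trade-off visible: a richer function class (larger k) or a smaller confidence parameter δ loosens the bound, while more samples tighten it.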
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i∈[ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.

We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.

Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with support supp(f) countable; that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

$$f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x) g(x).$$

The 1-norm on L(X) is defined by ‖f‖₁ = Σ_{x∈supp(f)} f(x), and let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

$$\delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases}$$

For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by

$$g_f(y) = g^{X,\theta}_f(y) = \sum_{x \in X} d(x, f, \theta)\, \delta_x(y).$$
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g^{X,θ}_f ∈ L_B(X) (i.e., Σ_{x∈X} d(x, f, θ) ≤ B),

$$P\left\{ x : f(x) < \theta - 2\gamma \right\} \leq \frac{2}{\ell}\left( k + \log \frac{\ell}{\delta} \right), \tag{B.3}$$

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.

The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.

The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997); their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.

In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functional of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research.
Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Diettrich, & P. Anuzis (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian.] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufman.
Received November 19, 1999; accepted September 22, 2000.
(w, ρ) are a weight vector and an offset parameterizing a hyperplane in the feature space associated with the kernel.

The main contribution of this work is that we propose an algorithm that has tractable computational complexity, even in high-dimensional cases. Our theory, which uses tools very similar to those used by Polonik, gives results that we expect will be of more use in a finite sample size setting.
3 Algorithms
We first introduce terminology and notation conventions. We consider training data

$$x_1, \ldots, x_\ell \in \mathcal{X}, \tag{3.1}$$

where ℓ ∈ N is the number of observations and X is some set. For simplicity, we think of it as a compact subset of R^N. Let Φ be a feature map X → F, that is, a map into an inner product space F such that the inner product in the image of Φ can be computed by evaluating some simple kernel (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995; Schölkopf, Burges, et al., 1999),

$$k(x, y) = (\Phi(x) \cdot \Phi(y)), \tag{3.2}$$

such as the gaussian kernel,

$$k(x, y) = e^{-\|x - y\|^2 / c}. \tag{3.3}$$

Indices i and j are understood to range over 1, …, ℓ (in compact notation: i, j ∈ [ℓ]). Boldface Greek letters denote ℓ-dimensional vectors whose components are labeled using a normal typeface.
In the remainder of this section, we develop an algorithm that returns a function f that takes the value +1 in a "small" region capturing most of the data points and −1 elsewhere. Our strategy is to map the data into the feature space corresponding to the kernel and to separate them from the origin with maximum margin. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on, in feature space. Via the freedom to use different types of kernel functions, this simple geometric picture corresponds to a variety of nonlinear estimators in input space.
To separate the data set from the origin, we solve the following quadratic program:

$$\min_{w \in F,\, \boldsymbol{\xi} \in \mathbb{R}^\ell,\, \rho \in \mathbb{R}} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu\ell}\sum_i \xi_i - \rho \tag{3.4}$$

$$\text{subject to} \quad (w \cdot \Phi(x_i)) \geq \rho - \xi_i, \quad \xi_i \geq 0. \tag{3.5}$$

Here, ν ∈ (0, 1] is a parameter whose meaning will become clear later.
Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function,

$$f(x) = \mathrm{sgn}((w \cdot \Phi(x)) - \rho), \tag{3.6}$$

will be positive for most examples x_i contained in the training set,¹ while the SV-type regularization term ‖w‖ will still be small. The actual trade-off between these two goals is controlled by ν.
Using multipliers α_i, β_i ≥ 0, we introduce a Lagrangian,

$$L(w, \boldsymbol{\xi}, \rho, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu\ell}\sum_i \xi_i - \rho - \sum_i \alpha_i \left( (w \cdot \Phi(x_i)) - \rho + \xi_i \right) - \sum_i \beta_i \xi_i, \tag{3.7}$$

and set the derivatives with respect to the primal variables w, ξ, ρ equal to zero, yielding

$$w = \sum_i \alpha_i \Phi(x_i), \tag{3.8}$$

$$\alpha_i = \frac{1}{\nu\ell} - \beta_i \leq \frac{1}{\nu\ell}, \qquad \sum_i \alpha_i = 1. \tag{3.9}$$
In equation 3.8, all patterns {x_i : i ∈ [ℓ], α_i > 0} are called support vectors. Together with equation 3.2, the SV expansion transforms the decision function, equation 3.6, into a kernel expansion:

$$f(x) = \mathrm{sgn}\left( \sum_i \alpha_i k(x_i, x) - \rho \right). \tag{3.10}$$
Substituting equations 3.8 and 3.9 into L (see equation 3.7) and using equation 3.2, we obtain the dual problem:

$$\min_{\boldsymbol{\alpha}} \quad \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \quad \text{subject to} \quad 0 \leq \alpha_i \leq \frac{1}{\nu\ell}, \quad \sum_i \alpha_i = 1. \tag{3.11}$$
One can show that at the optimum, the two inequality constraints, equation 3.5, become equalities if α_i and β_i are nonzero, that is, if 0 < α_i < 1/(νℓ). Therefore, we can recover ρ by exploiting that for any such α_i, the corresponding pattern x_i satisfies

$$\rho = (w \cdot \Phi(x_i)) = \sum_j \alpha_j k(x_j, x_i). \tag{3.12}$$

¹ We use the convention that sgn(z) equals 1 for z ≥ 0 and −1 otherwise.
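The strategy of optimizing over pairs of patterns (cf. section 4) can be sketched in code. Below is a minimal numpy illustration of solving the dual 3.11, not the paper's actual SMO variant: the pairwise update rule, the random pair schedule, the kernel width c = 4, and all names are our own simplifications. After training, the fractions of outliers and SVs can be compared with the ν-property of Proposition 3.

```python
import numpy as np

def gaussian_kernel(X, c):
    # k(x, y) = exp(-||x - y||^2 / c), the gaussian kernel of equation 3.3
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def one_class_svm(K, nu, sweeps=300, seed=0):
    """Minimize 1/2 a^T K a  s.t.  0 <= a_i <= 1/(nu*l), sum_i a_i = 1,
    by analytic updates over pairs, which preserve both constraints."""
    l = K.shape[0]
    C = 1.0 / (nu * l)
    a = np.full(l, 1.0 / l)            # feasible start (1/l <= C since nu <= 1)
    g = K @ a                          # g_i = sum_j a_j k(x_i, x_j)
    rng = np.random.default_rng(seed)
    for _ in range(sweeps):
        for i, j in zip(rng.permutation(l), rng.permutation(l)):
            eta = K[i, i] - 2.0 * K[i, j] + K[j, j]
            if i == j or eta < 1e-12:
                continue
            s = a[i] + a[j]                          # preserved by the update
            ai = a[i] + (g[j] - g[i]) / eta          # Newton step along e_i - e_j
            ai = min(max(ai, max(0.0, s - C)), min(C, s))   # clip to the box
            d = ai - a[i]
            g += d * (K[:, i] - K[:, j])
            a[i], a[j] = ai, s - ai
    margin = (a > 1e-8) & (a < C - 1e-8)
    rho = g[margin].mean() if margin.any() else g[a > 1e-8].max()  # equation 3.12
    return a, rho, g

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
nu = 0.3
a, rho, g = one_class_svm(gaussian_kernel(X, c=4.0), nu)
frac_outliers = float((g < rho - 1e-6).mean())
frac_svs = float((a > 1e-8).mean())
```

On such a sample, the fraction of outliers should come out at or just below ν while at least roughly a fraction ν of the points become support vectors, in line with Proposition 3.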
Note that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity; that is, the second inequality constraint in equation 3.11 becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (see equation 3.4). It is still a feasible problem, since we have placed no restriction on the offset ρ, so it can become a large negative number in order to satisfy equation 3.5. If we had required ρ ≥ 0 from the start, we would have ended up with the constraint Σ_i α_i ≥ 1 instead of the corresponding equality constraint in equation 3.11, and the multipliers α_i could have diverged.
It is instructive to compare equation 3.11 to a Parzen windows estimator. To this end, suppose we use a kernel that can be normalized as a density in input space, such as the gaussian (see equation 3.3). If we use ν = 1, then the two constraints only allow the solution α_1 = ⋯ = α_ℓ = 1/ℓ. Thus the kernel expansion in equation 3.10 reduces to a Parzen windows estimate of the underlying density. For ν < 1, the equality constraint in equation 3.11 still ensures that the decision function is a thresholded density; however, in that case, the density will be represented only by a subset of training examples (the SVs): those that are important for the decision (see equation 3.10) to be taken. Section 5 will explain the precise meaning of ν.
To conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of Scholkopf, Burges, and Vapnik (1995), with hard boundaries, and Tax and Duin (1999), with "soft margins." Again, we try to put most of the data into a small ball by solving, for ν ∈ (0, 1),

$$\min_{R\in\mathbb{R},\,\xi\in\mathbb{R}^\ell,\,c\in F}\ R^2 + \frac{1}{\nu\ell}\sum_i \xi_i \quad\text{subject to}\quad \|\Phi(x_i)-c\|^2 \le R^2 + \xi_i,\ \ \xi_i \ge 0\ \ \text{for } i\in[\ell]. \tag{3.13}$$
This leads to the dual,

$$\min_\alpha\ \sum_{ij}\alpha_i\alpha_j k(x_i,x_j) - \sum_i \alpha_i k(x_i,x_i), \tag{3.14}$$

$$\text{subject to}\quad 0\le\alpha_i\le\frac{1}{\nu\ell},\quad \sum_i \alpha_i = 1, \tag{3.15}$$

and the solution,

$$c = \sum_i \alpha_i \Phi(x_i), \tag{3.16}$$

corresponding to a decision function of the form

$$f(x) = \operatorname{sgn}\left( R^2 - \sum_{ij}\alpha_i\alpha_j k(x_i,x_j) + 2\sum_i \alpha_i k(x_i,x) - k(x,x) \right). \tag{3.17}$$
Similar to the above, R² is computed such that for any x_i with 0 < α_i < 1/(νℓ), the argument of the sgn is zero.

For kernels k(x, y) that depend only on x − y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem in equations 3.14 and 3.15 turns out to be equivalent to equation 3.11. It can be shown that the same holds true for the decision function; hence, the two algorithms coincide in that case. This is geometrically plausible. For constant k(x, x), all mapped patterns lie on a sphere in feature space. Therefore, finding the smallest sphere (containing the points) really amounts to finding the smallest segment of the sphere that the data live on. The segment, however, can be found in a straightforward way by simply intersecting the data sphere with a hyperplane: the hyperplane with maximum margin of separation to the origin will cut off the smallest segment.
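As a hypothetical sketch of the ball formulation's decision function, equation 3.17, with R² fixed so that the sgn argument vanishes at a non-bound SV, as described above (`ball_decision` and the brute-force sums are our construction, not the paper's code):

```python
def ball_decision(x_new, X, alpha, kernel, nu, tol=1e-8):
    """Decision function of the ball formulation, equation 3.17.

    R^2 is chosen so that the sgn argument vanishes for some x_i with
    0 < alpha_i < 1/(nu*ell) (a non-bound SV). Brute-force O(ell^2)
    sums; illustrative only.
    """
    ell = len(X)
    quad = sum(alpha[i] * alpha[j] * kernel(X[i], X[j])
               for i in range(ell) for j in range(ell))

    def arg_without_R2(x):
        return (2 * sum(alpha[i] * kernel(X[i], x) for i in range(ell))
                - kernel(x, x) - quad)

    upper = 1.0 / (nu * ell)
    i0 = next(i for i in range(ell) if tol < alpha[i] < upper - tol)
    R2 = -arg_without_R2(X[i0])          # makes the argument zero at x_i0
    # sgn with the convention of footnote 1: sgn(0) = +1
    return 1.0 if R2 + arg_without_R2(x_new) >= 0 else -1.0
```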
4 Optimization
Section 3 formulated quadratic programs (QPs) for computing regions that capture a certain fraction of the data. These constrained optimization problems can be solved using an off-the-shelf QP package to compute the solution. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In this section, we describe an algorithm that takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (sequential minimal optimization), an SV training algorithm originally proposed for classification (Platt, 1999), and subsequently adapted to regression estimation (Smola & Scholkopf, in press).
The strategy of SMO is to break up the constrained minimization of equation 3.11 into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. We therefore resort to optimizing over pairs of variables.
4.1 Elementary Optimization Step. For instance, consider optimizing over α_1 and α_2 with all other variables fixed. Using the shorthand K_ij = k(x_i, x_j), equation 3.11 then reduces to

$$\min_{\alpha_1,\alpha_2}\ \frac{1}{2}\sum_{i,j=1}^{2}\alpha_i\alpha_j K_{ij} + \sum_{i=1}^{2}\alpha_i C_i + C, \tag{4.1}$$

with $C_i = \sum_{j=3}^{\ell}\alpha_j K_{ij}$ and $C = \sum_{i,j=3}^{\ell}\alpha_i\alpha_j K_{ij}$, subject to

$$0 \le \alpha_1, \alpha_2 \le \frac{1}{\nu\ell},\quad \sum_{i=1}^{2}\alpha_i = \Delta, \tag{4.2}$$

where $\Delta = 1 - \sum_{i=3}^{\ell}\alpha_i$.

We discard C, which is independent of α_1 and α_2, and eliminate α_1 to obtain
$$\min_{\alpha_2}\ \frac{1}{2}(\Delta-\alpha_2)^2 K_{11} + (\Delta-\alpha_2)\alpha_2 K_{12} + \frac{1}{2}\alpha_2^2 K_{22} + (\Delta-\alpha_2)C_1 + \alpha_2 C_2, \tag{4.3}$$

with the derivative

$$-(\Delta-\alpha_2)K_{11} + (\Delta-2\alpha_2)K_{12} + \alpha_2 K_{22} - C_1 + C_2. \tag{4.4}$$

Setting this to zero and solving for α_2, we get

$$\alpha_2 = \frac{\Delta(K_{11}-K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2K_{12}}. \tag{4.5}$$
Once α_2 is found, α_1 can be recovered from α_1 = Δ − α_2. If the new point (α_1, α_2) is outside of [0, 1/(νℓ)], the constrained optimum is found by projecting α_2 from equation 4.5 into the region allowed by the constraints and then recomputing α_1.
The offset ρ is recomputed after every such step.

Additional insight can be obtained by rewriting equation 4.5 in terms of the outputs of the kernel expansion on the examples x_1 and x_2 before the optimization step. Let α_1*, α_2* denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. equation 3.10) read

$$O_i = K_{1i}\,\alpha_1^* + K_{2i}\,\alpha_2^* + C_i. \tag{4.6}$$

Using the latter to eliminate the C_i, we end up with an update equation for α_2 which does not explicitly depend on α_1*,

$$\alpha_2 = \alpha_2^* + \frac{O_1 - O_2}{K_{11} + K_{22} - 2K_{12}}, \tag{4.7}$$
which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of ν-constraint satisfaction.
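The elementary step, equations 4.5 to 4.7 together with the projection onto the box constraints, can be sketched as follows (our naming; the paper gives no code):

```python
def smo_pair_step(alpha, K, O, i, j, nu, ell):
    """One elementary SMO step over the pair (alpha_i, alpha_j),
    following equations 4.5-4.7: unconstrained optimum along the
    constraint line, then projection into the box [0, 1/(nu*ell)].
    alpha: list of dual variables; K: Gram matrix; O: current outputs
    O_m = sum_n alpha_n K[n][m] (cf. equation 4.6)."""
    upper = 1.0 / (nu * ell)
    delta = alpha[i] + alpha[j]          # conserved by the equality constraint
    eta = K[i][i] + K[j][j] - 2 * K[i][j]
    if eta <= 0:
        return alpha                     # degenerate direction; skip
    aj = alpha[j] + (O[i] - O[j]) / eta  # equation 4.7
    # project a_j onto the feasible segment, then recompute a_i
    aj = min(max(aj, max(0.0, delta - upper)), min(upper, delta))
    alpha[i], alpha[j] = delta - aj, aj
    return alpha
```

Because the pair subproblem is quadratic, the Newton step of equation 4.7 reaches the unconstrained optimum in a single update; only the clipping step remains.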
Clearly, the same elementary optimization step can be applied to any pair of variables, not just α_1 and α_2. We next briefly describe how to do the overall optimization.
4.2 Initialization of the Algorithm. We start by setting a fraction ν of all α_i, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that Σ_i α_i = 1. Moreover, we set the initial ρ to max{O_i : i ∈ [ℓ], α_i > 0}.
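A sketch of this initialization (our naming; assigning the fractional remainder to the next randomly chosen index is an assumption, since the paper does not specify which example receives it):

```python
import numpy as np

def init_alpha(ell, nu, rng=np.random.default_rng(0)):
    """Initialize the dual variables as in section 4.2: a randomly
    chosen fraction nu of the alpha_i is set to 1/(nu*ell); if nu*ell
    is not an integer, one extra example gets the remainder so that
    sum_i alpha_i = 1."""
    upper = 1.0 / (nu * ell)
    alpha = np.zeros(ell)
    n_full = int(nu * ell)               # number of variables at the bound
    idx = rng.permutation(ell)
    alpha[idx[:n_full]] = upper
    remainder = 1.0 - n_full * upper
    if remainder > 1e-12 and n_full < ell:   # nu*ell not an integer
        alpha[idx[n_full]] = remainder
    return alpha
```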
4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SVnb for the indices of variables that are not at bound, that is, SVnb = {i : i ∈ [ℓ], 0 < α_i < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position.

We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that (O_i − ρ) · α_i > 0 or (ρ − O_i) · (1/(νℓ) − α_i) > 0. Once we have found one, say α_i, we pick α_j according to

$$j = \arg\max_{n\in\mathrm{SVnb}} |O_i - O_n|. \tag{4.8}$$
We repeat that step, but the scan is performed only over SVnb.

In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SVnb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates.
In unusual circumstances, the choice heuristic, equation 4.8, cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999) and have been found to work well in the experiments we report below.
In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset ρ at the end.
In the next section, it will be argued that subtracting something from ρ is actually advisable also from a statistical point of view.
2. This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2) and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4), and finally we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.
In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

$$x_i = \Phi(x_i). \tag{5.1}$$
Definition 1. A data set

$$x_1, \ldots, x_\ell \tag{5.2}$$

is called separable if there exists some w ∈ F such that (w · x_i) > 0 for i ∈ [ℓ].
If we use a gaussian kernel (see equation 3.3), then any data set x_1, …, x_ℓ is separable after it is mapped into feature space. To see this, note that k(x_i, x_j) > 0 for all i, j; thus, all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they all have unit length. Hence, they are separable from the origin.
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

$$\min_{w\in F}\ \frac{1}{2}\|w\|^2 \quad\text{subject to}\quad (w \cdot x_i) \ge \rho,\ \ i \in [\ell]. \tag{5.3}$$
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set

$$\{(x_1, +1), \ldots, (x_\ell, +1), (-x_1, -1), \ldots, (-x_\ell, -1)\}. \tag{5.4}$$

(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set,

$$\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \quad (y_i \in \{\pm 1\}\ \text{for } i \in [\ell]), \tag{5.5}$$

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set

$$\{y_1 x_1, \ldots, y_\ell x_\ell\}. \tag{5.6}$$
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds cum grano salis for the training sets with margin errors and outliers, respectively, removed.
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Scholkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x), which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ ℝ. For x ∈ X, let d(x, f, θ) = max{0, θ − f(x)}. Similarly, for a training sequence X = (x_1, …, x_ℓ), we define

$$\mathcal{D}(X, f, \theta) = \sum_{x \in X} d(x, f, \theta).$$
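Definition 2 translates directly into code (our naming):

```python
def d(x, f, theta):
    # d(x, f, theta) = max{0, theta - f(x)}  (definition 2)
    return max(0.0, theta - f(x))

def D(X, f, theta):
    # total slack of the sample below level theta
    return sum(d(x, f, theta) for x in X)
```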
In the following log denotes logarithms to base 2 and ln denotes naturallogarithms
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ 𝒳^ℓ, generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution f_w given explicitly by equation 3.10. Let R_{w,ρ} = {x : f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ 𝒳^ℓ, for any γ > 0,

$$P\{x' : x' \notin R_{w,\rho-\gamma}\} \le \frac{2}{\ell}\left(k + \log\frac{\ell^2}{2\delta}\right), \tag{5.7}$$

where

$$k = \frac{c_1 \log(c_2 \hat\gamma^2 \ell)}{\hat\gamma^2} + \frac{2\mathcal{D}}{\hat\gamma}\log\left(e\left(\frac{(2\ell-1)\hat\gamma}{2\mathcal{D}} + 1\right)\right) + 2, \tag{5.8}$$

c_1 = 16c², c_2 = ln(2)/(4c²), c = 103, γ̂ = γ/‖w‖, 𝒟 = 𝒟(X, f_{w,0}, ρ) = 𝒟(X, f_{w,ρ}, 0), and ρ is given by equation 3.12.
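The right-hand side of the bound can be evaluated numerically, assuming our transcription of the constants above (recall that log denotes base 2). This is purely illustrative; as discussed below, the constants make the bound loose, and it is often vacuous at practical sample sizes.

```python
from math import log, log2, e

def k_bound(gamma_hat, D, ell, c=103.0):
    """k from equation 5.8, with c1 = 16 c^2 and c2 = ln(2)/(4 c^2)."""
    c1, c2 = 16 * c ** 2, log(2) / (4 * c ** 2)
    term1 = c1 * log2(c2 * gamma_hat ** 2 * ell) / gamma_hat ** 2
    term2 = (2 * D / gamma_hat) * log2(e * ((2 * ell - 1) * gamma_hat / (2 * D) + 1))
    return term1 + term2 + 2

def error_bound(gamma_hat, D, ell, delta):
    """Right-hand side of equation 5.7."""
    return (2 / ell) * (k_bound(gamma_hat, D, ell) + log2(ell ** 2 / (2 * delta)))
```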
The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}.

The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of 𝒟. Note that since 𝒟 is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to 𝒟 that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂; that is, it increases with ‖w‖. This justifies measuring the complexity of the estimated region by the size of w, and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value, ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Scholkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Scholkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which
3. Hayton, Scholkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
ν, width c      0.5, 0.5     0.5, 0.5     0.1, 0.5     0.5, 0.1
frac. SVs/OLs   0.54, 0.43   0.59, 0.47   0.24, 0.03   0.65, 0.38
margin ρ/‖w‖    0.84         0.70         0.62         0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3). In the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The rightmost picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction ν of the kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels.
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time, we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above, and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3 to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
In addition to the experiments reported above, the algorithm has since
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult," in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν; USPS Test Set, Size ℓ = 2007.

 ν [%]   Fraction of OLs [%]   Fraction of SVs [%]   Training Time (CPU sec)
  1             0.0                  10.0                    36
  2             0.0                  10.0                    39
  3             0.1                  10.0                    31
  4             0.6                  10.1                    40
  5             1.4                  10.6                    36
  6             1.8                  11.2                    33
  7             2.6                  11.5                    42
  8             4.1                  12.0                    53
  9             5.4                  12.9                    76
 10             6.2                  13.7                    65
 20            16.9                  22.6                   193
 30            27.5                  31.8                   269
 40            37.1                  41.7                   685
 50            47.4                  51.2                  1284
 60            58.2                  61.0                  1150
 70            68.3                  70.7                  1512
 80            78.5                  80.5                  2206
 90            89.4                  90.1                  2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is shown in boldface.
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz; training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Scholkopf, in press), and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Scholkopf, & Muller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:

$$\begin{aligned}
\sum_{i,j}\alpha_i\alpha_j\, k(x_i, x_j) &= \sum_{i,j}\alpha_i\alpha_j \left(k(x_i, \cdot) \cdot \delta_{x_j}(\cdot)\right) \\
&= \sum_{i,j}\alpha_i\alpha_j \left(k(x_i, \cdot) \cdot (P^*P\, k)(x_j, \cdot)\right) \\
&= \sum_{i,j}\alpha_i\alpha_j \left((P\,k)(x_i, \cdot) \cdot (P\,k)(x_j, \cdot)\right) \\
&= \left(P\Big(\sum_i \alpha_i\, k(x_i, \cdot)\Big) \cdot P\Big(\sum_j \alpha_j\, k(x_j, \cdot)\Big)\right) \\
&= \|Pf\|^2,
\end{aligned}$$
1462 Bernhard Scholkopf et al
using f(x) = Σ_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∼ e^{−‖Pf‖²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe; a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space, and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Scholkopf, Platt, & Smola, 2000), and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results, as the one we have briefly outlined, and the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the possibility of taking into account information about the "abnormal" class (Scholkopf, Platt, & Smola, 2000), is subject of current research.
Appendix A Supplementary Material for Section 2
A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply
$$\hat C_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), \qquad (A.1)$$

where $B(x, \epsilon)$ is the $l_2(X)$ ball of radius $\epsilon$ centered at $x$ and $(\epsilon_\ell)_\ell$ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between C(1) and $\hat C_\ell$. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), $\hat C_{\text{plug-in}} = \{x : \hat p_\ell(x) > 0\}$, where $\hat p_\ell$ is a kernel density estimator. In order to avoid problems with $\hat C_{\text{plug-in}}$, they analyzed $\hat C^{\beta}_{\text{plug-in}} = \{x : \hat p_\ell(x) > \beta_\ell\}$, where $(\beta_\ell)_\ell$ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to β, but this connection cannot be readily exploited by this type of estimator.
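In code, the estimator of equation A.1 amounts to a membership test against a union of balls. A minimal sketch (the radius `eps` stands in for ε_ℓ, and the toy points are assumptions of the illustration, not part of the text):

```python
import math

def in_union_of_balls(x, train, eps):
    """Membership test for the estimator (A.1): the union of
    l2 balls of radius eps centered at the training points."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return any(dist(x, xi) <= eps for xi in train)

train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(in_union_of_balls((0.1, 0.1), train, eps=0.5))  # True: near (0, 0)
print(in_union_of_balls((2.0, 2.0), train, eps=0.5))  # False: far from all points
```

Shrinking `eps` with the sample size, as the decreasing sequence (ε_ℓ)_ℓ prescribes, is what drives the consistency result of Devroye and Wise (1980).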
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) or the center of C(1). (See also Korostelev & Tsybakov, 1993, chap. 8.)
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).

Define the excess mass over $\mathcal{C}$ at level α as

$$E_{\mathcal{C}}(\alpha) = \sup\{H_\alpha(C) : C \in \mathcal{C}\},$$

where $H_\alpha(\cdot) = (P - \alpha\lambda)(\cdot)$ and again λ denotes Lebesgue measure. Any set $\Gamma_{\mathcal{C}}(\alpha) \in \mathcal{C}$ such that

$$E_{\mathcal{C}}(\alpha) = H_\alpha(\Gamma_{\mathcal{C}}(\alpha))$$

is called a generalized α-cluster in $\mathcal{C}$. Replace P by $P_\ell$ in these definitions to obtain their empirical counterparts $E_{\ell,\mathcal{C}}(\alpha)$ and $\Gamma_{\ell,\mathcal{C}}(\alpha)$. In other words, his estimator is

$$\Gamma_{\ell,\mathcal{C}}(\alpha) = \arg\max\{(P_\ell - \alpha\lambda)(C) : C \in \mathcal{C}\},$$
where the max is not necessarily unique. Now, while $\Gamma_{\ell,\mathcal{C}}(\alpha)$ is clearly different from $\hat C_\ell(\alpha)$, it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, $\Gamma_{\ell,\mathcal{C}}(\alpha)$ is more closely related to the determination of density contour clusters at level α:

$$c_p(\alpha) = \{x : p(x) \ge \alpha\}.$$
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating $c_p(\alpha)$. They too made use of VC classes, but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p and define the differential entropy of X by $h(X) = -\int_S p(x) \log p(x)\,dx$. For ε > 0 and ℓ ∈ N, define the typical set $A_\epsilon^{(\ell)}$ with respect to p by

$$A_\epsilon^{(\ell)} = \Big\{(x_1, \ldots, x_\ell) \in S^\ell : \Big| -\tfrac{1}{\ell}\log p(x_1, \ldots, x_\ell) - h(X) \Big| \le \epsilon \Big\},$$

where $p(x_1, \ldots, x_\ell) = \prod_{i=1}^{\ell} p(x_i)$. If $(a_\ell)_\ell$ and $(b_\ell)_\ell$ are sequences, the notation $a_\ell \doteq b_\ell$ means $\lim_{\ell\to\infty} \tfrac{1}{\ell}\log\tfrac{a_\ell}{b_\ell} = 0$. Cover and Thomas (1991) show that for all ε, δ < 1/2,

$$\mathrm{vol}\, A_\epsilon^{(\ell)} \doteq \mathrm{vol}\, C_\ell(1-\delta) \doteq 2^{\ell h}.$$

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately $2^{\ell h}$. This is an ℓ-dimensional volume, so the corresponding side length is $(2^{\ell h})^{1/\ell} = 2^h$. This provides an interpretation of differential entropy" (p. 227).
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that $(w \cdot x_i) \ge \rho$ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane $\{z \in F : (w \cdot z) = \rho\}$ to the origin is ρ/‖w‖. Therefore, the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i): By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ/‖w‖, and a separating hyperplane (w, 0).

Ad (ii): By assumption, w is the shortest vector satisfying $y_i(w \cdot x_i) \ge \rho$ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying $(w \cdot y_i x_i) \ge \rho$ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term $\sum_i \xi_i$ in equation 3.4 will change proportionally to the number of points that have a nonzero ξᵢ (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξᵢ, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose $x_o$ is an outlier, that is, $\xi_o > 0$; hence, by the KKT conditions (Bertsekas, 1995), $\alpha_o = 1/(\nu\ell)$. Transforming it into $x'_o = x_o + \delta \cdot w$, where $|\delta| < \xi_o/\|w\|$, leads to a slack that is still nonzero, that is, $\xi'_o > 0$; hence, we still have $\alpha_o = 1/(\nu\ell)$. Therefore, $\alpha' = \alpha$ is still feasible, as is the primal solution $(w', \xi', \rho')$. Here we use $\xi'_i = (1 + \delta \cdot \alpha_o)\xi_i$ for $i \ne o$, $w' = (1 + \delta \cdot \alpha_o)w$, and $\rho'$ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still $\alpha'_o = 1/(\nu\ell)$. Thus (Bertsekas, 1995), $\alpha$ is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, $\mathcal{N}(\epsilon, A, d)$, is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).

Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional, but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the $l_\infty$ norm over a finite sample $X = (x_1, \ldots, x_\ell)$ for the pseudonorm on F:

$$\|f\|_{l^X_\infty} = \max_{x \in X} |f(x)|. \qquad (B.2)$$

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose $\mathcal{X}$ is a compact subset of X. Let $\mathcal{N}(\epsilon, F, \ell) = \max_{X \in \mathcal{X}^\ell} \mathcal{N}(\epsilon, F, l^X_\infty)$ (the maximum exists by definition of the covering number and compactness of $\mathcal{X}$).
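To make definition 3 concrete, one can build a (not necessarily minimal) ε-cover greedily; its size is then an upper bound on the covering number $\mathcal{N}(\epsilon, A, d)$. A small sketch (the metric and point set are illustrative assumptions):

```python
def greedy_cover(points, d, eps):
    """Greedily build an eps-cover of `points` under the (pseudo)metric d.
    Its size upper-bounds the covering number N(eps, points, d)."""
    cover = []
    for a in points:
        if all(d(a, u) > eps for u in cover):  # a is not yet covered
            cover.append(a)
    return cover

d = lambda a, b: abs(a - b)        # a pseudometric on the real line
points = [0.0, 0.1, 0.2, 1.0, 1.05, 2.0]
cover = greedy_cover(points, d, eps=0.25)
# every point is within eps of some cover element, per definition 3
assert all(any(d(a, u) <= 0.25 for u in cover) for a in points)
print(len(cover))  # prints 3
```

Note that the greedy construction uses "less than or equal to" in the covered check, matching the convention adopted in the definition above.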
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose $x_1, \ldots, x_\ell$ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

$$P\Big\{x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma\Big\} \le \epsilon(\ell, k, \delta) = \frac{2}{\ell}\Big(k + \log\frac{\ell}{\delta}\Big),$$

where $k = \lceil \log \mathcal{N}(\gamma, F, 2\ell) \rceil$.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum $\min_{i \in [\ell]} f(x_i)$. We therefore now consider a technique that will enable us to ignore some of the smallest values in the set $\{f(x_i) : x_i \in X\}$ at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξᵢ in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.
Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with support supp(f) countable; that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

$$f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x)g(x).$$

The 1-norm on L(X) is defined by $\|f\|_1 = \sum_{x \in \mathrm{supp}(f)} f(x)$, and let $L_B(X) = \{f \in L(X) : \|f\|_1 \le B\}$. Now we define an embedding of X into the inner product space X × L(X) via $\tau: X \to X \times L(X)$, $\tau: x \mapsto (x, \delta_x)$, where $\delta_x \in L(X)$ is defined by

$$\delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases}$$
For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function $g_f \in L(X)$ by

$$g_f(y) = g^{X,\theta}_f(y) = \sum_{x \in X} d(x, f, \theta)\,\delta_x(y).$$
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that $g_f = g^{X,\theta}_f \in L_B(X)$ (i.e., $\sum_{x \in X} d(x, f, \theta) \le B$),

$$P\{x : f(x) < \theta - 2\gamma\} \le \frac{2}{\ell}\Big(k + \log\frac{\ell}{\delta}\Big), \qquad (B.3)$$

where $k = \big\lceil \log \mathcal{N}(\gamma/2, F, 2\ell) + \log \mathcal{N}(\gamma/2, L_B(X), 2\ell) \big\rceil$.
The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound $\log \mathcal{N}(\gamma/2, F, 2\ell)$ and a result of Shawe-Taylor and Cristianini (2000) to bound $\log \mathcal{N}(\gamma/2, L_B(X), 2\ell)$. We apply theorem 3, stratifying over possible values of B. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (# Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré. Section B. Calcul des Probabilités et Statistique. Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian.] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
Since nonzero slack variables ξᵢ are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function

$$f(x) = \operatorname{sgn}\big((w \cdot \Phi(x)) - \rho\big) \qquad (3.6)$$

will be positive for most examples xᵢ contained in the training set,¹ while the SV type regularization term ‖w‖ will still be small. The actual trade-off between these two goals is controlled by ν.
Using multipliers αᵢ, βᵢ ≥ 0, we introduce a Lagrangian

$$L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu\ell}\sum_i \xi_i - \rho - \sum_i \alpha_i\big((w \cdot \Phi(x_i)) - \rho + \xi_i\big) - \sum_i \beta_i \xi_i, \qquad (3.7)$$

and set the derivatives with respect to the primal variables w, ξ, ρ equal to zero, yielding

$$w = \sum_i \alpha_i \Phi(x_i), \qquad (3.8)$$

$$\alpha_i = \frac{1}{\nu\ell} - \beta_i \le \frac{1}{\nu\ell}, \qquad \sum_i \alpha_i = 1. \qquad (3.9)$$
In equation 3.8, all patterns $\{x_i : i \in [\ell],\ \alpha_i > 0\}$ are called support vectors. Together with equation 3.2, the SV expansion transforms the decision function, equation 3.6, into a kernel expansion:

$$f(x) = \operatorname{sgn}\Big(\sum_i \alpha_i k(x_i, x) - \rho\Big). \qquad (3.10)$$
Substituting equation 3.8 and equation 3.9 into L (see equation 3.7) and using equation 3.2, we obtain the dual problem:

$$\min_{\alpha}\ \frac{1}{2}\sum_{ij} \alpha_i\alpha_j k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu\ell}, \quad \sum_i \alpha_i = 1. \qquad (3.11)$$
One can show that at the optimum, the two inequality constraints, equation 3.5, become equalities if αᵢ and βᵢ are nonzero, that is, if 0 < αᵢ < 1/(νℓ). Therefore, we can recover ρ by exploiting that, for any such αᵢ, the corresponding pattern xᵢ satisfies

$$\rho = (w \cdot \Phi(x_i)) = \sum_j \alpha_j k(x_j, x_i). \qquad (3.12)$$
¹ We use the convention that sgn(z) equals 1 for z ≥ 0 and −1 otherwise.
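Given a solved dual vector α, equations 3.10 and 3.12 are all that is needed at evaluation time. A minimal sketch (the gaussian kernel width `c` and the toy coefficients are assumptions of the illustration, not a solved QP):

```python
import math

def gauss_kernel(x, y, c=1.0):
    """Gaussian kernel of equation 3.3 (width c assumed here)."""
    return math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / c)

def decision(x, train, alpha, rho, k=gauss_kernel):
    """Equation 3.10: f(x) = sgn(sum_i alpha_i k(x_i, x) - rho),
    with sgn(z) = 1 for z >= 0 and -1 otherwise (footnote 1)."""
    g = sum(a * k(xi, x) for a, xi in zip(alpha, train))
    return 1 if g - rho >= 0 else -1

train = [(0.0,), (0.2,), (1.0,)]
alpha = [0.4, 0.4, 0.2]   # illustrative feasible alphas: 0 <= alpha_i <= 1/(nu*l), sum = 1
# equation 3.12: recover rho from a pattern whose alpha_i lies strictly inside (0, 1/(nu*l))
rho = sum(a * gauss_kernel(xi, train[1]) for a, xi in zip(alpha, train))
print(decision(train[1], train, alpha, rho))   # 1: the defining pattern sits on the boundary
```

By construction, the pattern used in equation 3.12 evaluates to exactly ρ, so it lies on the decision boundary; points far from all training data fall on the negative side.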
Note that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity; that is, the second inequality constraint in equation 3.11 becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (see equation 3.4). It is still a feasible problem, since we have placed no restriction on the offset ρ, so it can become a large negative number in order to satisfy equation 3.5. If we had required ρ ≥ 0 from the start, we would have ended up with the constraint $\sum_i \alpha_i \ge 1$ instead of the corresponding equality constraint in equation 3.11, and the multipliers αᵢ could have diverged.
It is instructive to compare equation 3.11 to a Parzen windows estimator. To this end, suppose we use a kernel that can be normalized as a density in input space, such as the gaussian (see equation 3.3). If we use ν = 1, then the two constraints only allow the solution α₁ = ··· = α_ℓ = 1/ℓ. Thus, the kernel expansion in equation 3.10 reduces to a Parzen windows estimate of the underlying density. For ν < 1, the equality constraint in equation 3.11 still ensures that the decision function is a thresholded density; however, in that case, the density will be represented only by a subset of training examples (the SVs), namely, those that are important for the decision (see equation 3.10) to be taken. Section 5 will explain the precise meaning of ν.
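The ν = 1 case can be checked directly: with all αᵢ forced to 1/ℓ, the expansion of equation 3.10 is exactly a thresholded Parzen windows density estimate. A sketch (the threshold `rho` and kernel width are illustrative assumptions):

```python
import math

def parzen(x, train, c=1.0):
    """Parzen windows estimate with a gaussian kernel: the kernel
    expansion of equation 3.10 when nu = 1 forces alpha_i = 1/l."""
    return sum(math.exp(-(x - xi) ** 2 / c) for xi in train) / len(train)

train = [0.0, 0.1, 0.2, 1.0]
rho = 0.5                               # illustrative density threshold
f = lambda x: 1 if parzen(x, train) - rho >= 0 else -1
print(f(0.1), f(3.0))                   # 1 -1: dense region inside, far point outside
```

For ν < 1, the same thresholded density is represented by only the SVs, which is what makes the expansion sparse.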
To conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of Schölkopf, Burges, and Vapnik (1995), with hard boundaries, and Tax and Duin (1999), with "soft margins." Again, we try to put most of the data into a small ball by solving, for ν ∈ (0, 1),

$$\min_{R \in \mathbb{R},\, \xi \in \mathbb{R}^{\ell},\, c \in F}\ R^2 + \frac{1}{\nu\ell}\sum_i \xi_i$$
$$\text{subject to} \quad \|\Phi(x_i) - c\|^2 \le R^2 + \xi_i,\ \ \xi_i \ge 0 \ \text{ for } i \in [\ell]. \qquad (3.13)$$

This leads to the dual

$$\min_{\alpha}\ \sum_{ij} \alpha_i\alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i) \qquad (3.14)$$

$$\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu\ell}, \quad \sum_i \alpha_i = 1, \qquad (3.15)$$

and the solution

$$c = \sum_i \alpha_i \Phi(x_i), \qquad (3.16)$$

corresponding to a decision function of the form

$$f(x) = \operatorname{sgn}\Big(R^2 - \sum_{ij} \alpha_i\alpha_j k(x_i, x_j) + 2\sum_i \alpha_i k(x_i, x) - k(x, x)\Big). \qquad (3.17)$$
Similar to the above, R² is computed such that, for any xᵢ with 0 < αᵢ < 1/(νℓ), the argument of the sgn is zero.

For kernels k(x, y) that depend on only x − y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem, 3.14 and 3.15, turns out to be equivalent to equation 3.11. It can be shown that the same holds true for the decision function; hence, the two algorithms coincide in that case. This is geometrically plausible: for constant k(x, x), all mapped patterns lie on a sphere in feature space. Therefore, finding the smallest sphere containing the points really amounts to finding the smallest segment of the sphere that the data live on. The segment, however, can be found in a straightforward way by simply intersecting the data sphere with a hyperplane: the hyperplane with maximum margin of separation to the origin will cut off the smallest segment.
4 Optimization
Section 3 formulated quadratic programs (QPs) for computing regions that capture a certain fraction of the data. These constrained optimization problems can be solved using an off-the-shelf QP package to compute the solution. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In this section, we describe an algorithm that takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (sequential minimal optimization), an SV training algorithm originally proposed for classification (Platt, 1999) and subsequently adapted to regression estimation (Smola & Schölkopf, in press).

The strategy of SMO is to break up the constrained minimization of equation 3.11 into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. We therefore resort to optimizing over pairs of variables.
4.1 Elementary Optimization Step. For instance, consider optimizing over α₁ and α₂ with all other variables fixed. Using the shorthand $K_{ij} = k(x_i, x_j)$, equation 3.11 then reduces to

$$\min_{\alpha_1, \alpha_2}\ \frac{1}{2}\sum_{i,j=1}^{2} \alpha_i\alpha_j K_{ij} + \sum_{i=1}^{2} \alpha_i C_i + C, \qquad (4.1)$$
with $C_i = \sum_{j=3}^{\ell} \alpha_j K_{ij}$ and $C = \sum_{i,j=3}^{\ell} \alpha_i\alpha_j K_{ij}$, subject to

$$0 \le \alpha_1, \alpha_2 \le \frac{1}{\nu\ell}, \qquad \sum_{i=1}^{2} \alpha_i = \Delta, \qquad (4.2)$$

where $\Delta = 1 - \sum_{i=3}^{\ell} \alpha_i$.

We discard C, which is independent of α₁ and α₂, and eliminate α₁ to obtain

$$\min_{\alpha_2}\ \frac{1}{2}(\Delta - \alpha_2)^2 K_{11} + (\Delta - \alpha_2)\alpha_2 K_{12} + \frac{1}{2}\alpha_2^2 K_{22} + (\Delta - \alpha_2)C_1 + \alpha_2 C_2, \qquad (4.3)$$

with the derivative

$$-(\Delta - \alpha_2)K_{11} + (\Delta - 2\alpha_2)K_{12} + \alpha_2 K_{22} - C_1 + C_2. \qquad (4.4)$$

Setting this to zero and solving for α₂, we get

$$\alpha_2 = \frac{\Delta(K_{11} - K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2K_{12}}. \qquad (4.5)$$

Once α₂ is found, α₁ can be recovered from α₁ = Δ − α₂. If the new point (α₁, α₂) is outside of [0, 1/(νℓ)], the constrained optimum is found by projecting α₂ from equation 4.5 into the region allowed by the constraints and then recomputing α₁.
The offset ρ is recomputed after every such step.

Additional insight can be obtained by rewriting equation 4.5 in terms of the outputs of the kernel expansion on the examples x₁ and x₂ before the optimization step. Let α₁*, α₂* denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. equation 3.10) read

$$O_i = K_{1i}\alpha_1^* + K_{2i}\alpha_2^* + C_i. \qquad (4.6)$$

Using the latter to eliminate the $C_i$, we end up with an update equation for α₂ which does not explicitly depend on α₁*:

$$\alpha_2 = \alpha_2^* + \frac{O_1 - O_2}{K_{11} + K_{22} - 2K_{12}}, \qquad (4.7)$$

which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of ν-constraint satisfaction.

Clearly, the same elementary optimization step can be applied to any pair of two variables, not just α₁ and α₂. We next briefly describe how to do the overall optimization.
4.2 Initialization of the Algorithm. We start by setting a fraction ν of all αᵢ, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that $\sum_i \alpha_i = 1$. Moreover, we set the initial ρ to max{Oᵢ : i ∈ [ℓ], αᵢ > 0}.
4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SV_nb for the indices of variables that are not at bound, that is, SV_nb = {i : i ∈ [ℓ], 0 < αᵢ < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position.

We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that (Oᵢ − ρ) · αᵢ > 0 or (ρ − Oᵢ) · (1/(νℓ) − αᵢ) > 0. Once we have found one, say αᵢ, we pick αⱼ according to

$$j = \arg\max_{n \in \mathrm{SV_{nb}}} |O_i - O_n|. \qquad (4.8)$$

We repeat that step, but the scan is performed only over SV_nb.

In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SV_nb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates.

In unusual circumstances, the choice heuristic, equation 4.8, cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999) and have been found to work well in the experiments we report below.

In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
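Putting sections 4.1 through 4.3 together gives a compact, unoptimized trainer. The sketch below is only illustrative: it omits the fallback heuristics and convergence safeguards mentioned above, pairs each KKT violator with the variable maximizing |Oᵢ − Oₙ| (equation 4.8), and assumes a gaussian kernel on one-dimensional toy data:

```python
import math

def train(X, nu, c=1.0, tol=1e-6, max_iter=1000):
    """Toy SMO loop for the dual of equation 3.11 (no fallback heuristics)."""
    l = len(X)
    C = 1.0 / (nu * l)
    K = [[math.exp(-(xi - xj) ** 2 / c) for xj in X] for xi in X]
    n = int(nu * l)                             # initialization of section 4.2
    alpha = [C] * n + [0.0] * (l - n)
    if n < l:
        alpha[n] = 1.0 - n * C
    rho = 0.0
    for _ in range(max_iter):
        O = [sum(alpha[m] * K[m][i] for m in range(l)) for i in range(l)]
        nb = [i for i in range(l) if tol < alpha[i] < C - tol]          # SV_nb
        rho = (sum(O[i] for i in nb) / len(nb) if nb
               else max(O[i] for i in range(l) if alpha[i] > tol))      # cf. eq. 3.12
        viol = [i for i in range(l)                                      # KKT scan
                if (O[i] - rho) * alpha[i] > tol
                or (rho - O[i]) * (C - alpha[i]) > tol]
        if not viol:
            break
        i = viol[0]
        cand = nb or [m for m in range(l) if m != i]
        j = max(cand, key=lambda m: abs(O[i] - O[m]))                    # eq. 4.8
        eta = K[i][i] + K[j][j] - 2 * K[i][j]
        if j == i or eta <= 0:
            break
        delta = alpha[i] + alpha[j]                                      # eq. 4.7 + clipping
        aj = alpha[j] + (O[i] - O[j]) / eta
        aj = min(max(aj, max(0.0, delta - C)), min(C, delta))
        alpha[i], alpha[j] = delta - aj, aj
    return alpha, rho
```

Every iteration keeps α feasible (box and equality constraints) and does not increase the dual objective, so on small problems the loop settles quickly; proposition 3 below describes what the resulting fractions of SVs and outliers look like.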
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset ρ at the end.

In the next section, it will be argued that subtracting something from ρ is actually advisable also from a statistical point of view.

² This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2) and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4) and, finally, we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.
In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

$$x_i = \Phi(x_i). \qquad (5.1)$$

Definition 1. A data set

$$x_1, \ldots, x_\ell \qquad (5.2)$$

is called separable if there exists some w ∈ F such that $(w \cdot x_i) > 0$ for i ∈ [ℓ].
If we use a gaussian kernel (see equation 3.3), then any data set $x_1, \ldots, x_\ell$ is separable after it is mapped into feature space. To see this, note that $k(x_i, x_j) > 0$ for all i, j; thus, all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since $k(x_i, x_i) = 1$ for all i, they all have unit length. Hence, they are separable from the origin.
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

$$\min_{w \in F}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad (w \cdot x_i) \ge \rho,\ \ i \in [\ell]. \qquad (5.3)$$
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set

{(x_1, +1), …, (x_ℓ, +1), (−x_1, −1), …, (−x_ℓ, −1)}.    (5.4)
Estimating the Support of a High-Dimensional Distribution 1453
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set

{(x_1, y_1), …, (x_ℓ, y_ℓ)}  (y_i ∈ {±1} for i ∈ [ℓ]),    (5.5)

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set

{y_1 x_1, …, y_ℓ x_ℓ}.    (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x), which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
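Proposition 3 can be illustrated empirically. The sketch below is an illustration under stated assumptions, not part of the original experiments: it uses scikit-learn's OneClassSVM, a later off-the-shelf implementation of this single-class ν-formulation, and checks (up to a small numerical tolerance) that ν upper-bounds the fraction of outliers and lower-bounds the fraction of SVs.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))  # arbitrary 2D sample

nu = 0.2
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)

frac_sv = len(clf.support_) / len(X)          # fraction of support vectors
frac_outlier = (clf.predict(X) == -1).mean()  # fraction outside the region

# Proposition 3: nu upper-bounds the outlier fraction and lower-bounds
# the SV fraction; the 0.02 slack allows for solver tolerance at finite l.
assert frac_outlier <= nu + 0.02
assert frac_sv >= nu - 0.02
```

At large sample sizes, both fractions converge to ν, in line with part iii.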
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution
lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ R. For x ∈ X, let d(x, f, θ) = max{0, θ − f(x)}. Similarly, for a training sequence X = (x_1, …, x_ℓ), we define

D(X, f, θ) = Σ_{x ∈ X} d(x, f, θ).
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ X^ℓ generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution f_w given explicitly by equation 3.10. Let R_{w,ρ} = {x : f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

P{x′ : x′ ∉ R_{w,ρ−γ}} ≤ (2/ℓ) (k + log(ℓ²/(2δ))),    (5.7)

where

k = (c₁ log(c₂ γ̂² ℓ))/γ̂²  +  (2D/γ̂) log( e ((2ℓ − 1)γ̂/(2D) + 1) )  +  2,    (5.8)

c₁ = 16c², c₂ = ln(2)/(4c²), c = 103, γ̂ = γ/‖w‖, D = D(X, f_{w,0}, ρ) = D(X, f_{w,ρ}, 0), and ρ is given by equation 3.12.
The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}.
The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂; that is, it increases with w. This justifies measuring the complexity of the estimated region by the size of w, and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a Gaussian kernel (see equation 3.3) that has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which
3 Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
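The footnote's heuristic can be sketched as follows. This is an illustration, not the cited authors' code: it relies on scikit-learn's OneClassSVM, whose RBF kernel exp(−γ‖x − y‖²) relates to the paper's width c via γ = 1/c, and the data and width grid are arbitrary choices. We scan increasing widths and stop once the SV count, having dropped below the all-SV regime, no longer decreases.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))

def n_svs(c, X, nu=0.1):
    # Width c as in the paper's kernel k(x, y) = exp(-||x - y||^2 / c);
    # this corresponds to scikit-learn's gamma = 1 / c.
    return len(OneClassSVM(kernel="rbf", gamma=1.0 / c, nu=nu).fit(X).support_)

widths = 0.01 * 2.0 ** np.arange(16)  # geometric grid of candidate widths
counts = [n_svs(c, X) for c in widths]

# Heuristic: for tiny c essentially every point is an SV ("memorization");
# increase c and pick the first width at which the count stops decreasing.
chosen, chosen_count = widths[-1], counts[-1]
for c, a, b in zip(widths, counts, counts[1:]):
    if a < len(X) and a <= b:
        chosen, chosen_count = c, a
        break
print(chosen, chosen_count)
```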
ν, width c        0.5, 0.5    0.5, 0.5    0.1, 0.5    0.5, 0.1
frac. SVs/OLs     0.54, 0.43  0.59, 0.47  0.24, 0.03  0.65, 0.38
margin ρ/‖w‖      0.84        0.70        0.62        0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3). In the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fraction of outliers in the training set appears slightly larger than it should be.
6 9 2 8 1 8 8 6 5 3
2 3 8 7 0 3 0 8 2 7
Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3 to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
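A rough version of this scaling experiment can be reproduced with off-the-shelf tools. The sketch below is illustrative only: scikit-learn's SMO-type solver stands in for the paper's implementation, the data are synthetic, and the measured exponents are machine- and tolerance-dependent. It fits one-class SVMs on growing samples and reads off a scaling exponent as the slope of log training time versus log ℓ.

```python
import time
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
sizes = [250, 500, 1000, 2000]

def fit_time(n, nu):
    X = rng.normal(size=(n, 10))
    t0 = time.perf_counter()
    OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
    return time.perf_counter() - t0

slopes = {}
for nu in (0.05, 0.5):  # small nu (outlier detection) vs. large nu
    times = [fit_time(n, nu) for n in sizes]
    # slope of log(time) vs. log(n) estimates the scaling exponent
    slopes[nu] = np.polyfit(np.log(sizes), np.log(times), 1)[0]
    print(f"nu={nu}: scaling exponent ~ {slopes[nu]:.2f}")
```

Consistent with the paper's observation, small ν typically yields a smaller exponent than large ν, though timings this coarse are noisy.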
In addition to the experiments reported above the algorithm has since
9: −513  1: −507  0: −458  1: −377  7: −282  2: −216  3: −200  9: −186  5: −179  0: −162
3: −153  6: −143  6: −128  0: −123  7: −117  5: −93  0: −78  7: −58  6: −52  3: −48
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set, Size ℓ = 2007.

 ν     Fraction of OLs   Fraction of SVs   Training Time (CPU sec)
 1%         0.0%             10.0%                  36
 2%         0.0%             10.0%                  39
 3%         0.1%             10.0%                  31
 4%         0.6%             10.1%                  40
 5%         1.4%             10.6%                  36
 6%         1.8%             11.2%                  33
 7%         2.6%             11.5%                  42
 8%         4.1%             12.0%                  53
 9%         5.4%             12.9%                  76
10%         6.2%             13.7%                  65
20%        16.9%             22.6%                 193
30%        27.5%             31.8%                 269
40%        37.1%             41.7%                 685
50%        47.4%             51.2%                1284
60%        58.2%             61.0%                1150
70%        68.3%             70.7%                1512
80%        78.5%             80.5%                2206
90%        89.4%             90.1%                2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds, on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is shown in boldface.
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz; training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
Σ_{i,j} α_i α_j k(x_i, x_j) = Σ_{i,j} α_i α_j (k(x_i, ·) · δ_{x_j}(·))
                          = Σ_{i,j} α_i α_j (k(x_i, ·) · (P*P k)(x_j, ·))
                          = Σ_{i,j} α_i α_j ((P k)(x_i, ·) · (P k)(x_j, ·))
                          = ((P Σ_i α_i k)(x_i, ·) · (P Σ_j α_j k)(x_j, ·))
                          = ‖P f‖²,
using f(x) = Σ_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^{−‖Pf‖²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe; a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a Gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a Gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined, and the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply

Ĉ_ℓ = ∪_{i=1}^{ℓ} B(x_i, ε_ℓ),    (A.1)

where B(x, ε) is the l₂(X) ball of radius ε centered at x, and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of A.1 with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), Ĉ_plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ_plug-in, they analyzed Ĉ_plug-in^b = {x : p̂_ℓ(x) > b_ℓ}, where (b_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to b, but this connection cannot be readily exploited by this type of estimator.
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) or the center of C(1). (See also Korostelev & Tsybakov, 1993, chap. 8.)
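The nonparametric estimator A.1 is straightforward to implement. The sketch below is illustrative (data, radius ε, and dimension are arbitrary choices): membership in Ĉ_ℓ = ∪_i B(x_i, ε) is equivalent to the distance to the nearest training point being at most ε.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(200, 2))  # training sample from the support

def in_support(x, X, eps):
    # x lies in the estimate C_l = union of balls B(x_i, eps)
    # iff its distance to the nearest training point is <= eps
    d = np.sqrt(((X - x) ** 2).sum(1)).min()
    return d <= eps

eps = 0.25
assert in_support(X[0], X, eps)                      # a training point itself
assert not in_support(np.array([5.0, 5.0]), X, eps)  # far outside the support
```

Consistency requires the radius sequence (ε_ℓ)_ℓ to decrease at a suitable rate as ℓ grows; here ε is simply fixed for illustration.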
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).

Define the excess mass over C at level α as

E_C(α) = sup {H_α(C) : C ∈ C},

where H_α(·) = (P − αλ)(·), and again λ denotes Lebesgue measure. Any set C_C(α) ∈ C such that

E_C(α) = H_α(C_C(α))

is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E_{C,ℓ}(α) and C_{C,ℓ}(α). In other words, his estimator is

C_{C,ℓ}(α) = arg max {(P_ℓ − αλ)(C) : C ∈ C},
where the max is not necessarily unique. Now, while C_{C,ℓ}(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, C_{C,ℓ}(α) is more closely related to the determination of density contour clusters at level α:

c_p(α) = {x : p(x) ≥ α}.
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A_ε^(ℓ) with respect to p by

A_ε^(ℓ) = {(x_1, …, x_ℓ) ∈ S^ℓ : |−(1/ℓ) log p(x_1, …, x_ℓ) − h(X)| ≤ ε},

where p(x_1, …, x_ℓ) = Π_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

vol A_ε^(ℓ) ≐ vol C_ℓ(1 − δ) ≐ 2^{ℓh}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore, the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for, if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument, showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x′_o = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ′_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α′ = α is still feasible, as is the primal solution (w′, ξ′, ρ′). Here, we use ξ′_i = (1 + δ · α_o)ξ_i for i ≠ o, w′ = (1 + δ · α_o)w, and ρ′ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α′_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ norm over a finite sample X = (x_1, …, x_ℓ) for the pseudonorm on F:

‖f‖_{l∞(X)} = max_{x ∈ X} |f(x)|.    (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose X is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ X^ℓ} N(ε, F, l∞(X)) (the maximum exists by definition of the covering number and compactness of X).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P{x : f(x) < min_{i ∈ [ℓ]} f(x_i) − 2γ} ≤ ε(ℓ, k, δ) = (2/ℓ)(k + log(ℓ/δ)),

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness that a single, perhaps unusual, point may significantly reduce the minimum min_{i ∈ [ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.
Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with countable support supp(f); that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

f · g = Σ_{x∈supp(f)} f(x)g(x).

The 1-norm on L(X) is defined by ‖f‖₁ = Σ_{x∈supp(f)} f(x), and we let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

δ_x(y) = 1 if y = x, and 0 otherwise.
For a function f ∈ F, a set of training examples X, and θ ∈ ℝ, we define the function g_f ∈ L(X) by

g_f(y) = g_{X,θ,f}(y) = Σ_{x∈X} d(x, f, θ) δ_x(y).
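To make definition 4 and the slack quantity d(x, f, θ) concrete, here is a small sketch (our own illustration, not code from the paper) that represents countable-support functions in L(X) as Python dicts keyed by points, and builds g_f from the slacks:

```python
# Sketch: functions in L(X) as dicts {point: nonzero value}.

def inner(f, g):
    """Inner product f . g = sum over supp(f) of f(x) g(x)."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    """1-norm: sum of the (nonnegative) values of f."""
    return sum(f.values())

def delta(x):
    """delta_x: the indicator function of the single point x."""
    return {x: 1.0}

def d(x, f, theta):
    """Slack d(x, f, theta) = max(0, theta - f(x)), as in definition 2."""
    return max(0.0, theta - f(x))

def g_f(X, f, theta):
    """g_f = sum over training points of d(x, f, theta) * delta_x;
    points with zero slack drop out of the support."""
    g = {}
    for x in X:
        s = d(x, f, theta)
        if s > 0:
            g[x] = g.get(x, 0.0) + s
    return g

g = g_f((0.2, 0.5, 1.0), lambda x: x, 0.6)   # slacks 0.4, 0.1, 0
assert abs(one_norm(g) - 0.5) < 1e-12         # this g lies in L_B(X) for B >= 0.5
assert abs(inner(g, delta(0.2)) - 0.4) < 1e-12
```

The 1-norm of g_f is exactly the total slack Σ_{x∈X} d(x, f, θ) that theorem 3 constrains by B.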
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and for any f ∈ F and any θ such that g_f = g_{X,θ,f} ∈ L_B(X) (i.e., Σ_{x∈X} d(x, f, θ) ≤ B),

P{x : f(x) < θ − 2γ} ≤ (2/ℓ)(k + log(ℓ/δ)),   (B.3)

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.
(The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability.) The proof is given in Scholkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Scholkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Scholkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 3799-1 and Sm 621-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnorr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré. Section B. Calcul des Probabilités et Statistique. Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functional of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Scholkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Muller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beitrage zur Statistik, Universitat Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Scholkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Scholkopf, B., Smola, A., & Muller, K.-R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Scholkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Scholkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. Submitted to IEEE Transactions on Information Theory.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Scholkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Scholkopf, B., & Muller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Scholkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Scholkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D. Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Scholkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Scholkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
Note that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity; that is, the second inequality constraint in equation 3.11 becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (see equation 3.4). It is still a feasible problem, since we have placed no restriction on the offset ρ, so it can become a large negative number in order to satisfy equation 3.5. If we had required ρ ≥ 0 from the start, we would have ended up with the constraint Σᵢ αᵢ ≥ 1 instead of the corresponding equality constraint in equation 3.11, and the multipliers αᵢ could have diverged.
It is instructive to compare equation 3.11 to a Parzen windows estimator. To this end, suppose we use a kernel that can be normalized as a density in input space, such as the gaussian (see equation 3.3). If we use ν = 1, then the two constraints only allow the solution α₁ = ··· = α_ℓ = 1/ℓ. Thus the kernel expansion in equation 3.10 reduces to a Parzen windows estimate of the underlying density. For ν < 1, the equality constraint in equation 3.11 still ensures that the decision function is a thresholded density; however, in that case, the density will be represented only by a subset of training examples (the SVs): those that are important for the decision (see equation 3.10) to be taken. Section 5 will explain the precise meaning of ν.
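For ν = 1 the claimed equivalence can be checked directly. A small numeric sketch (ours, with a toy sample, not the paper's code): with all multipliers forced to 1/ℓ, the kernel expansion is exactly the Parzen windows average.

```python
# Sketch: for nu = 1 the constraints 0 <= alpha_i <= 1/(nu*l) and
# sum(alpha) = 1 force alpha_i = 1/l, so the kernel expansion of
# equation 3.10 coincides with a Parzen windows estimate.
import numpy as np

def gaussian_kernel(x, y, c=0.5):
    return np.exp(-np.sum((x - y) ** 2) / c)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # toy training sample
l = len(X)
alpha = np.full(l, 1.0 / l)           # the only feasible point for nu = 1

def expansion(x):
    return sum(a * gaussian_kernel(xi, x) for a, xi in zip(alpha, X))

def parzen(x):
    return np.mean([gaussian_kernel(xi, x) for xi in X])

x_test = np.array([0.3, -0.1])
assert np.isclose(expansion(x_test), parzen(x_test))
```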
To conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of Scholkopf, Burges, and Vapnik (1995), with hard boundaries, and Tax and Duin (1999), with "soft margins." Again, we try to put most of the data into a small ball by solving, for ν ∈ (0, 1),
min_{R∈ℝ, ξ∈ℝ^ℓ, c∈F}  R² + (1/(νℓ)) Σᵢ ξᵢ

subject to  ‖Φ(xᵢ) − c‖² ≤ R² + ξᵢ,  ξᵢ ≥ 0  for i ∈ [ℓ].   (3.13)
This leads to the dual
min_α  Σᵢⱼ αᵢαⱼ k(xᵢ, xⱼ) − Σᵢ αᵢ k(xᵢ, xᵢ)   (3.14)

subject to  0 ≤ αᵢ ≤ 1/(νℓ),  Σᵢ αᵢ = 1,   (3.15)
and the solution
c = Σᵢ αᵢ Φ(xᵢ),   (3.16)
corresponding to a decision function of the form
f(x) = sgn(R² − Σᵢⱼ αᵢαⱼ k(xᵢ, xⱼ) + 2 Σᵢ αᵢ k(xᵢ, x) − k(x, x)).   (3.17)
Similar to the above, R² is computed such that for any xᵢ with 0 < αᵢ < 1/(νℓ), the argument of the sgn is zero.
For kernels k(x, y) that depend on only x − y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem, 3.14 and 3.15, turns out to be equivalent to equation 3.11. It can be shown that the same holds true for the decision function; hence the two algorithms coincide in that case. This is geometrically plausible: for constant k(x, x), all mapped patterns lie on a sphere in feature space. Therefore, finding the smallest sphere (containing the points) really amounts to finding the smallest segment of the sphere that the data live on. The segment, however, can be found in a straightforward way by simply intersecting the data sphere with a hyperplane: the hyperplane with maximum margin of separation to the origin will cut off the smallest segment.
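The equivalence is easy to verify numerically. The following sketch (hypothetical, with made-up data and multipliers, not from the paper) checks that for an RBF kernel, where k(x, x) = 1, the argument of the sgn in equation 3.17 differs from twice the hyperplane kernel expansion by the same constant at every test point, so the two decision boundaries coincide up to the choice of threshold:

```python
# Sketch: with k(x, x) constant, ball (eq. 3.17) and hyperplane (eq. 3.10)
# decision arguments differ only by a constant offset.
import numpy as np

def rbf(a, b, c=0.5):
    return np.exp(-np.linalg.norm(a - b) ** 2 / c)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
alpha = rng.random(10)
alpha /= alpha.sum()                      # enforce sum(alpha) = 1

K = np.array([[rbf(a, b) for b in X] for a in X])
const = alpha @ K @ alpha                 # sum_ij alpha_i alpha_j K_ij

def ball_arg(x, R2=1.0):                  # argument of sgn in equation 3.17
    kx = np.array([rbf(xi, x) for xi in X])
    return R2 - const + 2 * alpha @ kx - rbf(x, x)

def plane_sum(x):                         # kernel expansion of eq. 3.10, no offset
    return np.array([rbf(xi, x) for xi in X]) @ alpha

pts = rng.normal(size=(5, 2))
diffs = [ball_arg(x) - 2 * plane_sum(x) for x in pts]
assert np.allclose(diffs, diffs[0])       # same constant at every test point
```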
4 Optimization
Section 3 formulated quadratic programs (QPs) for computing regions that capture a certain fraction of the data. These constrained optimization problems can be solved using an off-the-shelf QP package to compute the solution. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In this section, we describe an algorithm that takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (sequential minimal optimization), an SV training algorithm originally proposed for classification (Platt, 1999) and subsequently adapted to regression estimation (Smola & Scholkopf, in press).
The strategy of SMO is to break up the constrained minimization of equation 3.11 into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. We therefore resort to optimizing over pairs of variables.
4.1 Elementary Optimization Step. For instance, consider optimizing over α₁ and α₂ with all other variables fixed. Using the shorthand Kᵢⱼ = k(xᵢ, xⱼ), equation 3.11 then reduces to

min_{α₁,α₂}  (1/2) Σ_{i,j=1}^2 αᵢαⱼ Kᵢⱼ + Σ_{i=1}^2 αᵢCᵢ + C,   (4.1)

with Cᵢ = Σ_{j=3}^ℓ αⱼKᵢⱼ and C = Σ_{i,j=3}^ℓ αᵢαⱼKᵢⱼ, subject to

0 ≤ α₁, α₂ ≤ 1/(νℓ),  Σ_{i=1}^2 αᵢ = Δ,   (4.2)

where Δ = 1 − Σ_{i=3}^ℓ αᵢ.

We discard C, which is independent of α₁ and α₂, and eliminate α₁ to obtain
min_{α₂}  (1/2)(Δ − α₂)² K₁₁ + (Δ − α₂) α₂ K₁₂ + (1/2) α₂² K₂₂ + (Δ − α₂) C₁ + α₂ C₂,   (4.3)
with the derivative

−(Δ − α₂) K₁₁ + (Δ − 2α₂) K₁₂ + α₂ K₂₂ − C₁ + C₂.   (4.4)
Setting this to zero and solving for α₂, we get

α₂ = (Δ(K₁₁ − K₁₂) + C₁ − C₂) / (K₁₁ + K₂₂ − 2K₁₂).   (4.5)
Once α₂ is found, α₁ can be recovered from α₁ = Δ − α₂. If the new point (α₁, α₂) is outside of [0, 1/(νℓ)], the constrained optimum is found by projecting α₂ from equation 4.5 into the region allowed by the constraints and then recomputing α₁.
The offset ρ is recomputed after every such step.

Additional insight can be obtained by rewriting equation 4.5 in terms of the outputs of the kernel expansion on the examples x₁ and x₂ before the optimization step. Let α₁*, α₂* denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. equation 3.10) read

Oᵢ = K₁ᵢ α₁* + K₂ᵢ α₂* + Cᵢ.   (4.6)

Using the latter to eliminate the Cᵢ, we end up with an update equation for α₂ which does not explicitly depend on α₁*,

α₂ = α₂* + (O₁ − O₂) / (K₁₁ + K₂₂ − 2K₁₂),   (4.7)
which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of ν-constraint satisfaction.
Clearly, the same elementary optimization step can be applied to any pair of two variables, not just α₁ and α₂. We next briefly describe how to do the overall optimization.
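The elementary step of equations 4.2 to 4.7 can be sketched as follows (a minimal illustration with variable names of our choosing, not the authors' implementation):

```python
# Sketch of the elementary SMO step: analytic pair update (equation 4.7)
# followed by clipping into the box [0, 1/(nu*l)] (equation 4.2).
import numpy as np

def smo_pair_step(K, alpha, i, j, nu):
    """Update alpha[i], alpha[j] in place for the dual of equation 3.11."""
    l = len(alpha)
    hi = 1.0 / (nu * l)                       # upper bound on each multiplier
    delta = alpha[i] + alpha[j]               # preserved by the equality constraint
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # second derivative along the step
    if eta <= 0:
        return                                # degenerate direction; skip the pair
    # outputs O_n of the kernel expansion (cf. equation 4.6)
    O_i = alpha @ K[:, i]
    O_j = alpha @ K[:, j]
    a_j = alpha[j] + (O_i - O_j) / eta        # unconstrained optimum, equation 4.7
    # project so that both a_j and alpha[i] = delta - a_j stay in [0, hi]
    a_j = np.clip(a_j, max(0.0, delta - hi), min(hi, delta))
    alpha[j] = a_j
    alpha[i] = delta - a_j

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 0.5)  # gaussian kernel matrix
nu, l = 0.5, len(X)
alpha = np.full(l, 1.0 / l)                   # feasible start: sum(alpha) = 1
obj0 = 0.5 * alpha @ K @ alpha
smo_pair_step(K, alpha, 0, 1, nu)
assert np.isclose(alpha.sum(), 1.0)           # equality constraint preserved
assert 0.5 * alpha @ K @ alpha <= obj0 + 1e-12  # objective does not increase
```

Since the step moves to the constrained optimum along a feasible segment that contains the starting point, the dual objective can only decrease or stay the same.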
4.2 Initialization of the Algorithm. We start by setting a fraction ν of all αᵢ, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that Σᵢ αᵢ = 1. Moreover, we set the initial ρ to max{Oᵢ : i ∈ [ℓ], αᵢ > 0}.
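A sketch of this initialization (helper name assumed by us, not the paper's code):

```python
# Sketch: put floor(nu*l) multipliers at the upper bound 1/(nu*l) and give
# one randomly chosen example the fractional remainder, so sum(alpha) = 1.
import numpy as np

def init_alpha(l, nu, rng):
    hi = 1.0 / (nu * l)
    alpha = np.zeros(l)
    idx = rng.permutation(l)
    n_full = int(nu * l)                  # multipliers set to the bound
    alpha[idx[:n_full]] = hi
    rest = 1.0 - n_full * hi              # lies in (0, hi) when nu*l is not integral
    if rest > 0:
        alpha[idx[n_full]] = rest
    return alpha

rng = np.random.default_rng(0)
a = init_alpha(10, 0.25, rng)             # nu*l = 2.5, not an integer
assert np.isclose(a.sum(), 1.0)
assert ((a >= 0) & (a <= 1.0 / (0.25 * 10) + 1e-12)).all()
```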
4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SVnb for the indices of variables that are not at bound, that is, SVnb = {i : i ∈ [ℓ], 0 < αᵢ < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position.
We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that (Oᵢ − ρ) · αᵢ > 0 or (ρ − Oᵢ) · (1/(νℓ) − αᵢ) > 0. Once we have found one, say αᵢ, we pick αⱼ according to

j = arg max_{n∈SVnb} |Oᵢ − Oₙ|.   (4.8)
We repeat that step, but the scan is performed only over SVnb.

In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SVnb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates.
In unusual circumstances, the choice heuristic, equation 4.8, cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999), and have been found to work well in the experiments we report below.
In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset ρ at the end.
In the next section, it will be argued that subtracting something from ρ is actually advisable also from a statistical point of view.
² This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2), and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4), and finally we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.
In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

xᵢ = Φ(xᵢ).   (5.1)
Definition 1. A data set

x₁, …, x_ℓ   (5.2)

is called separable if there exists some w ∈ F such that (w · xᵢ) > 0 for i ∈ [ℓ].
If we use a gaussian kernel (see equation 3.3), then any data set x₁, …, x_ℓ is separable after it is mapped into feature space. To see this, note that k(xᵢ, xⱼ) > 0 for all i, j; thus all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(xᵢ, xᵢ) = 1 for all i, they all have unit length. Hence they are separable from the origin.
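This argument is easy to check numerically; a quick sketch (ours, not the paper's):

```python
# Sketch: for a gaussian kernel, all pairwise inner products in feature
# space are positive (same orthant) and every mapped point has unit length.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
c = 0.5 * X.shape[1]
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / c)                 # K[i, j] = <Phi(x_i), Phi(x_j)>

assert (K > 0).all()                # all inner products positive: same orthant
assert np.allclose(np.diag(K), 1)   # unit length: k(x, x) = 1
```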
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

min_{w∈F}  (1/2)‖w‖²  subject to  (w · xᵢ) ≥ ρ,  i ∈ [ℓ].   (5.3)
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set

{(x₁, 1), …, (x_ℓ, 1), (−x₁, −1), …, (−x_ℓ, −1)}.   (5.4)
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set

{(x₁, y₁), …, (x_ℓ, y_ℓ)}  (yᵢ ∈ {±1} for i ∈ [ℓ]),   (5.5)

aligned such that (w · xᵢ) is positive for yᵢ = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set

{y₁x₁, …, y_ℓx_ℓ}.   (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Scholkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x) which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
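The first two statements of proposition 3 follow from the dual constraints alone: with Σᵢ αᵢ = 1 and 0 ≤ αᵢ ≤ 1/(νℓ), at most νℓ multipliers can sit at the upper bound (the outliers), and at least νℓ must be nonzero (the SVs). The following sketch (our own illustration, reusing the analytic pair step of section 4, not the paper's experiments) runs random-pair coordinate descent on the dual and checks the two fractions:

```python
# Sketch: random-pair coordinate descent on the dual of equation 3.11,
# then check the fraction bounds of proposition 3 (i) and (ii).
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(45, 2)), rng.normal(3.0, 1.0, size=(5, 2))])
l, nu = len(X), 0.2
hi = 1.0 / (nu * l)
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq)

alpha = np.full(l, 1.0 / l)                   # feasible start
for _ in range(5000):                         # random-pair coordinate descent
    i, j = rng.choice(l, size=2, replace=False)
    delta = alpha[i] + alpha[j]
    eta = K[i, i] + K[j, j] - 2 * K[i, j]
    if eta <= 0:
        continue
    O = alpha @ K                             # current outputs
    a_j = alpha[j] + (O[i] - O[j]) / eta      # analytic step, equation 4.7
    a_j = np.clip(a_j, max(0.0, delta - hi), min(hi, delta))
    alpha[j], alpha[i] = a_j, delta - a_j

frac_outliers = (alpha >= hi - 1e-8).mean()   # bounded SVs: the outliers
frac_svs = (alpha > 1e-8).mean()
assert frac_outliers <= nu + 1e-8             # proposition 3 (i)
assert frac_svs >= nu - 1e-8                  # proposition 3 (ii)
```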
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ ℝ. For x ∈ X, let d(x, f, θ) = max{0, θ − f(x)}. Similarly, for a training sequence X = (x₁, …, x_ℓ), we define

D(X, f, θ) = Σ_{x∈X} d(x, f, θ).
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ X^ℓ, generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution f_w given explicitly by equation 3.10. Let R_{w,ρ} = {x : f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

P{x′ : x′ ∉ R_{w,ρ−γ}} ≤ (2/ℓ)(k + log(ℓ²/(2δ))),   (5.7)
where

k = (c₁ log(c₂ γ̂² ℓ))/γ̂² + (2D/γ̂) log(e((2ℓ − 1)γ̂/(2D) + 1)) + 2,   (5.8)

c₁ = 16c², c₂ = ln(2)/(4c²), c = 103, γ̂ = γ/‖w‖, D = D(X, f_{w,0}, ρ) = D(X, f_{w,ρ}, 0), and ρ is given by equation 3.12.
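Plugging numbers into equation 5.8 shows how the bound behaves. The sketch below (ours; log taken to base 2, as the text specifies) just evaluates k and illustrates that, all else equal, more total slack D weakens the bound; recall the authors themselves note the bound is loose in practice.

```python
# Sketch: evaluate k from equation 5.8 with the stated constants.
import math

def kappa(l, gamma_hat, D, c=103.0):
    c1 = 16 * c ** 2
    c2 = math.log(2) / (4 * c ** 2)
    term1 = c1 * math.log2(c2 * gamma_hat ** 2 * l) / gamma_hat ** 2
    term2 = (2 * D / gamma_hat) * math.log2(
        math.e * ((2 * l - 1) * gamma_hat / (2 * D) + 1))
    return term1 + term2 + 2

# more slack (larger D) weakens the bound, all else equal
assert kappa(10**6, 1.0, 2.0) > kappa(10**6, 1.0, 1.0)
```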
The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}.
The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂; that is, it increases with w. This justifies measuring the complexity of the estimated region by the size of w, and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Scholkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3) that has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Scholkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which
³ Hayton, Scholkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
ν, width c:      0.5, 0.5 | 0.5, 0.5 | 0.1, 0.5 | 0.5, 0.1
frac. SVs/OLs:   0.54, 0.43 | 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38
margin ρ/‖w‖:    0.84 | 0.70 | 0.62 | 0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction of ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels (top row: 6 9 2 8 1 8 8 6 5 3; bottom row: 2 3 8 7 0 3 0 8 2 7).
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such) while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3% to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
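The experiment cannot be reproduced here without the USPS data and the authors' solver, but its structure can be sketched with substitutes. The following sketch (our illustration, not the paper's code) uses scikit-learn's digits data and its OneClassSVM, appends a 10-dimensional label encoding to each pattern as described above, and ranks patterns by the SVM output; the kernel parameter gamma and the ν = 5% setting are illustrative assumptions, not the paper's exact choices.

```python
# Hedged sketch: the paper used the USPS test set and its own SMO solver;
# here we substitute scikit-learn's digits data and OneClassSVM, which
# implements the same nu-parameterized single-class SVM.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import OneClassSVM

digits = load_digits()
X = digits.data / 16.0                      # scale pixel values to [0, 1]
# Augment each input with 10 extra dimensions encoding the class label,
# so that "usual patterns with unusual labels" can also surface as outliers.
labels = np.eye(10)[digits.target]
X_aug = np.hstack([X, labels])

# nu = 0.05 reflects an expectation of roughly 5% "bad" patterns.
clf = OneClassSVM(kernel="rbf", gamma=0.02, nu=0.05).fit(X_aug)
scores = clf.decision_function(X_aug)       # argument of the sgn function

worst = np.argsort(scores)[:20]             # 20 lowest outputs = worst outliers
for i in worst[:5]:
    print(f"index {i}: label {digits.target[i]}, score {scores[i]:.4f}")
```

The most negative outputs play the role of the ranked outliers of Figure 5.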
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size that is at most quadratic.
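A rough, hedged sketch of such a scaling study (substituting scikit-learn's libsvm-based OneClassSVM for the paper's SMO implementation, and synthetic data for USPS) times training on nested subsets and estimates the exponent from a log-log fit:

```python
# Illustrative scaling experiment, not the paper's: time a single-class SVM
# on nested subsets and read off the empirical exponent from a log-log fit,
# as done for Figure 6.
import time
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
data = rng.randn(2000, 16)

sizes, times = [250, 500, 1000, 2000], []
for n in sizes:
    t0 = time.perf_counter()
    OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(data[:n])
    times.append(time.perf_counter() - t0)

# Slope of log(time) vs. log(size) estimates the empirical scaling exponent.
slope = np.polyfit(np.log(sizes), np.log(times), 1)[0]
print(f"empirical exponent ~ {slope:.2f}")
```

On such small samples the estimate is noisy, mirroring the caveat stated for Figure 6.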
In addition to the experiments reported above, the algorithm has since
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult," in that they are either atypical or even mislabeled. The label/output pairs, in rank order: 9/−513, 1/−507, 0/−458, 1/−377, 7/−282, 2/−216, 3/−200, 9/−186, 5/−179, 0/−162, 3/−153, 6/−143, 6/−128, 0/−123, 7/−117, 5/−93, 0/−78, 7/−58, 6/−52, 3/−48.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set, Size ℓ = 2007.

ν [%]   Fraction of OLs [%]   Fraction of SVs [%]   Training Time [CPU sec]
  1            0.0                  10.0                     36
  2            0.0                  10.0                     39
  3            0.1                  10.0                     31
  4            0.6                  10.1                     40
  5            1.4                  10.6                     36
  6            1.8                  11.2                     33
  7            2.6                  11.5                     42
  8            4.1                  12.0                     53
  9            5.4                  12.9                     76
 10            6.2                  13.7                     65
 20           16.9                  22.6                    193
 30           27.5                  31.8                    269
 40           37.1                  41.7                    685
 50           47.4                  51.2                   1284
 60           58.2                  61.0                   1150
 70           68.3                  70.7                   1512
 80           78.5                  80.5                   2206
 90           89.4                  90.1                   2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. Proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments (ν = 5%) is shown in boldface.
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz; training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Scholkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, already appears in density estimation when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x)·dx, while dx is transformed too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is the Green's function of P*P for an operator P mapping into some inner product space (Smola, Scholkopf, & Muller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
Σ_{i,j} α_i α_j k(x_i, x_j) = Σ_{i,j} α_i α_j ⟨k(x_i, ·), δ_{x_j}(·)⟩
    = Σ_{i,j} α_i α_j ⟨k(x_i, ·), (P*P k)(x_j, ·)⟩
    = Σ_{i,j} α_i α_j ⟨(P k)(x_i, ·), (P k)(x_j, ·)⟩
    = ⟨P Σ_i α_i k(x_i, ·), P Σ_j α_j k(x_j, ·)⟩
    = ‖Pf‖²,
using f(x) = Σ_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^{−‖Pf‖²}
on the function space.
    Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
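The identity underlying this derivation, that the dual quadratic form is a squared norm in feature space, can be checked numerically for a kernel with an explicit feature map. The following sketch (ours, not from the paper) uses the homogeneous quadratic kernel k(x, y) = (x·y)², whose feature map is φ(x) = vec(x xᵀ):

```python
# Numeric sanity check: for k(x, y) = (x·y)^2, the feature map
# phi(x) = outer(x, x).ravel() is explicit, so we can confirm that
# sum_ij a_i a_j k(x_i, x_j) equals ||w||^2 for w = sum_i a_i phi(x_i).
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(5, 3)          # five patterns in R^3
a = rng.rand(5)              # arbitrary nonnegative coefficients

K = (X @ X.T) ** 2           # kernel matrix for the quadratic kernel
dual = a @ K @ a             # dual quadratic form sum_ij a_i a_j k(x_i, x_j)

Phi = np.array([np.outer(x, x).ravel() for x in X])   # explicit feature map
w = a @ Phi                  # weight vector in feature space
assert np.allclose(dual, w @ w)
print("dual quadratic form equals ||w||^2:", dual)
```

The same identity holds implicitly for the gaussian kernel, whose feature space is infinite-dimensional.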
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe; a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Scholkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined, and the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Scholkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply
Ĉ_ℓ = ∪_{i=1}^{ℓ} B(x_i, ε_ℓ),    (A.1)

where B(x, ε) is the l₂(X) ball of radius ε centered at x and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), Ĉ^plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ^plug-in, they analyzed Ĉ^plug-in_β = {x : p̂_ℓ(x) > β_ℓ}, where (β_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to β, but this connection cannot be readily exploited by this type of estimator.
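For concreteness, the union-of-balls estimator can be implemented in a few lines: membership in the estimated support reduces to a nearest-neighbor distance test. The radius below is an arbitrary illustrative choice, not the radius sequence analyzed by Devroye and Wise (1980):

```python
# Sketch of the nonparametric support estimator: the estimate is the union of
# l2 balls of radius eps centered at the training points, so a test point
# belongs to it iff its nearest training point is within eps.
import numpy as np

def in_support(x, train, eps):
    """Membership in the union-of-balls estimate: min_i ||x - x_i|| <= eps."""
    dists = np.linalg.norm(train - x, axis=1)
    return dists.min() <= eps

rng = np.random.RandomState(0)
train = rng.uniform(-1.0, 1.0, size=(500, 2))   # true support is [-1, 1]^2
eps = 0.2                                       # illustrative radius

print(in_support(np.array([0.0, 0.0]), train, eps))   # point inside the support
print(in_support(np.array([3.0, 3.0]), train, eps))   # point far outside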
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Muller, 1992) to construct an estimator of "generalized α-clusters," related to C(α).
Define the excess mass over C at level α as

E_C(α) = sup {H_α(C) : C ∈ C},

where H_α(·) = (P − αλ)(·) and again λ denotes Lebesgue measure. Any set Γ_C(α) ∈ C such that

E_C(α) = H_α(Γ_C(α))

is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts Ê_C(α) and Γ̂_C(α). In other words, his estimator is

Γ̂_C(α) = arg max {(P_ℓ − αλ)(C) : C ∈ C},

where the max is not necessarily unique. Now while Γ̂_C(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, Γ̂_C(α) is more closely related to the determination of density contour clusters at level α:

c_p(α) = {x : p(x) ≥ α}.
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A_ε^(ℓ) with respect to p by

A_ε^(ℓ) = {(x_1, ..., x_ℓ) ∈ S^ℓ : |−(1/ℓ) log p(x_1, ..., x_ℓ) − h(X)| ≤ ε},

where p(x_1, ..., x_ℓ) = Π_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

vol A_ε^(ℓ) ≐ vol C_ℓ(1 − δ) ≐ 2^{ℓh}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).
    Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore, the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).
Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i (w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Scholkopf, Smola, et al., 2000).
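Parts i and ii of proposition 3 can also be observed empirically. The following hedged sketch uses scikit-learn's OneClassSVM (a libsvm-based implementation of the same ν-parameterized problem) in place of the authors' solver; the small tolerances allow for finite-precision solutions:

```python
# Empirical check of the nu-property: nu upper-bounds the fraction of
# outliers and lower-bounds the fraction of support vectors.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(400, 2)

for nu in (0.1, 0.3, 0.5):
    clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
    # Outliers are points strictly outside the estimated region.
    frac_outliers = np.mean(clf.decision_function(X) < -1e-6)
    frac_svs = len(clf.support_) / len(X)
    assert frac_outliers <= nu + 0.01 and frac_svs >= nu - 0.01
    print(f"nu={nu}: outliers {frac_outliers:.3f} <= nu <= SVs {frac_svs:.3f}")
```

As in Table 1, the bounds have some slack at finite sample sizes but the ordering holds.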
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x′_o = x_o + δ·w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ′_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α′ = α is still feasible, as is the primal solution (w′, ξ′, ρ′). Here, we use ξ′_i = (1 + δ·α_o) ξ_i for i ≠ o, w′ = (1 + δ·α_o) w, and ρ′ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α′_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, let A be a subset of X, and let ε > 0. A set U ⊆ X is an ε-cover for A if for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
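A simple greedy procedure produces an ε-cover of a finite set and hence an upper bound on its ε-covering number. The sketch below (our illustration, not from the paper) uses the l∞ distance between vectors of function values, matching the pseudonorm used in the analysis:

```python
# Greedy epsilon-cover of a finite point set under the l_inf metric; its size
# upper-bounds the covering number N(eps, A, d) of Definition 3.
import numpy as np

def greedy_cover(A, eps):
    """Return indices of a greedy eps-cover of the rows of A in l_inf."""
    uncovered = list(range(len(A)))
    centers = []
    while uncovered:
        c = uncovered[0]                 # pick any uncovered point as a center
        centers.append(c)
        # Keep only the points farther than eps from the new center.
        uncovered = [i for i in uncovered
                     if np.max(np.abs(A[i] - A[c])) > eps]
    return centers

rng = np.random.RandomState(0)
A = rng.rand(200, 5)                     # 200 "functions" on a 5-point sample
cover = greedy_cover(A, eps=0.5)
print(f"greedy cover size (upper-bounds N(0.5, A, l_inf)): {len(cover)}")
```

Every point ends up within ε of some chosen center, so the returned set is a valid ε-cover (though not necessarily a minimal one).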
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ norm over a finite sample X = (x_1, ..., x_ℓ) for the pseudonorm on F:

‖f‖_{l∞^X} = max_{x ∈ X} |f(x)|.    (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose X is a compact subset of the input space X. Let N(ε, F, ℓ) = max_{X ∈ X^ℓ} N(ε, F, l∞^X) (the maximum exists by definition of the covering number and compactness of X).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, ..., x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P{x : f(x) < min_{i ∈ [ℓ]} f(x_i) − 2γ} ≤ ε(ℓ, k, δ) = (2/ℓ)(k + log(ℓ/δ)),

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Scholkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i ∈ [ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.
Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with countable support supp(f); that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

f · g = Σ_{x ∈ supp(f)} f(x) g(x).

The 1-norm on L(X) is defined by ‖f‖₁ = Σ_{x ∈ supp(f)} f(x), and let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

δ_x(y) = 1 if y = x; 0 otherwise.

For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by

g_f(y) = g_f^{X,θ}(y) = Σ_{x ∈ X} d(x, f, θ) δ_x(y).
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g_f^{X,θ} ∈ L_B(X) (i.e., Σ_{x ∈ X} d(x, f, θ) ≤ B),

P{x : f(x) < θ − 2γ} ≤ (2/ℓ)(k + log(ℓ/δ)),    (B.3)

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.
The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability. The proof is given in Scholkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997); their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Scholkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Scholkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnorr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Scholkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Muller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at: http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Scholkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Scholkopf, B., Smola, A., & Muller, K.-R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Scholkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Scholkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Scholkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Scholkopf, B., & Muller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Scholkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Scholkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition [in Russian]. Moscow: Nauka. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Scholkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at: http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Scholkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
Similar to the above, R² is computed such that for any xi with 0 < αi < 1/(νℓ), the argument of the sgn is zero.
For kernels k(x, y) that depend on only x − y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem, equations 3.14 and 3.15, turns out to be equivalent to equation 3.11. It can be shown that the same holds true for the decision function; hence, the two algorithms coincide in that case. This is geometrically plausible: for constant k(x, x), all mapped patterns lie on a sphere in feature space. Therefore, finding the smallest sphere (containing the points) really amounts to finding the smallest segment of the sphere that the data live on. The segment, however, can be found in a straightforward way by simply intersecting the data sphere with a hyperplane; the hyperplane with maximum margin of separation to the origin will cut off the smallest segment.
4 Optimization
Section 3 formulated quadratic programs (QPs) for computing regions that capture a certain fraction of the data. These constrained optimization problems can be solved using an off-the-shelf QP package to compute the solution. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In this section, we describe an algorithm that takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (sequential minimal optimization), an SV training algorithm originally proposed for classification (Platt, 1999) and subsequently adapted to regression estimation (Smola & Scholkopf, in press).
The strategy of SMO is to break up the constrained minimization of equation 3.11 into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. We therefore resort to optimizing over pairs of variables.
4.1 Elementary Optimization Step. For instance, consider optimizing over α1 and α2 with all other variables fixed. Using the shorthand Kij := k(xi, xj), equation 3.11 then reduces to

min_{α1,α2}  (1/2) Σ_{i,j=1}^2 αi αj Kij + Σ_{i=1}^2 αi Ci + C,   (4.1)

with Ci := Σ_{j=3}^ℓ αj Kij and C := Σ_{i,j=3}^ℓ αi αj Kij, subject to

0 ≤ α1, α2 ≤ 1/(νℓ),   Σ_{i=1}^2 αi = Δ,   (4.2)

where Δ := 1 − Σ_{i=3}^ℓ αi.

We discard C, which is independent of α1 and α2, and eliminate α1 to obtain

min_{α2}  (1/2)(Δ − α2)² K11 + (Δ − α2) α2 K12 + (1/2) α2² K22 + (Δ − α2) C1 + α2 C2,   (4.3)

with the derivative

−(Δ − α2) K11 + (Δ − 2α2) K12 + α2 K22 − C1 + C2.   (4.4)

Setting this to zero and solving for α2, we get

α2 = (Δ (K11 − K12) + C1 − C2) / (K11 + K22 − 2K12).   (4.5)
Once α2 is found, α1 can be recovered from α1 = Δ − α2. If the new point (α1, α2) is outside of [0, 1/(νℓ)], the constrained optimum is found by projecting α2 from equation 4.5 into the region allowed by the constraints and then recomputing α1.
The offset ρ is recomputed after every such step.

Additional insight can be obtained by rewriting equation 4.5 in terms of the outputs of the kernel expansion on the examples x1 and x2 before the optimization step. Let α1*, α2* denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. equation 3.10) read

Oi = K1i α1* + K2i α2* + Ci.   (4.6)

Using the latter to eliminate the Ci, we end up with an update equation for α2 which does not explicitly depend on α1*:

α2 = α2* + (O1 − O2) / (K11 + K22 − 2K12),   (4.7)

which shows that the update is essentially the ratio of the first and second derivatives of the objective function along the direction of ν-constraint satisfaction.
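The elementary step can be sketched in code. The following is a minimal numpy sketch of equations 4.5 to 4.7 together with the box projection implied by the constraints 4.2 (variable names are ours, not the authors'):

```python
import numpy as np

def smo_pair_step(alpha, K, i, j, nu, ell):
    """One elementary SMO step on the pair (alpha_i, alpha_j); cf. eqs. 4.5-4.7."""
    C_upper = 1.0 / (nu * ell)            # box constraint 0 <= alpha <= 1/(nu*ell)
    Delta = alpha[i] + alpha[j]           # preserved by the equality constraint
    O = K @ alpha                         # outputs of the kernel expansion, eq. 4.6
    eta = K[i, i] + K[j, j] - 2 * K[i, j]  # second derivative along the step direction
    if eta <= 0:                          # degenerate direction; skip this pair
        return alpha
    aj = alpha[j] + (O[i] - O[j]) / eta   # unconstrained optimum, eq. 4.7
    # project back into the feasible segment of the box
    aj = min(max(aj, max(0.0, Delta - C_upper)), min(C_upper, Delta))
    out = alpha.copy()
    out[j] = aj
    out[i] = Delta - aj                   # recompute alpha_i from the sum constraint
    return out
```

Since the step solves a one-dimensional convex problem exactly and then projects onto the feasible segment, the dual objective (1/2) Σ αi αj Kij never increases.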
Clearly, the same elementary optimization step can be applied to any pair of variables, not just α1 and α2. We next briefly describe how to do the overall optimization.
4.2 Initialization of the Algorithm. We start by setting a fraction ν of all αi, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that Σi αi = 1. Moreover, we set the initial ρ to max{Oi : i ∈ [ℓ], αi > 0}.
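This initialization can be sketched as follows (a hypothetical helper of ours, assuming the Σi αi = 1 normalization of section 3):

```python
import numpy as np

def init_alpha(ell, nu, rng):
    """Initialize dual variables: a fraction nu of them at the upper bound 1/(nu*ell)."""
    C_upper = 1.0 / (nu * ell)
    n_full = int(nu * ell)              # number of alphas set to the bound
    alpha = np.zeros(ell)
    idx = rng.permutation(ell)          # randomly chosen examples
    alpha[idx[:n_full]] = C_upper
    remainder = 1.0 - alpha.sum()       # nonzero iff nu*ell is not an integer
    if remainder > 1e-15:
        alpha[idx[n_full]] = remainder  # lands in (0, 1/(nu*ell))
    return alpha
```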
4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SVnb for the indices of variables that are not at bound, that is, SVnb := {i : i ∈ [ℓ], 0 < αi < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position.
We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that (Oi − ρ) · αi > 0 or (ρ − Oi) · (1/(νℓ) − αi) > 0. Once we have found one, say αi, we pick αj according to

j = arg max_{n ∈ SVnb} |Oi − On|.   (4.8)
We repeat that step, but the scan is performed only over SVnb.

In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SVnb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates.
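The KKT check used to pick the first variable can be written directly from the two conditions quoted above. A sketch (the tolerance parameter tol is ours, anticipating the accuracy-tolerance remark at the end of this section):

```python
def kkt_violators(alpha, O, rho, nu, ell, tol=1e-3):
    """Indices i with (O_i - rho) * alpha_i > 0 or
    (rho - O_i) * (1/(nu*ell) - alpha_i) > 0, up to tolerance tol."""
    C_upper = 1.0 / (nu * ell)
    return [i for i in range(len(alpha))
            if (O[i] - rho) * alpha[i] > tol
            or (rho - O[i]) * (C_upper - alpha[i]) > tol]
```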
In unusual circumstances, the choice heuristic, equation 4.8, cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999) and have been found to work well in the experiments we report below.
In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset ρ at the end.
In the next section, it will be argued that subtracting something from ρ is also advisable from a statistical point of view.
² This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2), and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4), and finally we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.
In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

xi = Φ(xi).   (5.1)
Definition 1. A data set

x1, ..., xℓ   (5.2)

is called separable if there exists some w ∈ F such that (w · xi) > 0 for i ∈ [ℓ].
If we use a gaussian kernel (see equation 3.3), then any data set x1, ..., xℓ is separable after it is mapped into feature space. To see this, note that k(xi, xj) > 0 for all i, j; thus all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(xi, xi) = 1 for all i, they all have unit length. Hence they are separable from the origin.
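This argument is easy to verify numerically; a small sketch (toy data, arbitrary kernel width) checking that a gaussian Gram matrix has positive entries (mapped patterns share one orthant) and unit diagonal (unit length):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(5, 3)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)  # gaussian kernel; width chosen arbitrarily
# K > 0 everywhere: all pairwise inner products in feature space are positive
# diag(K) == 1: every mapped pattern has unit norm
```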
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

min_{w ∈ F}  (1/2) ‖w‖²   subject to   (w · xi) ≥ ρ, i ∈ [ℓ].   (5.3)
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set

{(x1, +1), ..., (xℓ, +1), (−x1, −1), ..., (−xℓ, −1)}.   (5.4)
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set

{(x1, y1), ..., (xℓ, yℓ)}   (yi ∈ {±1} for i ∈ [ℓ]),   (5.5)

aligned such that (w · xi) is positive for yi = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set

{y1 x1, ..., yℓ xℓ}.   (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds cum grano salis for the training sets with margin errors and outliers, respectively, removed.
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Scholkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x) which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
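The ν-property is easy to check empirically with any implementation of the algorithm. As an illustration (not part of the paper), scikit-learn's OneClassSVM exposes the same ν-parameterization; a sketch on arbitrary toy data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(500, 2)           # toy data; any distribution works
nu = 0.2
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
frac_svs = len(clf.support_) / len(X)           # lower-bounded by nu (statement ii)
frac_outliers = (clf.predict(X) == -1).mean()   # upper-bounded by nu (statement i)
```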
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside a region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ R. For x ∈ X, let d(x, f, θ) := max{0, θ − f(x)}. Similarly, for a training sequence X = (x1, ..., xℓ), we define

D(X, f, θ) := Σ_{x ∈ X} d(x, f, θ).
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ X^ℓ generated i.i.d. from an unknown distribution P which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or, equivalently, equation 3.11), and obtain a solution fw given explicitly by equation 3.10. Let Rw,ρ := {x : fw(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

P{x′ : x′ ∉ Rw,ρ−γ} ≤ (2/ℓ) (k + log(ℓ²/(2δ))),   (5.7)

where

k = c1 log(c2 γ̂² ℓ) / γ̂² + (2D/γ̂) log(e ((2ℓ − 1) γ̂ / (2D) + 1)) + 2,   (5.8)

c1 = 16c², c2 = ln(2)/(4c²), c = 103, γ̂ = γ/‖w‖, D = D(X, fw,0, ρ) = D(X, fw,ρ, 0), and ρ is given by equation 3.12.
The training sample X defines (via the algorithm) the decision region Rw,ρ. We expect that new points generated according to P will lie in Rw,ρ. The theorem gives a probabilistic guarantee that new points lie in the larger region Rw,ρ−γ.
The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of Rw,ρ. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region Rw,ρ−γ: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂; that is, it increases with ‖w‖. This justifies measuring the complexity of the estimated region by the size of w, and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value, ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Scholkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
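The looseness can be seen concretely by evaluating the right-hand side of equation 5.7; a sketch (log taken to base 2, as stated above; the input values are made up for illustration, and c2 γ̂² ℓ must exceed 1 for the first term to be meaningful):

```python
import math

def error_bound(ell, gamma_hat, D, delta):
    """Right-hand side of eq. 5.7, with k from eq. 5.8."""
    c = 103.0
    c1 = 16 * c ** 2
    c2 = math.log(2) / (4 * c ** 2)
    k = (c1 * math.log2(c2 * gamma_hat ** 2 * ell) / gamma_hat ** 2
         + (2 * D / gamma_hat)
         * math.log2(math.e * ((2 * ell - 1) * gamma_hat / (2 * D) + 1))
         + 2)
    return (2.0 / ell) * (k + math.log2(ell ** 2 / (2 * delta)))

# even with a million examples the bound exceeds 1 (vacuous),
# reflecting the large constant c
```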
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Scholkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which
³ Hayton, Scholkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
ν, width c       0.5, 0.5    0.5, 0.5    0.1, 0.5    0.5, 0.1
frac. SVs/OLs    0.54, 0.43  0.59, 0.47  0.24, 0.03  0.65, 0.38
margin ρ/‖w‖     0.84        0.70        0.62        0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that, as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
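The ν = 1 limit mentioned in the caption is easy to see in code: the box constraints then force every αi to 1/ℓ, so the kernel expansion is exactly an (unnormalized) Parzen windows estimate. A sketch with toy data and an arbitrary kernel width:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 2)
ell = len(X)
alpha = np.full(ell, 1.0 / ell)  # nu = 1: all alphas at the bound 1/(nu*ell) = 1/ell

def f(x):
    """Kernel expansion sum_i alpha_i k(x_i, x); here a Parzen windows estimate."""
    sq = ((X - x) ** 2).sum(axis=1)
    return alpha @ np.exp(-sq / 0.5)
```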
Estimating the Support of a High-Dimensional Distribution 1457
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
1458 Bernhard Scholkopf et al
Figure 4: A subset of 20 examples randomly drawn from the USPS test set, with class labels 6 9 2 8 1 8 8 6 5 3 (top row) and 2 3 8 7 0 3 0 8 2 7 (bottom row).
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such) while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above, and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3% to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
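The label augmentation described above can be done by appending a scaled one-hot encoding of the class label to each input vector (a sketch; the paper does not specify the encoding or its scale, so both are our assumptions):

```python
import numpy as np

def augment_with_labels(X, y, n_classes=10, scale=1.0):
    """Append n_classes extra dimensions encoding the class label of each row.

    The scale of the one-hot block relative to the pixel features is a
    hypothetical free parameter, not specified in the paper.
    """
    onehot = np.eye(n_classes)[y] * scale  # one row per example
    return np.hstack([X, onehot])
```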
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
In addition to the experiments reported above, the algorithm has since
label: output   9: −513   1: −507   0: −458   1: −377   7: −282   2: −216   3: −200   9: −186   5: −179   0: −162
                3: −153   6: −143   6: −128   0: −123   7: −117   5: −93    0: −78    7: −58    6: −52    3: −48

Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult," in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set (Size ℓ = 2007).

ν [%]   Fraction of OLs [%]   Fraction of SVs [%]   Training Time (CPU sec)
  1          0.0                  10.0                     36
  2          0.0                  10.0                     39
  3          0.1                  10.0                     31
  4          0.6                  10.1                     40
  5          1.4                  10.6                     36
  6          1.8                  11.2                     33
  7          2.6                  11.5                     42
  8          4.1                  12.0                     53
  9          5.4                  12.9                     76
 10          6.2                  13.7                     65
 20         16.9                  22.6                    193
 30         27.5                  31.8                    269
 40         37.1                  41.7                    685
 50         47.4                  51.2                   1284
 60         58.2                  61.0                   1150
 70         68.3                  70.7                   1512
 80         78.5                  80.5                   2206
 90         89.4                  90.1                   2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is the ν = 5% row.
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Scholkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often, there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Scholkopf, & Muller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
Σ_{i,j} αi αj k(xi, xj) = Σ_{i,j} αi αj (k(xi, ·) · δ_{xj}(·))
    = Σ_{i,j} αi αj (k(xi, ·) · (P*P k)(xj, ·))
    = Σ_{i,j} αi αj ((P k)(xi, ·) · (P k)(xj, ·))
    = ((P Σ_i αi k)(xi, ·) · (P Σ_j αj k)(xj, ·))
    = ‖P f‖²,
using f(x) = Σ_i αi k(xi, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ~ e^{−‖Pf‖²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
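That the dual objective's quadratic form is a squared norm in feature space can be verified directly for a kernel with a known explicit feature map. A sketch using the degree-2 homogeneous polynomial kernel k(x, y) = (x · y)² in two dimensions (an illustrative choice of ours; the paper's experiments use the gaussian kernel):

```python
import numpy as np

def phi(x):
    """Explicit feature map satisfying (x . y)^2 = phi(x) . phi(y)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    return np.dot(x, y) ** 2

rng = np.random.RandomState(0)
X = rng.randn(4, 2)
alpha = rng.rand(4)

# the quadratic form from the dual objective ...
quad = sum(alpha[i] * alpha[j] * k(X[i], X[j])
           for i in range(4) for j in range(4))
# ... equals the squared feature-space norm of w = sum_i alpha_i phi(x_i)
w = sum(alpha[i] * phi(X[i]) for i in range(4))
```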
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space, and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Scholkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined, and the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, along with algorithmic extensions such as the possibility of taking into account information about the "abnormal" class (Scholkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply

Ĉℓ = ∪_{i=1}^ℓ B(xi, εℓ),   (A.1)
where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D
copyx Op`(x) gt 0
ordfwhere
Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed
OCplug-inb D
copyx Op`(x) gt b`
ordf where (b`)` is an appropriately chosen se-
quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator
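As a concrete illustration, the estimator of equation A.1 admits a direct membership test: a point belongs to $\hat{C}_\ell$ if and only if its distance to the nearest sample point is at most $\epsilon_\ell$. The following sketch (ours, in numpy; the sample and radius are arbitrary assumptions, and in practice $(\epsilon_\ell)_\ell$ would be a decreasing sequence in the sample size) is not part of the original paper.

```python
import numpy as np

def in_ball_union(x, sample, eps):
    """Membership in the estimator of equation A.1: x lies in the union
    of l2 balls of radius eps centered at the sample points iff its
    distance to the nearest sample point is at most eps."""
    return bool(np.min(np.linalg.norm(sample - x, axis=1)) <= eps)

# Toy usage with a hand-picked sample (an assumption for illustration):
sample = np.array([[0.1, 0.1], [0.5, 0.52], [0.9, 0.3]])
print(in_ball_union(np.array([0.5, 0.5]), sample, eps=0.15))  # True
print(in_ball_union(np.array([5.0, 5.0]), sample, eps=0.15))  # False
```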
The most recent work relating to the estimation of $C(1)$ is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of $C(1)$. Two examples are vol $C(1)$ and the center of $C(1)$ (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions ($\alpha \neq 1$). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized $\alpha$-clusters" related to $C(\alpha)$.

Define the excess mass over $\mathcal{C}$ at level $\alpha$ as

$$E_{\mathcal{C}}(\alpha) = \sup\,\{H_\alpha(C) : C \in \mathcal{C}\},$$

where $H_\alpha(\cdot) = (P - \alpha\lambda)(\cdot)$ and again $\lambda$ denotes Lebesgue measure. Any set $\Gamma_{\mathcal{C}}(\alpha) \in \mathcal{C}$ such that

$$E_{\mathcal{C}}(\alpha) = H_\alpha(\Gamma_{\mathcal{C}}(\alpha))$$

is called a generalized $\alpha$-cluster in $\mathcal{C}$. Replace $P$ by the empirical measure $P_\ell$ in these definitions to obtain their empirical counterparts $\hat{E}_{\mathcal{C}}(\alpha)$ and $\hat{\Gamma}_{\mathcal{C}}(\alpha)$. In other words, his estimator is

$$\hat{\Gamma}_{\mathcal{C}}(\alpha) = \arg\max\,\{(P_\ell - \alpha\lambda)(C) : C \in \mathcal{C}\},$$
1464 Bernhard Scholkopf et al
where the max is not necessarily unique. Now, while $\hat{\Gamma}_{\mathcal{C}}(\alpha)$ is clearly different from $\hat{C}_\ell(\alpha)$, it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, $\hat{\Gamma}_{\mathcal{C}}(\alpha)$ is more closely related to the determination of density contour clusters at level $\alpha$:

$$c_p(\alpha) = \{x : p(x) \geq \alpha\}.$$
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating $c_p(\alpha)$. They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that $\mathcal{X}$ is one-dimensional. Suppose $X$ is a one-dimensional random variable with density $p$. Let $S = C(1)$ be the support of $p$ and define the differential entropy of $X$ by $h(X) = -\int_S p(x) \log p(x)\,dx$. For $\epsilon > 0$ and $\ell \in \mathbb{N}$, define the typical set $A_\epsilon^{(\ell)}$ with respect to $p$ by

$$A_\epsilon^{(\ell)} = \left\{(x_1, \ldots, x_\ell) \in S^\ell : \left|-\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X)\right| \leq \epsilon\right\},$$

where $p(x_1, \ldots, x_\ell) = \prod_{i=1}^{\ell} p(x_i)$. If $(a_\ell)_\ell$ and $(b_\ell)_\ell$ are sequences, the notation $a_\ell \doteq b_\ell$ means $\lim_{\ell \to \infty} \tfrac{1}{\ell} \log \tfrac{a_\ell}{b_\ell} = 0$. Cover and Thomas (1991) show that for all $\epsilon, \delta < \tfrac{1}{2}$,

$$\mathrm{vol}\, A_\epsilon^{(\ell)} \doteq \mathrm{vol}\, C_\ell(1 - \delta) \doteq 2^{\ell h}.$$

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately $2^{\ell h}$. This is an $\ell$-dimensional volume, so the corresponding side length is $(2^{\ell h})^{1/\ell} = 2^h$. This provides an interpretation of differential entropy" (p. 227).
Appendix B: Proofs for Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some $\rho > 0$ and $w \in F$ such that $(w \cdot x_i) \geq \rho$ for $i \in [\ell]$ (by rescaling $w$, this can be seen to work for arbitrarily large $\rho$). The distance of the hyperplane $\{z \in F : (w \cdot z) = \rho\}$ to the origin is $\rho / \|w\|$. Therefore, the optimal hyperplane is obtained by minimizing $\|w\|$ subject to these constraints; that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that $(-w, \rho)$ parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by $(w, \rho)$. This provides an optimal separation of the two sets, with distance $2\rho/\|w\|$, and a separating hyperplane $(w, 0)$.

Ad (ii). By assumption, $w$ is the shortest vector satisfying $y_i (w \cdot x_i) \geq \rho$ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying $(w \cdot y_i x_i) \geq \rho$ for $i \in [\ell]$.
Proof Sketch (Proposition 3). When changing $\rho$, the term $\sum_i \xi_i$ in equation 3.4 will change proportionally to the number of points that have a nonzero $\xi_i$ (the outliers), plus, when changing $\rho$ in the positive direction, the number of points that are just about to get a nonzero $\xi_i$, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose $x_o$ is an outlier, that is, $\xi_o > 0$; hence, by the KKT conditions (Bertsekas, 1995), $\alpha_o = 1/(\nu\ell)$. Transforming it into $x'_o = x_o + \delta \cdot w$, where $|\delta| < \xi_o/\|w\|$, leads to a slack that is still nonzero, that is, $\xi'_o > 0$; hence, we still have $\alpha_o = 1/(\nu\ell)$. Therefore, $\alpha' = \alpha$ is still feasible, as is the primal solution $(w', \xi', \rho')$. Here, we use $\xi'_i = (1 + \delta \cdot \alpha_o)\xi_i$ for $i \neq o$, $w' = (1 + \delta \cdot \alpha_o)w$, and $\rho'$ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still $\alpha'_o = 1/(\nu\ell)$. Thus (Bertsekas, 1995), $\alpha$ is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class $\mathcal{F}$ of functions that map $\mathcal{X}$ to $\mathbb{R}$.

Definition 3. Let $(\mathcal{X}, d)$ be a pseudometric space, and let $A$ be a subset of $\mathcal{X}$ and $\epsilon > 0$. A set $U \subseteq \mathcal{X}$ is an $\epsilon$-cover for $A$ if, for every $a \in A$, there exists $u \in U$ such that $d(a, u) \leq \epsilon$. The $\epsilon$-covering number of $A$, $\mathcal{N}(\epsilon, A, d)$, is the minimal cardinality of an $\epsilon$-cover for $A$ (if there is no such finite cover, then it is defined to be $\infty$).

Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional, but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that $U$ should be finite but approximate all of $A$ with respect to the pseudometric $d$. We will use the $l_\infty$ norm over a finite sample $X = (x_1, \ldots, x_\ell)$ for the pseudonorm on $\mathcal{F}$:

$$\|f\|_{l_\infty^X} = \max_{x \in X} |f(x)|. \qquad (B.2)$$

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose $\mathcal{X}$ is compact. Let $\mathcal{N}(\epsilon, \mathcal{F}, \ell) = \max_{X \in \mathcal{X}^\ell} \mathcal{N}(\epsilon, \mathcal{F}, l_\infty^X)$ (the maximum exists by definition of the covering number and compactness of $\mathcal{X}$).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).

Below, the notation $\lceil t \rceil$ denotes the smallest integer $\geq t$. Several of the results stated claim that some event occurs with probability $1 - \delta$. In all cases, $0 < \delta < 1$ is understood, and $\delta$ is presumed in fact to be small.
Theorem 2. Consider any distribution $P$ on $\mathcal{X}$ and a sturdy real-valued function class $\mathcal{F}$ on $\mathcal{X}$. Suppose $x_1, \ldots, x_\ell$ are generated i.i.d. from $P$. Then with probability $1 - \delta$ over such an $\ell$-sample, for any $f \in \mathcal{F}$ and for any $\gamma > 0$,

$$P\left\{x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma\right\} \leq \epsilon(\ell, k, \delta) = \frac{2}{\ell}\left(k + \log\frac{\ell}{\delta}\right),$$

where $k = \lceil \log \mathcal{N}(\gamma, \mathcal{F}, 2\ell) \rceil$.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum $\min_{i \in [\ell]} f(x_i)$. We therefore now consider a technique that will enable us to ignore some of the smallest values in the set $\{f(x_i) : x_i \in X\}$, at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.

We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable $\xi_i$ in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of $D$ (definition 2).
Let $\mathcal{X}$ be an inner product space. The following definition introduces a new inner product space based on $\mathcal{X}$.

Definition 4. Let $L(\mathcal{X})$ be the set of real-valued, nonnegative functions $f$ on $\mathcal{X}$ with support $\mathrm{supp}(f)$ countable; that is, functions in $L(\mathcal{X})$ are nonzero for at most countably many points. We define the inner product of two functions $f, g \in L(\mathcal{X})$ by

$$f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x) g(x).$$

The 1-norm on $L(\mathcal{X})$ is defined by $\|f\|_1 = \sum_{x \in \mathrm{supp}(f)} f(x)$, and we let $L_B(\mathcal{X}) = \{f \in L(\mathcal{X}) : \|f\|_1 \leq B\}$. Now we define an embedding of $\mathcal{X}$ into the inner product space $\mathcal{X} \times L(\mathcal{X})$ via $\tau : \mathcal{X} \to \mathcal{X} \times L(\mathcal{X})$, $\tau : x \mapsto (x, \delta_x)$, where $\delta_x \in L(\mathcal{X})$ is defined by

$$\delta_x(y) = \begin{cases} 1 & \text{if } y = x, \\ 0 & \text{otherwise.} \end{cases}$$

For a function $f \in \mathcal{F}$, a set of training examples $X$, and $\theta \in \mathbb{R}$, we define the function $g_f \in L(\mathcal{X})$ by

$$g_f(y) = g_f^{X,\theta}(y) = \sum_{x \in X} d(x, f, \theta)\, \delta_x(y).$$
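Because functions in $L(\mathcal{X})$ are nonzero at only countably many points, they can be represented directly as sparse maps from support points to values. The following sketch (ours; the toy $f$ and $\theta$ are assumptions for illustration) implements the inner product on $L(\mathcal{X})$, the spike functions $\delta_x$, and $g_f$:

```python
def inner(f, g):
    """Inner product on L(X): both functions are sparse (dicts), so it
    suffices to sum over the support of one of them."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def delta(x):
    """delta_x: the single spike at x used by the embedding tau."""
    return {x: 1.0}

def d(x, f, theta):
    """d(x, f, theta) = max(0, theta - f(x)), as in definition 2."""
    return max(0.0, theta - f(x))

def g_f(f, X, theta):
    """g_f = sum over x in X of d(x, f, theta) * delta_x, stored sparsely."""
    return {x: d(x, f, theta) for x in X if d(x, f, theta) > 0.0}

# Usage with a toy f and threshold (both assumptions of ours):
f = lambda x: x
X = (0.2, 0.5, 1.4)
g = g_f(f, X, theta=1.0)        # spikes only at the two points below theta
print(sorted(g))                # [0.2, 0.5]
print(inner(g, delta(0.5)))     # 0.5
```

Note that $\|g_f\|_1 = \sum_x d(x, f, \theta)$ recovers exactly the total slack $D(X, f, \theta)$ of definition 2, which is why the constraint $g_f \in L_B(\mathcal{X})$ in theorem 3 amounts to a bound on the slack.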
Theorem 3. Fix $B > 0$. Consider a fixed but unknown probability distribution $P$, which has no atomic components, on the input space $\mathcal{X}$, and a sturdy class of real-valued functions $\mathcal{F}$. Then with probability $1 - \delta$ over randomly drawn training sequences $X$ of size $\ell$, for all $\gamma > 0$ and any $f \in \mathcal{F}$ and any $\theta$ such that $g_f = g_f^{X,\theta} \in L_B(\mathcal{X})$ (i.e., $\sum_{x \in X} d(x, f, \theta) \leq B$),

$$P\{x : f(x) < \theta - 2\gamma\} \leq \frac{2}{\ell}\left(k + \log\frac{\ell}{\delta}\right), \qquad (B.3)$$

where $k = \left\lceil \log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell) + \log \mathcal{N}(\gamma/2, L_B(\mathcal{X}), 2\ell) \right\rceil$.

The assumption on $P$ can in fact be weakened to require only that there is no point $x \in \mathcal{X}$ satisfying $f(x) < \theta - 2\gamma$ that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which $f(x)$ has value less than $\theta - 2\gamma$, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by $2\gamma$ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to $\gamma$) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when $\gamma$ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity $k$ in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound $\log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell)$ and a result of Shawe-Taylor and Cristianini (2000) to bound $\log \mathcal{N}(\gamma/2, L_B(\mathcal{X}), 2\ell)$. We apply theorem 3, stratifying over possible values of $B$. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 3799-1 and Sm 621-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functional of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & P. Anuzis (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian.] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufman.
Received November 19, 1999; accepted September 22, 2000.
with $C_i = \sum_{j=3}^{\ell} \alpha_j K_{ij}$ and $C = \sum_{i,j=3}^{\ell} \alpha_i \alpha_j K_{ij}$, subject to

$$0 \leq \alpha_1, \alpha_2 \leq \frac{1}{\nu\ell}, \qquad \sum_{i=1}^{2} \alpha_i = \Delta, \qquad (4.2)$$

where $\Delta = 1 - \sum_{i=3}^{\ell} \alpha_i$.

We discard $C$, which is independent of $\alpha_1$ and $\alpha_2$, and eliminate $\alpha_1$ to obtain

$$\min_{\alpha_2} \; \frac{1}{2}(\Delta - \alpha_2)^2 K_{11} + (\Delta - \alpha_2)\alpha_2 K_{12} + \frac{1}{2}\alpha_2^2 K_{22} + (\Delta - \alpha_2) C_1 + \alpha_2 C_2, \qquad (4.3)$$

with the derivative

$$-(\Delta - \alpha_2) K_{11} + (\Delta - 2\alpha_2) K_{12} + \alpha_2 K_{22} - C_1 + C_2. \qquad (4.4)$$

Setting this to zero and solving for $\alpha_2$, we get

$$\alpha_2 = \frac{\Delta(K_{11} - K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2K_{12}}. \qquad (4.5)$$

Once $\alpha_2$ is found, $\alpha_1$ can be recovered from $\alpha_1 = \Delta - \alpha_2$. If the new point $(\alpha_1, \alpha_2)$ is outside of $[0, 1/(\nu\ell)]$, the constrained optimum is found by projecting $\alpha_2$ from equation 4.5 into the region allowed by the constraints and then recomputing $\alpha_1$.
The offset $\rho$ is recomputed after every such step.

Additional insight can be obtained by rewriting equation 4.5 in terms of the outputs of the kernel expansion on the examples $x_1$ and $x_2$ before the optimization step. Let $\alpha_1^*, \alpha_2^*$ denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. equation 3.10) read

$$O_i = K_{1i}\alpha_1^* + K_{2i}\alpha_2^* + C_i. \qquad (4.6)$$

Using the latter to eliminate the $C_i$, we end up with an update equation for $\alpha_2$, which does not explicitly depend on $\alpha_1^*$:

$$\alpha_2 = \alpha_2^* + \frac{O_1 - O_2}{K_{11} + K_{22} - 2K_{12}}, \qquad (4.7)$$

which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of $\nu$-constraint satisfaction.
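To make the elementary step concrete, here is a minimal sketch (ours, in numpy; function and variable names are assumptions) of the pair update of equations 4.5 to 4.7, including the projection into the constraint box:

```python
import numpy as np

def pair_update(alpha, K, i, j, nu):
    """One elementary optimization step on the pair (alpha_i, alpha_j):
    the analytic update of eq. 4.7, followed by projection into the box
    [0, 1/(nu*ell)] while keeping alpha_i + alpha_j fixed."""
    ell = len(alpha)
    C = 1.0 / (nu * ell)
    O = K @ alpha                          # outputs O_n (cf. eq. 4.6)
    denom = K[i, i] + K[j, j] - 2.0 * K[i, j]
    s = alpha[i] + alpha[j]
    aj = alpha[j] + (O[i] - O[j]) / denom  # unconstrained optimum, eq. 4.7
    aj = float(np.clip(aj, max(0.0, s - C), min(C, s)))
    alpha[j], alpha[i] = aj, s - aj
    return alpha

# One step on a toy problem with a gaussian Gram matrix:
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
K = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=2))
nu = 0.5
alpha = np.full(8, 1.0 / 8)               # feasible: sum = 1, inside box
obj_before = 0.5 * alpha @ K @ alpha
alpha = pair_update(alpha, K, 0, 1, nu)
obj_after = 0.5 * alpha @ K @ alpha
print(obj_after <= obj_before)            # True: a step cannot increase W
```

Since the update is a Newton step along a one-dimensional convex quadratic, followed by clipping toward the feasible interval, the dual objective can only decrease or stay constant.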
Clearly, the same elementary optimization step can be applied to any pair of two variables, not just $\alpha_1$ and $\alpha_2$. We next briefly describe how to do the overall optimization.
4.2 Initialization of the Algorithm. We start by setting a fraction $\nu$ of all $\alpha_i$, randomly chosen, to $1/(\nu\ell)$. If $\nu\ell$ is not an integer, then one of the examples is set to a value in $(0, 1/(\nu\ell))$ to ensure that $\sum_i \alpha_i = 1$. Moreover, we set the initial $\rho$ to $\max\{O_i : i \in [\ell],\ \alpha_i > 0\}$.
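The initialization just described can be sketched in a few lines (ours; names are assumptions, and we omit the initial $\rho = \max\{O_i : \alpha_i > 0\}$, which would be computed from the kernel expansion):

```python
import numpy as np

def init_alpha(ell, nu, rng):
    """Initialization of section 4.2: a randomly chosen fraction nu of
    the alpha_i is set to 1/(nu*ell); if nu*ell is not an integer, one
    further example receives the remainder so that sum(alpha) = 1."""
    C = 1.0 / (nu * ell)
    alpha = np.zeros(ell)
    idx = rng.permutation(ell)
    n_full = int(np.floor(nu * ell))
    alpha[idx[:n_full]] = C
    rest = 1.0 - n_full * C
    if rest > 1e-12:                       # nu*ell was not an integer
        alpha[idx[n_full]] = rest          # a value in (0, 1/(nu*ell))
    return alpha

alpha = init_alpha(ell=10, nu=0.33, rng=np.random.default_rng(0))
print(abs(alpha.sum() - 1.0) < 1e-12)      # True
```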
4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SVnb for the indices of variables that are not at bound, that is, $\text{SVnb} = \{i : i \in [\ell],\ 0 < \alpha_i < 1/(\nu\ell)\}$. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position.

We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that $(O_i - \rho) \cdot \alpha_i > 0$ or $(\rho - O_i) \cdot (1/(\nu\ell) - \alpha_i) > 0$. Once we have found one, say $\alpha_i$, we pick $\alpha_j$ according to

$$j = \arg\max_{n \in \text{SVnb}} |O_i - O_n|. \qquad (4.8)$$
We repeat that step, but the scan is performed only over SVnb.

In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SVnb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates.

In unusual circumstances, the choice heuristic, equation 4.8, cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999) and have been found to work well in the experiments we report below.

In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
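For illustration, the overall procedure can be condensed into a toy solver. The sketch below (ours, in numpy) uses random pair selection instead of the KKT-violation scan and the heuristics described above, which is a deliberate simplification and not the paper's working-set strategy; it then checks the $\nu$-property (proposition 3) on the result. All names and parameter choices are assumptions.

```python
import numpy as np

def one_class_smo(K, nu, steps=3000, seed=2):
    """Toy solver for the dual: min 0.5 a'Ka subject to
    0 <= a_i <= 1/(nu*ell), sum(a) = 1. Random pairs replace the
    KKT-based working-set selection of section 4.3."""
    ell = K.shape[0]
    C = 1.0 / (nu * ell)
    rng = np.random.default_rng(seed)
    a = np.full(ell, 1.0 / ell)              # feasible start
    for _ in range(steps):
        i, j = rng.choice(ell, size=2, replace=False)
        denom = K[i, i] + K[j, j] - 2.0 * K[i, j]
        if denom < 1e-12:
            continue
        O = K @ a                            # outputs of the expansion
        s = a[i] + a[j]
        aj = np.clip(a[j] + (O[i] - O[j]) / denom,   # eq. 4.7 ...
                     max(0.0, s - C), min(C, s))     # ... then project
        a[j], a[i] = aj, s - aj
    O = K @ a
    nb = (a > 1e-8) & (a < C - 1e-8)         # non-bound SVs lie on the plane
    rho = O[nb].mean() if nb.any() else O[a > 1e-8].min()
    return a, rho, O

# Gaussian kernel on 40 points in the plane
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
K = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=2))
nu = 0.3
a, rho, O = one_class_smo(K, nu)

C = 1.0 / (nu * len(a))
frac_sv = np.mean(a > 1e-8)                  # fraction of SVs
frac_out = np.mean(a >= C - 1e-8)            # alphas at the upper bound,
                                             # i.e., the outliers (KKT)
print(frac_out <= nu <= frac_sv)             # True
```

Note that the two fractions bracket $\nu$ for any feasible $\alpha$: since $\sum_i \alpha_i = 1$ and $0 \leq \alpha_i \leq 1/(\nu\ell)$, at most $\nu\ell$ coefficients can sit at the upper bound, and at least $\nu\ell$ must be nonzero.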
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset $\rho$ at the end.

In the next section, it will be argued that subtracting something from $\rho$ is actually advisable also from a statistical point of view.
² This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2) and show that the parameter $\nu$ characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4), and finally we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.

In this section, we use italic letters to denote the feature-space images of the corresponding patterns in input space, that is,

$$x_i = \Phi(\mathbf{x}_i). \qquad (5.1)$$
Definition 1. A data set

$$x_1, \ldots, x_\ell \qquad (5.2)$$

is called separable if there exists some $w \in F$ such that $(w \cdot x_i) > 0$ for $i \in [\ell]$.
If we use a gaussian kernel (see equation 3.3), then any data set $\mathbf{x}_1, \ldots, \mathbf{x}_\ell$ is separable after it is mapped into feature space: to see this, note that $k(\mathbf{x}_i, \mathbf{x}_j) > 0$ for all $i, j$; thus, all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since $k(\mathbf{x}_i, \mathbf{x}_i) = 1$ for all $i$, they all have unit length. Hence, they are separable from the origin.
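This argument can be checked numerically on arbitrary data: every entry of the gaussian Gram matrix is positive, the diagonal is 1, and, for instance, the mean of the mapped patterns already separates them from the origin, since its inner product with each mapped point is the mean of a row of the Gram matrix. The sketch below (ours; the data and kernel width are arbitrary assumptions) illustrates this:

```python
import numpy as np

# Gaussian Gram matrix on arbitrary data (our toy example):
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
c = 1.0
sq = np.sum((X[:, None] - X[None]) ** 2, axis=2)
K = np.exp(-sq / c)                  # K_ij = k(x_i, x_j) = Phi(x_i).Phi(x_j)

print(np.all(K > 0))                 # True: all patterns in the same orthant
print(np.allclose(np.diag(K), 1.0))  # True: unit length in feature space
# One explicit separating direction: w = mean of the mapped patterns.
# Then (w . Phi(x_i)) is the mean of row i of K, which is positive:
print(np.all(K.mean(axis=1) > 0))    # True
```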
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any $\rho > 0$, it is given by

$$\min_{w \in F} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad (w \cdot x_i) \geq \rho, \; i \in [\ell]. \qquad (5.3)$$
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose $(w, \rho)$ parameterizes the supporting hyperplane for the data in equation 5.2. Then $(w, 0)$ parameterizes the optimal separating hyperplane for the labeled data set

$$\{(x_1, 1), \ldots, (x_\ell, 1), (-x_1, -1), \ldots, (-x_\ell, -1)\}. \qquad (5.4)$$
(ii) Suppose $(w, 0)$ parameterizes the optimal separating hyperplane passing through the origin for a labeled data set

$$\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\} \quad (y_i \in \{\pm 1\} \text{ for } i \in [\ell]), \qquad (5.5)$$

aligned such that $(w \cdot x_i)$ is positive for $y_i = 1$. Suppose, moreover, that $\rho/\|w\|$ is the margin of the optimal hyperplane. Then $(w, \rho)$ parameterizes the supporting hyperplane for the unlabeled data set

$$\{y_1 x_1, \ldots, y_\ell x_\ell\}. \qquad (5.6)$$
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter $\nu$, is such a case.
Proposition 3 ($\nu$-property). Assume the solution of equations 3.4 and 3.5 satisfies $\rho \neq 0$. The following statements hold:

i. $\nu$ is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. $\nu$ is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution $P(\mathbf{x})$ which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, $\nu$ equals both the fraction of SVs and the fraction of outliers.

Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the $\nu$-parameterization given in section 3.
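Parts i and ii have a counting flavor that is already visible at the level of the dual constraints: since $\sum_i \alpha_i = 1$ and $0 \leq \alpha_i \leq 1/(\nu\ell)$, at most $\nu\ell$ coefficients can sit at the upper bound (which outliers do, by the KKT conditions), and at least $\nu\ell$ coefficients must be nonzero (every SV contributes at most $1/(\nu\ell)$ to the sum). A small self-contained check (ours; the example $\alpha$ is an arbitrary feasible vector, not the output of the algorithm):

```python
import numpy as np

def nu_counts(alpha, nu):
    """For a feasible dual vector (sum = 1, 0 <= alpha_i <= 1/(nu*ell)):
    count the entries at the upper bound (outliers have alpha_i =
    1/(nu*ell) by the KKT conditions) and the nonzero entries (SVs)."""
    ell = len(alpha)
    C = 1.0 / (nu * ell)
    at_bound = np.sum(alpha >= C - 1e-12)
    nonzero = np.sum(alpha > 1e-12)
    return at_bound, nonzero

ell, nu = 20, 0.25
C = 1.0 / (nu * ell)                     # = 0.2
alpha = np.zeros(ell)
alpha[:4] = C                            # four points pushed to the bound
alpha[4:14] = (1.0 - 4 * C) / 10         # the rest spread over ten points
at_bound, nonzero = nu_counts(alpha, nu)
print(at_bound, nonzero)                 # 4 14
print(at_bound <= nu * ell <= nonzero)   # True: bracket of proposition 3
```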
Proposition 4 (resistance). Local movements of outliers parallel to $w$ do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in $w$ and $\rho$ does.

We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution
lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside a region slightly larger than the estimated one.
Definition 2. Let $f$ be a real-valued function on a space $\mathcal{X}$. Fix $\theta \in \mathbb{R}$. For $x \in \mathcal{X}$, let $d(x, f, \theta) = \max\{0, \theta - f(x)\}$. Similarly, for a training sequence $X = (x_1, \ldots, x_\ell)$, we define

$$D(X, f, \theta) = \sum_{x \in X} d(x, f, \theta).$$
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
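Definition 2 translates directly into code: $d$ is a hinge-style shortfall below the threshold, and $D$ is its sum over the sample. A minimal sketch (ours; the toy decision function is an assumption, not from the paper):

```python
def d(x, f, theta):
    """Pointwise shortfall: d(x, f, theta) = max(0, theta - f(x))."""
    return max(0.0, theta - f(x))

def D(X, f, theta):
    """Total slack of a training sequence (definition 2)."""
    return sum(d(x, f, theta) for x in X)

# Usage with a toy decision function:
f = lambda x: x ** 2
X = [0.5, 1.0, 2.0]
print(D(X, f, theta=1.5))   # (1.5 - 0.25) + (1.5 - 1.0) + 0 = 1.75
```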
Theorem 1 (generalization error bound). Suppose we are given a set of $\ell$ examples $X \in \mathcal{X}^\ell$, generated i.i.d. from an unknown distribution $P$, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution $f_w$ given explicitly by equation 3.10. Let $R_{w,\rho} = \{\mathbf{x} : f_w(\mathbf{x}) \geq \rho\}$ denote the induced decision region. With probability $1 - \delta$ over the draw of the random sample $X \in \mathcal{X}^\ell$, for any $\gamma > 0$,

$$P\{\mathbf{x}' : \mathbf{x}' \notin R_{w,\rho-\gamma}\} \leq \frac{2}{\ell}\left(k + \log\frac{\ell^2}{2\delta}\right), \qquad (5.7)$$

where

$$k = \frac{c_1 \log(c_2 \hat{\gamma}^2 \ell)}{\hat{\gamma}^2} + \frac{2D}{\hat{\gamma}} \log\left(e\left(\frac{(2\ell - 1)\hat{\gamma}}{2D} + 1\right)\right) + 2, \qquad (5.8)$$

$c_1 = 16c^2$, $c_2 = \ln(2)/(4c^2)$, $c = 103$, $\hat{\gamma} = \gamma/\|w\|$, $D = D(X, f_{w,0}, \rho) = D(X, f_{w,\rho}, 0)$, and $\rho$ is given by equation 3.12.
The training sample $X$ defines (via the algorithm) the decision region $R_{w,\rho}$. We expect that new points generated according to $P$ will lie in $R_{w,\rho}$. The theorem gives a probabilistic guarantee that new points lie in the larger region $R_{w,\rho-\gamma}$.
The parameter $\nu$ can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of $R_{w,\rho}$. Adjusting $\nu$ will change the value of $D$. Note that since $D$ is measured with respect to $\rho$ while the bound applies to $\rho - \gamma$, any point that is outside the region that the bound applies to will make a contribution to $D$ that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter $\gamma$ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region $R_{w,\rho-\gamma}$: one can see from equation 5.8 that $k$, and hence the right-hand side of equation 5.7, scales inversely with $\gamma$. In fact, it scales inversely with $\hat{\gamma}$; that is, it increases with $\|w\|$. This justifies measuring the complexity of the estimated region by the size of $w$, and minimizing $\|w\|^2$ in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset $\rho$ returned by the algorithm, which would correspond to $\gamma = 0$, but a smaller value $\rho - \gamma$ (with $\gamma > 0$).
We do not claim that using theorem 1 directly is a practical means to determine the parameters $\nu$ and $\gamma$ explicitly. It is loose in several ways. We suspect $c$ is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing $\gamma$. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that $\nu$ and $\gamma$ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.

Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a Gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which
³ Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
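The footnote's heuristic can be sketched as follows. This is a hypothetical illustration on synthetic data using scikit-learn's OneClassSVM, not the authors' implementation; scikit-learn's `gamma` corresponds to 1/c in the Gaussian kernel k(x, y) = exp(−∥x − y∥²/c).

```python
# Hedged sketch of the footnote's heuristic for choosing the kernel width c:
# start with a small c and double it until the number of SVs stops decreasing.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(200, 2)  # stand-in for the training sample (an assumption)

def n_svs(c):
    # number of support vectors for a given kernel width c (gamma = 1/c)
    return OneClassSVM(kernel="rbf", gamma=1.0 / c, nu=0.1).fit(X).support_.size

c, prev = 0.1, None
while True:
    cur = n_svs(c)
    if prev is not None and cur >= prev:  # SV count no longer decreasing: stop
        break
    prev, c = cur, 2.0 * c
print("chosen c:", c, "number of SVs:", prev)
```

For very small c the machine essentially memorizes the sample (almost every point becomes an SV); the loop stops at the first width where the SV count plateaus.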
ν, width c:      0.5, 0.5 | 0.5, 0.5 | 0.1, 0.5 | 0.5, 0.1
frac. SVs/OLs:   0.54, 0.43 | 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38
margin ρ/∥w∥:    0.84 | 0.70 | 0.62 | 0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator, which uses all kernels; the other estimators use roughly a fraction ν of the kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
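The ν = 1 endpoint of this family can be sketched directly: every training point receives the same expansion coefficient, giving a Parzen windows estimate, and the offset is then chosen so that a fraction ν′ of the training points lie outside. A minimal sketch on toy data (the kernel width c and ν′ = 0.1 are assumptions):

```python
# Hedged sketch: Parzen windows estimate (equal weight on every kernel,
# k(x, y) = exp(-||x - y||^2 / c)) with the offset set so that a fraction
# nu' = 0.1 of the training points fall outside the estimated region.
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(100, 2)
c = 0.5

def parzen(x):
    return float(np.mean(np.exp(-np.sum((X - x) ** 2, axis=1) / c)))

scores = np.array([parzen(x) for x in X])
rho = np.quantile(scores, 0.1)         # offset: nu' of the data fall outside
frac_out = float(np.mean(scores < rho))
print("fraction outside:", frac_out)
```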
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0: output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels (top row: 6 9 2 8 1 8 8 6 5 3; bottom row: 2 3 8 7 0 3 0 8 2 7).
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3% to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
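The label-augmentation trick can be sketched as follows, with synthetic stand-in data and a hypothetical feature layout rather than the original USPS experiment: each pattern is concatenated with a one-hot encoding of its alleged label before training, and outliers are ranked by the SVM output.

```python
# Hedged sketch (not the authors' code): augment each input with extra
# dimensions encoding its class label, so a one-class SVM can flag both
# unusual patterns and usual patterns with unusual labels.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(2)
X = rng.randn(300, 16)            # stand-in for image features (an assumption)
y = rng.randint(0, 10, size=300)  # alleged class labels

X_aug = np.hstack([X, np.eye(10)[y]])  # 16 + 10 = 26 input dimensions
clf = OneClassSVM(kernel="rbf", gamma=1.0 / X_aug.shape[1], nu=0.05).fit(X_aug)

scores = clf.decision_function(X_aug)  # most negative output = worst outlier
worst = np.argsort(scores)[:20]        # analogous to the "20 worst outliers"
print("indices of the 20 worst outliers:", worst)
```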
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
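The scaling measurement can be reproduced in spirit as follows (a sketch on synthetic data, with scikit-learn's OneClassSVM standing in for the authors' SMO solver): time training on nested subsets and read the empirical exponent off a log-log fit, as in Figure 6.

```python
# Hedged sketch: estimate the empirical run-time exponent by timing training
# on nested subsets and fitting a line in log-log space.
import time

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(3)
X = rng.randn(2000, 10)  # synthetic stand-in for the data set (an assumption)

sizes = [250, 500, 1000, 2000]
times = []
for n in sizes:
    t0 = time.perf_counter()
    OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X[:n])
    times.append(time.perf_counter() - t0)

# slope of log(time) vs. log(size) approximates the scaling exponent
slope = np.polyfit(np.log(sizes), np.log(times), 1)[0]
print("empirical exponent:", round(slope, 2))
```

As the text cautions for the original measurements, timings on small samples are noisy, so the fitted exponent is only indicative.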
In addition to the experiments reported above, the algorithm has since
(Alleged labels and outputs for the 20 worst outliers, in reading order:
9: −513, 1: −507, 0: −458, 1: −377, 7: −282, 2: −216, 3: −200, 9: −186, 5: −179, 0: −162,
3: −153, 6: −143, 6: −128, 0: −123, 7: −117, 5: −93, 0: −78, 7: −58, 6: −52, 3: −48.)
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set, Size ℓ = 2007.
ν [%] | Fraction of OLs [%] | Fraction of SVs [%] | Training Time (CPU sec)
 1    |  0.0 | 10.0 |   36
 2    |  0.0 | 10.0 |   39
 3    |  0.1 | 10.0 |   31
 4    |  0.6 | 10.1 |   40
 5    |  1.4 | 10.6 |   36
 6    |  1.8 | 11.2 |   33
 7    |  2.6 | 11.5 |   42
 8    |  4.1 | 12.0 |   53
 9    |  5.4 | 12.9 |   76
10    |  6.2 | 13.7 |   65
20    | 16.9 | 22.6 |  193
30    | 27.5 | 31.8 |  269
40    | 37.1 | 41.7 |  685
50    | 47.4 | 51.2 | 1284
60    | 58.2 | 61.0 | 1150
70    | 68.3 | 70.7 | 1512
80    | 78.5 | 80.5 | 2206
90    | 89.4 | 90.1 | 2349
Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is shown in boldface.
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.

Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.

Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.

An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is the Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
Σ_{i,j} α_i α_j k(x_i, x_j) = Σ_{i,j} α_i α_j (k(x_i, ·) · δ_{x_j}(·))

= Σ_{i,j} α_i α_j (k(x_i, ·) · (P*P k)(x_j, ·))

= Σ_{i,j} α_i α_j ((P k)(x_i, ·) · (P k)(x_j, ·))

= ((P Σ_i α_i k(x_i, ·)) · (P Σ_j α_j k(x_j, ·))) = ∥P f∥²,
using f(x) = Σ_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∼ e^{−∥Pf∥²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
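The identity underlying this derivation can be checked numerically in the special case of a kernel with an explicit feature map, where the quadratic form equals the squared feature-space norm of the expansion. A sketch; the homogeneous quadratic kernel and the random data are assumptions:

```python
# Hedged numeric check: for a kernel with explicit feature map Phi, the dual
# objective sum_ij a_i a_j k(x_i, x_j) equals ||w||^2 with w = sum_i a_i Phi(x_i).
# Here k(x, y) = (x . y)^2, whose 2D feature map is (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

rng = np.random.RandomState(4)
X = rng.randn(5, 2)
a = rng.rand(5)

def k(x, y):
    return float(np.dot(x, y)) ** 2

def phi(x):
    # explicit map satisfying phi(x) . phi(y) = (x . y)^2
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

quad = sum(a[i] * a[j] * k(X[i], X[j]) for i in range(5) for j in range(5))
w = sum(a[i] * phi(X[i]) for i in range(5))
print(np.isclose(quad, np.dot(w, w)))  # True
```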
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.

From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe; a generic data set need not be separable from the origin by a hyperplane in input space.

The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space, and thereby increases the chances of a separation from the origin being possible. In particular, using a Gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.

We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a Gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined and the one of Vapnik and Chapelle (2000) will provide a solid foundation for this formidable task. This, alongside with algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply
Ĉ_ℓ = ∪_{i=1}^{ℓ} B(x_i, ε_ℓ), (A.1)
where B(x, ε) is the l₂(X) ball of radius ε centered at x, and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of equation A.1 with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1): Ĉ_plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ_plug-in, they analyzed Ĉ^b_plug-in = {x : p̂_ℓ(x) > b_ℓ}, where (b_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to b, but this connection cannot be readily exploited by this type of estimator.
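The estimator of equation A.1 is straightforward to sketch: a query point belongs to the estimated support iff it lies within ε_ℓ of some training point. The uniform toy sample and the fixed ε below are assumptions:

```python
# Hedged sketch of the nonparametric support estimator (A.1): the estimate is
# the union of balls of radius eps around the training points.
import numpy as np

rng = np.random.RandomState(5)
X = rng.rand(500, 2)   # sample from the uniform distribution on [0, 1]^2
eps = 0.1              # one element of the decreasing sequence (eps_l)

def in_support(x):
    # x is in the union of balls iff its nearest training point is within eps
    return bool(np.min(np.linalg.norm(X - x, axis=1)) <= eps)

print(in_support(np.array([0.5, 0.5])), in_support(np.array([3.0, 3.0])))
```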
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).

A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters," related to C(α).
Define the excess mass over 𝒞 at level α as

E_𝒞(α) = sup {H_α(C) : C ∈ 𝒞},

where H_α(·) = (P − αλ)(·), and again λ denotes Lebesgue measure. Any set Γ_𝒞(α) ∈ 𝒞 such that

E_𝒞(α) = H_α(Γ_𝒞(α))

is called a generalized α-cluster in 𝒞. Replace P by P_ℓ in these definitions to obtain their empirical counterparts Ê_𝒞(α) and Γ̂_𝒞(α). In other words, his estimator is

Γ̂_𝒞(α) = arg max {(P_ℓ − αλ)(C) : C ∈ 𝒞},
where the max is not necessarily unique. Now, while Γ̂_𝒞(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, Γ̂_𝒞(α) is more closely related to the determination of density contour clusters at level α:

c_p(α) = {x : p(x) ≥ α}.

Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy, in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ ℕ, define the typical set A_ε^{(ℓ)} with respect to p by

A_ε^{(ℓ)} = {(x_1, …, x_ℓ) ∈ S^ℓ : |−(1/ℓ) log p(x_1, …, x_ℓ) − h(X)| ≤ ε},

where p(x_1, …, x_ℓ) = ∏_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

vol A_ε^{(ℓ)} ≐ vol C_ℓ(1 − δ) ≐ 2^{ℓh}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
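This concentration is easy to illustrate numerically (a sketch, not from the paper): for the exponential density p(x) = e^{−x} on [0, ∞), the quantity −(1/ℓ) log p(x_1, …, x_ℓ) is simply the sample mean, which concentrates around the differential entropy h(X) = 1 nat, so long samples are typical with high probability. The sample length, tolerance ε, and trial count are assumptions.

```python
# Hedged illustration of the typical set for p(x) = e^{-x}: a sample is typical
# when its mean (= -(1/l) log p of the sample, in nats) is within eps of h = 1.
import numpy as np

rng = np.random.RandomState(6)
l, eps, trials = 1000, 0.2, 200
typical = 0
for _ in range(trials):
    x = rng.exponential(size=l)
    if abs(x.mean() - 1.0) <= eps:  # |-(1/l) log p(x_1..x_l) - h(X)| <= eps
        typical += 1
print("fraction of typical samples:", typical / trials)
```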
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/∥w∥. Therefore, the optimal hyperplane is obtained by minimizing ∥w∥ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin, and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).

Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x′_o = x_o + δ · w, where |δ| < ξ_o/∥w∥, leads to a slack that is still nonzero, that is, ξ′_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α′ = α is still feasible, as is the primal solution (w′, ξ′, ρ′). Here we use ξ′_i = (1 + δ · α_o)ξ_i for i ≠ o, w′ = (1 + δ · α_o)w, and ρ′ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α′_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.

Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).

Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional, but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l_∞ norm over a finite sample X = (x_1, …, x_ℓ) for the pseudonorm on F:

∥f∥_{l_∞(X)} = max_{x ∈ X} |f(x)|. (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose 𝒳 is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ 𝒳^ℓ} N(ε, F, l_∞(X)) (the maximum exists by definition of the covering number and compactness of 𝒳).
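A greedy construction gives a concrete (upper-bound) ε-cover in the sense of definition 3. This is a sketch for a finite point set under the Euclidean metric; the random point set is an assumption, and greedy covers are not minimal, so their sizes only upper-bound the covering number.

```python
# Hedged sketch: a greedy eps-cover of a finite set A; its size upper-bounds
# the covering number N(eps, A, d) of definition 3.
import numpy as np

rng = np.random.RandomState(8)
A = rng.rand(300, 2)  # finite subset of the unit square

def greedy_cover(points, eps):
    cover = []
    for a in points:
        # add a point to the cover only if it is not yet within eps of it
        if all(np.linalg.norm(a - u) > eps for u in cover):
            cover.append(a)
    return cover

sizes = {eps: len(greedy_cover(A, eps)) for eps in (0.5, 0.25, 0.1)}
print(sizes)  # coarser eps -> smaller cover
```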
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).

Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P{x : f(x) < min_{i ∈ [ℓ]} f(x_i) − 2γ} ≤ ε(ℓ, k, δ) = (2/ℓ)(k + log(ℓ/δ)),

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.

Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i ∈ [ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.

We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.

Definition 4. Let L(𝒳) be the set of real-valued, nonnegative functions f on 𝒳 with support supp(f) countable; that is, functions in L(𝒳) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(𝒳) by

f · g = Σ_{x ∈ supp(f)} f(x)g(x).

The 1-norm on L(𝒳) is defined by ∥f∥₁ = Σ_{x ∈ supp(f)} f(x), and let L_B(𝒳) = {f ∈ L(𝒳) : ∥f∥₁ ≤ B}. Now we define an embedding of 𝒳 into the inner product space 𝒳 × L(𝒳) via τ : 𝒳 → 𝒳 × L(𝒳), τ : x ↦ (x, δ_x), where δ_x ∈ L(𝒳) is defined by

δ_x(y) = 1 if y = x, and 0 otherwise.
For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(𝒳) by

g_f(y) = g^{X,θ}_f(y) = Σ_{x ∈ X} d(x, f, θ) δ_x(y).
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space 𝒳, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g^{X,θ}_f ∈ L_B(𝒳) (i.e., Σ_{x ∈ X} d(x, f, θ) ≤ B),

P{x : f(x) < θ − 2γ} ≤ (2/ℓ)(k + log(ℓ/δ)), (B.3)

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(𝒳), 2ℓ)⌉.
The assumption on P can in fact be weakened to require only that there is no point x ∈ 𝒳 satisfying f(x) < θ − 2γ that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).

The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.

The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.

In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(𝒳), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. (In Russian; German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
4.2 Initialization of the Algorithm. We start by setting a fraction ν of all α_i, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that ∑_i α_i = 1. Moreover, we set the initial ρ to max{O_i : i ∈ [ℓ], α_i > 0}.
4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SVnb for the indices of variables that are not at bound, that is, SVnb = {i : i ∈ [ℓ], 0 < α_i < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position.
We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that (O_i − ρ) · α_i > 0 or (ρ − O_i) · (1/(νℓ) − α_i) > 0. Once we have found one, say α_i, we pick α_j according to

  j = arg max_{n ∈ SVnb} |O_i − O_n|.  (4.8)

We repeat that step, but the scan is performed only over SVnb.

In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SVnb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates.
In unusual circumstances, the choice heuristic of equation 4.8 cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999), and have been found to work well in the experiments we report below.
In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
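The sweep structure just described can be sketched in a few dozen lines of NumPy. This is a toy reimplementation for illustration only, not the authors' code: the variable-selection rule is simplified to a full argmax (instead of the scan hierarchy and heuristic fallbacks), and the function name, defaults, and tolerances are ours.

```python
import numpy as np

def one_class_smo(X, nu=0.5, gamma=0.5, n_sweeps=200, tol=1e-10):
    """Toy SMO-style solver for the one-class SVM dual:
       minimize (1/2) a'Ka  subject to  0 <= a_i <= 1/(nu*l), sum_i a_i = 1."""
    l = len(X)
    C = 1.0 / (nu * l)
    K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Initialization (cf. section 4.2): a fraction nu of the alphas at the
    # bound, one alpha taking the fractional remainder so that sum(a) = 1.
    a = np.zeros(l)
    k = int(nu * l)
    a[:k] = C
    if k < l:
        a[k] = max(0.0, 1.0 - a[:k].sum())
    O = K @ a                                  # outputs O_i = sum_j a_j K_ij
    for _ in range(n_sweeps):
        progress = False
        for i in range(l):
            j = int(np.argmax(np.abs(O - O[i])))   # cf. heuristic (4.8)
            eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
            if j == i or eta < 1e-12:
                continue
            # Newton step on a_i with a_i + a_j held fixed, clipped to the box
            s = a[i] + a[j]
            ai = np.clip(a[i] - (O[i] - O[j]) / eta,
                         max(0.0, s - C), min(C, s))
            if abs(ai - a[i]) < tol:
                continue
            O += (ai - a[i]) * (K[:, i] - K[:, j])
            a[i], a[j] = ai, s - ai
            progress = True
        if not progress:
            break
    # Offset rho: output value at the non-bound SVs (margin points)
    nb = (a > 1e-7) & (a < C - 1e-7)
    rho = float(np.median(O[nb])) if nb.any() else float(O[a > 1e-7].min())
    return a, rho, O
```

Note that the box constraints alone already enforce a counting property: since ∑_i α_i = 1 and each α_i ≤ 1/(νℓ), at least νℓ coefficients must be nonzero and at most νℓ can sit at the upper bound (this argument reappears as proposition 3 in section 5).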
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset ρ at the end.
In the next section, it will be argued that subtracting something from ρ is actually advisable also from a statistical point of view.
2 This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2), and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4), and finally we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.
In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

  x_i = Φ(x_i).  (5.1)
Definition 1. A data set

  x_1, …, x_ℓ  (5.2)

is called separable if there exists some w ∈ F such that (w · x_i) > 0 for i ∈ [ℓ].
If we use a gaussian kernel (see equation 3.3), then any data set x_1, …, x_ℓ is separable after it is mapped into feature space. To see this, note that k(x_i, x_j) > 0 for all i, j; thus, all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they all have unit length. Hence, they are separable from the origin.
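This separability argument is easy to verify numerically. The following sketch (with an assumed width parameter; NumPy only) checks the two properties the argument rests on:

```python
import numpy as np

# Gram matrix of a gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2):
# all entries are strictly positive (mapped points share a single orthant),
# and the diagonal is 1 (every mapped point has unit length), so the mapped
# data are separable from the origin.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
assert (K > 0).all()                  # positive inner products
assert np.allclose(np.diag(K), 1.0)   # unit length in feature space
```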
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

  min_{w ∈ F} (1/2)‖w‖²  subject to  (w · x_i) ≥ ρ, i ∈ [ℓ].  (5.3)
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set

  {(x_1, 1), …, (x_ℓ, 1), (−x_1, −1), …, (−x_ℓ, −1)}.  (5.4)
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set

  {(x_1, y_1), …, (x_ℓ, y_ℓ)}  (y_i ∈ {±1} for i ∈ [ℓ]),  (5.5)

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set

  {y_1 x_1, …, y_ℓ x_ℓ}.  (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.
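Part (i) of proposition 2 amounts to a simple data transformation: the one-class training set is reflected through the origin and labeled. A minimal sketch (function name ours):

```python
import numpy as np

def reflect_to_binary(X):
    """Turn the one-class training set {x_i} into the labeled set
       {(x_i, +1)} U {(-x_i, -1)} of equation 5.4; an ordinary SVM through
       the origin on this set recovers the supporting hyperplane's w."""
    Xb = np.vstack([X, -X])
    yb = np.hstack([np.ones(len(X)), -np.ones(len(X))])
    return Xb, yb
```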
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x) which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
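The ν-property is easy to check empirically with any implementation of this single-class algorithm. As an illustration (a sketch, assuming scikit-learn is available: its `OneClassSVM` implements this formulation, and `nu` and `gamma` are that library's parameter names):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
nu = 0.4
clf = OneClassSVM(kernel="rbf", nu=nu, gamma=0.5).fit(X)

frac_sv = len(clf.support_) / len(X)             # fraction of SVs
frac_out = float((clf.predict(X) == -1).mean())  # fraction of outliers
# Proposition 3, parts i and ii (up to numerical slack on a finite sample):
assert frac_out <= nu + 0.05
assert frac_sv >= nu - 0.05
```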
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution
lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ R. For x ∈ X, let d(x, f, θ) = max{0, θ − f(x)}. Similarly, for a training sequence X = (x_1, …, x_ℓ), we define

  D(X, f, θ) = ∑_{x ∈ X} d(x, f, θ).
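In code, the slack quantities of definition 2 are one-liners (a sketch; the function names mirror the notation):

```python
def d(x, f, theta):
    """Slack of a single point: d(x, f, theta) = max{0, theta - f(x)}."""
    return max(0.0, theta - f(x))

def D(X, f, theta):
    """Total slack of a training sequence X = (x_1, ..., x_l)."""
    return sum(d(x, f, theta) for x in X)
```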
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ X^ℓ, generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution f_w given explicitly by equation 3.10. Let R_{w,ρ} = {x : f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

  P{x′ : x′ ∉ R_{w,ρ−γ}} ≤ (2/ℓ)(k + log(ℓ²/(2δ))),  (5.7)

where

  k = (c₁ log(c₂ γ̂² ℓ))/γ̂² + (2D/γ̂) log(e((2ℓ − 1)γ̂/(2D) + 1)) + 2,  (5.8)

c₁ = 16c², c₂ = ln(2)/(4c²), c = 103, γ̂ = γ/‖w‖, D = D(X, f_{w,0}, ρ) = D(X, f_{w,ρ}, 0), and ρ is given by equation 3.12.
The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}.
The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ, while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂, that is, it increases with ‖w‖. This justifies measuring the complexity of the estimated region by the size of w, and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value, ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
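The looseness can be seen concretely by evaluating the right-hand side of the bound numerically. The sketch below plugs in the constants of the theorem as stated (log to base 2; γ̂ and D are supplied by hand, and the function name is ours):

```python
import numpy as np

def bound_rhs(l, delta, gamma_hat, D):
    """Right-hand side of (5.7), with k from (5.8):
       c1 = 16c^2, c2 = ln(2)/(4c^2), c = 103."""
    c = 103.0
    c1, c2 = 16.0 * c**2, np.log(2.0) / (4.0 * c**2)
    k = (c1 * np.log2(c2 * gamma_hat**2 * l) / gamma_hat**2
         + (2.0 * D / gamma_hat)
           * np.log2(np.e * ((2.0 * l - 1.0) * gamma_hat / (2.0 * D) + 1.0))
         + 2.0)
    return (2.0 / l) * (k + np.log2(l**2 / (2.0 * delta)))

# Even with gamma_hat = D = 1, c2 is about 1.6e-5, so the first logarithm
# only becomes positive for sample sizes in the tens of thousands; the bound
# drops below 1 only for very large l.
```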
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples, and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which
3 Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
  ν, width c:     0.5, 0.5 | 0.5, 0.5 | 0.1, 0.5 | 0.5, 0.1
  frac. SVs/OLs:  0.54, 0.43 | 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38
  margin ρ/‖w‖:   0.84 | 0.70 | 0.62 | 0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
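The ν = 1 end of this family is easy to reproduce: all α_i sit at the bound 1/ℓ, so f reduces to a Parzen windows estimate, and the offset is then set by a quantile so that a fraction ν′ of the training patterns falls outside. A sketch with assumed parameter values (names ours):

```python
import numpy as np

def parzen_region(X, nu_prime=0.1, gamma=0.5):
    """Return (f, rho): the nu = 1 expansion f(z) = (1/l) sum_i k(x_i, z)
       (a Parzen windows estimate), with the offset rho chosen so that a
       fraction nu' of the training patterns lies outside {z : f(z) >= rho}."""
    def f(z):
        return float(np.exp(-gamma * ((X - z) ** 2).sum(-1)).mean())
    outputs = np.array([f(x) for x in X])
    rho = float(np.quantile(outputs, nu_prime))
    return f, rho
```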
[Figure 3 comprises two output-histogram panels; legend entries: test, train, other, offset.]
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0: output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fraction of outliers in the training set appears slightly larger than it should be.
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels: 6 9 2 8 1 8 8 6 5 3 / 2 3 8 7 0 3 0 8 2 7.
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers; with the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above, and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3% to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used: for the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
In addition to the experiments reported above, the algorithm has since
[Twenty outlier images; beneath each, the class label and the SVM output (in units of 10⁻⁵): 9/513, 1/507, 0/458, 1/377, 7/282, 2/216, 3/200, 9/186, 5/179, 0/162, 3/153, 6/143, 6/128, 0/123, 7/117, 5/93, 0/78, 7/58, 6/52, 3/48.]

Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult," in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set, Size ℓ = 2007.

  ν [%]   Fraction of OLs [%]   Fraction of SVs [%]   Training Time (CPU sec)
   1             0.0                  10.0                     36
   2             0.0                  10.0                     39
   3             0.1                  10.0                     31
   4             0.6                  10.1                     40
   5             1.4                  10.6                     36
   6             1.8                  11.2                     33
   7             2.6                  11.5                     42
   8             4.1                  12.0                     53
   9             5.4                  12.9                     76
  10             6.2                  13.7                     65
  20            16.9                  22.6                    193
  30            27.5                  31.8                    269
  40            37.1                  41.7                    685
  50            47.4                  51.2                   1284
  60            58.2                  61.0                   1150
  70            68.3                  70.7                   1512
  80            78.5                  80.5                   2206
  90            89.4                  90.1                   2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is the ν = 5 row.
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing: for small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy, and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region, and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:

  ∑_{i,j} α_i α_j k(x_i, x_j)
    = ∑_{i,j} α_i α_j (k(x_i, ·) · δ_{x_j}(·))
    = ∑_{i,j} α_i α_j (k(x_i, ·) · (P*P k)(x_j, ·))
    = ∑_{i,j} α_i α_j ((P k)(x_i, ·) · (P k)(x_j, ·))
    = ((P ∑_i α_i k(x_i, ·)) · (P ∑_j α_j k(x_j, ·)))
    = ‖Pf‖²,
using f(x) = ∑_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^{−‖Pf‖²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space, and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000), and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined, and the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, along with algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply
Ĉ_ℓ = ∪_{i=1}^{ℓ} B(x_i, ε_ℓ),  (A.1)
where B(x, ε) is the l2(X) ball of radius ε centered at x, and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), Ĉ_plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ_plug-in, they analyzed Ĉ_plug-in^b = {x : p̂_ℓ(x) > b_ℓ}, where (b_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to b_ℓ, but this connection cannot be readily exploited by this type of estimator.
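The estimator of equation A.1 is simple enough to state in code. A minimal sketch in Python (the sample points and radius below are illustrative, not from the text):

```python
import math

def in_support_estimate(x, sample, eps):
    """Membership test for the naive estimator of equation A.1:
    the union of l2 balls B(x_i, eps) centered at the sample points."""
    return any(math.dist(x, xi) <= eps for xi in sample)

# Hypothetical two-dimensional sample.
sample = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.8)]
print(in_support_estimate((0.55, 0.45), sample, eps=0.2))  # True
print(in_support_estimate((2.0, 2.0), sample, eps=0.2))    # False
```

Consistency then hinges entirely on how the radius sequence (ε_ℓ)_ℓ shrinks with the sample size.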
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).
Define the excess mass over 𝒞 at level α as

E_𝒞(α) = sup {H_α(C) : C ∈ 𝒞},

where H_α(·) = (P − αλ)(·) and, again, λ denotes Lebesgue measure. Any set Γ_𝒞(α) ∈ 𝒞 such that

E_𝒞(α) = H_α(Γ_𝒞(α))

is called a generalized α-cluster in 𝒞. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E_𝒞^ℓ(α) and Γ_𝒞^ℓ(α). In other words, his estimator is

Γ_𝒞^ℓ(α) = arg max {(P_ℓ − αλ)(C) : C ∈ 𝒞},
where the max is not necessarily unique. Now while Γ_𝒞^ℓ(α) is clearly different from Ĉ_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, Γ_𝒞^ℓ(α) is more closely related to the determination of density contour clusters at level α:

c_p(α) = {x : p(x) ≥ α}.
Simultaneously, and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A_ε^(ℓ) with respect to p by

A_ε^(ℓ) = {(x_1, ..., x_ℓ) ∈ S^ℓ : | −(1/ℓ) log p(x_1, ..., x_ℓ) − h(X) | ≤ ε},

where p(x_1, ..., x_ℓ) = ∏_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log (a_ℓ / b_ℓ) = 0. Cover and Thomas (1991) show that, for all ε, δ < 1/2,

vol A_ε^(ℓ) ≐ vol Ĉ_ℓ(1 − δ) ≐ 2^{ℓh}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
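To make this interpretation concrete, consider X uniform on [0, a]: its differential entropy in bits is h(X) = log2 a, so the side length 2^h recovers exactly the length a of the minimum volume support set. A small sketch (the uniform example is ours, not from the text):

```python
import math

def entropy_uniform(a):
    """Differential entropy (base 2) of the uniform density on [0, a]:
    h(X) = -integral of (1/a) * log2(1/a) over [0, a] = log2(a)."""
    return math.log2(a)

a = 8.0
h = entropy_uniform(a)
print(h)       # 3.0
print(2 ** h)  # 8.0 -- the side length 2^h equals the support length a
```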
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).
Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore, the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).
Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i (w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x'_o = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ'_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α' = α is still feasible, as is the primal solution (w', ξ', ρ'). Here, we use ξ'_i = (1 + δ · α_o) ξ_i for i ≠ o, w' = (1 + δ · α_o) w, and ρ' as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α'_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ norm over a finite sample X = (x_1, ..., x_ℓ) for the pseudonorm on F,

‖f‖_{l∞^X} = max_{x ∈ X} |f(x)|.  (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose 𝒳 is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ 𝒳^ℓ} N(ε, F, l∞^X) (the maximum exists by definition of the covering number and compactness of 𝒳).
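For a finite function class and a fixed sample, a covering number with respect to the l∞ pseudometric can be bounded directly; the sketch below computes a greedy upper bound. The class of threshold functions is an illustrative choice of ours, not from the text:

```python
def linf_dist(f, g, sample):
    """l-infinity pseudometric over a finite sample: max_x |f(x) - g(x)|."""
    return max(abs(f(x) - g(x)) for x in sample)

def greedy_cover_size(functions, sample, eps):
    """Greedy upper bound on the eps-covering number N(eps, F, l_inf^X):
    add a function as a new cover center only if no existing center
    is within eps of it."""
    centers = []
    for f in functions:
        if all(linf_dist(f, c, sample) > eps for c in centers):
            centers.append(f)
    return len(centers)

# Hypothetical class of threshold functions, evaluated on a small sample.
sample = [0.0, 0.25, 0.5, 0.75, 1.0]
functions = [lambda x, t=t: 1.0 if x >= t else 0.0 for t in (0.1, 0.12, 0.6, 0.62)]
print(greedy_cover_size(functions, sample, eps=0.5))  # 2: nearby thresholds collapse
```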
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on 𝒳 and a sturdy real-valued function class F on 𝒳. Suppose x_1, ..., x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P{x : f(x) < min_{i ∈ [ℓ]} f(x_i) − 2γ} ≤ ε(ℓ, k, δ) = (2/ℓ)(k + log(ℓ/δ)),

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i ∈ [ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.
Definition 4. Let L(𝒳) be the set of real-valued, nonnegative functions f on 𝒳 with support supp(f) countable, that is, functions in L(𝒳) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(𝒳) by

f · g = Σ_{x ∈ supp(f)} f(x) g(x).

The 1-norm on L(𝒳) is defined by ‖f‖_1 = Σ_{x ∈ supp(f)} f(x), and let L_B(𝒳) = {f ∈ L(𝒳) : ‖f‖_1 ≤ B}. Now we define an embedding of 𝒳 into the inner product space 𝒳 × L(𝒳) via τ : 𝒳 → 𝒳 × L(𝒳), τ : x ↦ (x, δ_x), where δ_x ∈ L(𝒳) is defined by

δ_x(y) = 1 if y = x; 0 otherwise.
For a function f ∈ F, a set of training examples X, and η ∈ R, we define the function g_f ∈ L(𝒳) by

g_f(y) = g_f^{X,η}(y) = Σ_{x ∈ X} d(x, f, η) δ_x(y).
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space 𝒳, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any η such that g_f = g_f^{X,η} ∈ L_B(𝒳) (i.e., Σ_{x ∈ X} d(x, f, η) ≤ B),

P{x : f(x) < η − 2γ} ≤ (2/ℓ)(k + log(ℓ/δ)),  (B.3)

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(𝒳), 2ℓ)⌉.
The assumption on P can in fact be weakened to require only that there is no point x ∈ 𝒳 satisfying f(x) < η − 2γ that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than η − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(𝒳), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and Their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
5 Theory
We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2) and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4), and finally we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B.
In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

x_i = Φ(x_i).  (5.1)
Definition 1. A data set

x_1, ..., x_ℓ  (5.2)

is called separable if there exists some w ∈ F such that (w · x_i) > 0 for i ∈ [ℓ].
If we use a gaussian kernel (see equation 3.3), then any data set x_1, ..., x_ℓ is separable after it is mapped into feature space. To see this, note that k(x_i, x_j) > 0 for all i, j; thus, all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they all have unit length. Hence, they are separable from the origin.
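This argument can be checked numerically. A small sketch with the gaussian kernel k(x, y) = exp(−‖x − y‖²/c) of equation 3.3 (the sample points and the width c below are arbitrary):

```python
import math

def gaussian_kernel(x, y, c):
    """k(x, y) = exp(-||x - y||^2 / c), the gaussian kernel of equation 3.3."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

patterns = [(0.0, 0.0), (1.0, 2.0), (-3.0, 0.5)]
gram = [[gaussian_kernel(x, y, c=2.0) for y in patterns] for x in patterns]

# All inner products positive -> mapped patterns lie in a single orthant;
# k(x, x) = 1 -> they have unit length. Hence they are separable from the origin.
assert all(gram[i][j] > 0 for i in range(3) for j in range(3))
assert all(gram[i][i] == 1.0 for i in range(3))
print("separability conditions hold")
```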
Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

min_{w ∈ F} (1/2) ‖w‖²  subject to  (w · x_i) ≥ ρ, i ∈ [ℓ].  (5.3)
The following result elucidates the relationship between single-class classification and binary classification.
Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set

{(x_1, +1), ..., (x_ℓ, +1), (−x_1, −1), ..., (−x_ℓ, −1)}.  (5.4)
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set

{(x_1, y_1), ..., (x_ℓ, y_ℓ)}  (y_i ∈ {±1} for i ∈ [ℓ]),  (5.5)

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set

{y_1 x_1, ..., y_ℓ x_ℓ}.  (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.
The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x), which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space 𝒳. Fix η ∈ R. For x ∈ 𝒳, let d(x, f, η) = max{0, η − f(x)}. Similarly, for a training sequence X = (x_1, ..., x_ℓ), we define

D(X, f, η) = Σ_{x ∈ X} d(x, f, η).
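In code, d and D are one-liners. A sketch with a hypothetical decision function (the identity) and threshold η = 1 (both illustrative choices of ours):

```python
def d(x, f, eta):
    """d(x, f, eta) = max(0, eta - f(x)): how far f(x) falls below eta."""
    return max(0.0, eta - f(x))

def D(sample, f, eta):
    """D(X, f, eta): total shortfall over the training sequence X."""
    return sum(d(x, f, eta) for x in sample)

f = lambda x: x
sample = [0.2, 0.6, 1.5, 2.0]
print(d(0.2, f, 1.0))                # 0.8 (this point falls below the threshold)
print(d(2.0, f, 1.0))                # 0.0 (comfortably above)
print(round(D(sample, f, 1.0), 10))  # 1.2 -- the slack sum entering theorem 1
```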
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ 𝒳^ℓ, generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution f_w, given explicitly by equation 3.10. Let R_{w,ρ} = {x : f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ 𝒳^ℓ, for any γ > 0,

P{x' : x' ∉ R_{w,ρ−γ}} ≤ (2/ℓ) (k + log (ℓ²/(2δ))),  (5.7)

where

k = (c_1 log(c_2 γ̂² ℓ))/γ̂² + (2D/γ̂) log(e((2ℓ − 1)γ̂/(2D) + 1)) + 2,  (5.8)

c_1 = 16c², c_2 = ln(2)/(4c²), c = 103, γ̂ = γ/‖w‖, D = D(X, f_{w,0}, ρ) = D(X, f_{w,ρ}, 0), and ρ is given by equation 3.12.
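To get a feel for the magnitudes involved, one can evaluate the right-hand side of equation 5.7 numerically (log taken to base 2, as stated above; the values of ℓ, γ̂, D, and δ below are illustrative, not from the text):

```python
import math

def theorem1_rhs(ell, gamma_hat, D_slack, delta, c=103.0):
    """Right-hand side of equation 5.7, with k as in equation 5.8."""
    c1 = 16.0 * c ** 2
    c2 = math.log(2.0) / (4.0 * c ** 2)
    k = (c1 * math.log2(c2 * gamma_hat ** 2 * ell) / gamma_hat ** 2
         + (2.0 * D_slack / gamma_hat)
         * math.log2(math.e * ((2.0 * ell - 1.0) * gamma_hat / (2.0 * D_slack) + 1.0))
         + 2.0)
    return (2.0 / ell) * (k + math.log2(ell ** 2 / (2.0 * delta)))

print(theorem1_rhs(ell=10 ** 8, gamma_hat=0.1, D_slack=1.0, delta=0.05))
```

Even for ℓ = 10⁸ the value exceeds 1, that is, the bound is vacuous at this scale with the stated constants, in line with the remark below that c is likely too large by a substantial factor.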
The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}.

The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂; that is, it increases with w. This justifies measuring the complexity of the estimated region by the size of w and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3) that has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).3 We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness," which
3 Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
ν, width c        0.5, 0.5    0.5, 0.5    0.1, 0.5    0.5, 0.1
frac. SVs/OLs     0.54, 0.43  0.59, 0.47  0.24, 0.03  0.65, 0.38
margin ρ/‖w‖      0.84        0.70        0.62        0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction ν of the kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
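The offset adjustment described in the caption (choose ρ so that a fraction ν′ of the training points lies outside) can be sketched in one dimension; the kernel form follows equation 3.3, while the width and data below are illustrative:

```python
import math

def parzen(x, train, c):
    """Parzen windows estimate: the average of gaussian kernels centered
    on the training points (the nu = 1 kernel expansion discussed above)."""
    return sum(math.exp(-(x - xi) ** 2 / c) for xi in train) / len(train)

def offset_for_fraction(train, c, nu_prime):
    """Pick the offset rho so that roughly a fraction nu_prime of the
    training points falls outside the estimated region."""
    scores = sorted(parzen(x, train, c) for x in train)
    return scores[int(nu_prime * len(train))]

train = [0.0, 0.1, 0.2, 0.3, 0.9, 1.0, 1.1, 1.2, 1.3, 5.0]
rho = offset_for_fraction(train, c=0.5, nu_prime=0.1)
inside = [x for x in train if parzen(x, train, c=0.5) >= rho]
print(len(inside))  # 9 of 10 points lie inside; the isolated point at 5.0 is out
```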
Estimating the Support of a High-Dimensional Distribution 1457
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0: output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels (top row: 6 9 2 8 1 8 8 6 5 3; bottom row: 2 3 8 7 0 3 0 8 2 7).
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3% to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
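The way such scaling results are read off (the exponent is the slope of a log-log plot of cost against sample size) can be illustrated with a deterministic stand-in for the solver. The sketch below counts the inner-loop operations of a dummy quadratic-cost routine instead of timing the SMO solver itself; it is our illustration of the read-off, not the paper's experiment.

```python
import numpy as np

def count_ops(n):
    """Deterministic stand-in workload with exactly quadratic cost:
    performs n*(n-1)/2 inner-loop operations."""
    ops = 0
    for i in range(n):
        for _ in range(i):
            ops += 1
    return ops

sizes = np.array([64, 128, 256, 512, 1024])
costs = np.array([count_ops(int(n)) for n in sizes], dtype=float)

# Plot log(cost) against log(size), as in Figure 6: the scaling
# exponent is the slope of the resulting (nearly) straight line.
slope = np.polyfit(np.log2(sizes), np.log2(costs), 1)[0]   # close to 2
```

Replacing the dummy workload with wall-clock timings of an actual solver gives the kind of empirical exponents reported in Figure 6.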
In addition to the experiments reported above, the algorithm has since
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. [Label/output pairs, left to right: 9/−513, 1/−507, 0/−458, 1/−377, 7/−282, 2/−216, 3/−200, 9/−186, 5/−179, 0/−162, 3/−153, 6/−143, 6/−128, 0/−123, 7/−117, 5/−93, 0/−78, 7/−58, 6/−52, 3/−48.] Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set (Size ℓ = 2007).

ν [%]   Fraction of OLs [%]   Fraction of SVs [%]   Training Time (CPU sec)
  1           0.0                  10.0                    36
  2           0.0                  10.0                    39
  3           0.1                  10.0                    31
  4           0.6                  10.1                    40
  5*          1.4                  10.6                    36
  6           1.8                  11.2                    33
  7           2.6                  11.5                    42
  8           4.1                  12.0                    53
  9           5.4                  12.9                    76
 10           6.2                  13.7                    65
 20          16.9                  22.6                   193
 30          27.5                  31.8                   269
 40          37.1                  41.7                   685
 50          47.4                  51.2                  1284
 60          58.2                  61.0                  1150
 70          68.3                  70.7                  1512
 80          78.5                  80.5                  2206
 90          89.4                  90.1                  2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments (ν = 5%) is marked with an asterisk.
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz; training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Scholkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
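The invariance just described can be written out explicitly; the following is the standard change-of-variables identity (our addition for concreteness, not an equation from the paper):

```latex
% Under a smooth, invertible transformation y = \varphi(x), the density value
% at a point changes by a Jacobian factor, while the probability mass of a
% region is invariant, since the region is transformed along with the density:
\[
  p_Y(y) = p_X\bigl(\varphi^{-1}(y)\bigr)\,\bigl|\det J_{\varphi^{-1}}(y)\bigr|,
  \qquad
  P(C) = \int_C p_X(x)\,dx \;=\; \int_{\varphi(C)} p_Y(y)\,dy .
\]
```

This is why an estimator of region measures sidesteps the coordinate dependence that plagues pointwise density values.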
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Scholkopf, & Muller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:

\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)
  = \sum_{i,j} \alpha_i \alpha_j \bigl( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \bigr)
  = \sum_{i,j} \alpha_i \alpha_j \bigl( k(x_i, \cdot) \cdot (P^{*}P\, k)(x_j, \cdot) \bigr)
  = \sum_{i,j} \alpha_i \alpha_j \bigl( (P k)(x_i, \cdot) \cdot (P k)(x_j, \cdot) \bigr)
  = \Bigl( \Bigl( P \sum_i \alpha_i k(x_i, \cdot) \Bigr) \cdot \Bigl( P \sum_j \alpha_j k(x_j, \cdot) \Bigr) \Bigr)
  = \lVert P f \rVert^2,

using f(x) = \sum_i \alpha_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) \propto e^{-\lVert Pf \rVert^2} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Scholkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
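For readers who want to experiment, single-class SVMs of this kind are available in standard toolkits; the sketch below uses scikit-learn's OneClassSVM (assuming that library is available; the kernel width is an arbitrary choice of ours) to illustrate the handle that ν gives on the fractions of outliers and SVs.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # unlabeled training sample

nu = 0.2
# Gaussian (RBF) kernel, as in equation 3.3; gamma is an illustrative width.
clf = OneClassSVM(nu=nu, kernel="rbf", gamma=0.5).fit(X)

frac_sv = len(clf.support_) / len(X)                      # fraction of SVs
frac_out = float(np.mean(clf.decision_function(X) < 0))   # fraction of outliers

# nu-property (proposition 3): nu upper-bounds the fraction of outliers
# and lower-bounds the fraction of SVs, up to numerical slack.
```

Rerunning with different values of ν reproduces the qualitative behavior of Table 1: both fractions track ν.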
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined and the one of Vapnik and Chapelle (2000) will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Scholkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply

\hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), (A.1)

where B(x, ε) is the l₂(X) ball of radius ε centered at x, and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of equation A.1 with respect to the symmetric difference between C(1) and \hat{C}_\ell. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1): \hat{C}_{plug-in} = {x : \hat{p}_\ell(x) > 0}, where \hat{p}_\ell is a kernel density estimator. In order to avoid problems with \hat{C}_{plug-in}, they analyzed \hat{C}^{b}_{plug-in} = {x : \hat{p}_\ell(x) > b_\ell}, where (b_\ell)_\ell is an appropriately chosen sequence. Clearly, for a given distribution, α is related to b, but this connection cannot be readily exploited by this type of estimator.
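The nonparametric estimator of equation A.1 is straightforward to realize for a finite sample: a point belongs to the estimate iff it lies within distance ε of some training point. A minimal numpy sketch (sample size, radius, and query points are illustrative choices of ours):

```python
import numpy as np

def in_support_estimate(x, sample, eps):
    """Membership test for the union-of-balls estimator (A.1):
    x is in the estimate iff it lies within l2-distance eps of
    some sample point, i.e. in the union of balls B(x_i, eps)."""
    sample = np.asarray(sample, dtype=float)
    x = np.asarray(x, dtype=float)
    return bool((np.linalg.norm(sample - x, axis=1) <= eps).any())

rng = np.random.default_rng(1)
sample = rng.uniform(0.0, 1.0, size=(500, 2))  # true support: unit square
eps = 0.15                                     # one element of the sequence (eps_l)

center_in = in_support_estimate([0.5, 0.5], sample, eps)
far_out = in_support_estimate([3.0, 3.0], sample, eps)
```

Shrinking ε with the sample size, as the decreasing sequence (ε_ℓ)_ℓ prescribes, tightens the estimate around the true support.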
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Muller, 1992) to construct an estimator of "generalized α-clusters," related to C(α).

Define the excess mass over C at level α as

E_C(\alpha) = \sup \{ H_\alpha(C) : C \in C \},

where H_α(·) = (P − αλ)(·), and again λ denotes Lebesgue measure. Any set C_C(α) ∈ C such that

E_C(\alpha) = H_\alpha(C_C(\alpha))

is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts \hat{E}_C(α) and \hat{C}_C(α). In other words, his estimator is

\hat{C}_C(\alpha) = \arg\max \{ (P_\ell - \alpha\lambda)(C) : C \in C \},

where the max is not necessarily unique. Now, while \hat{C}_C(α) is clearly different from \hat{C}_\ell(α), it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, \hat{C}_C(α) is more closely related to the determination of density contour clusters at level α:

c_p(\alpha) = \{ x : p(x) \geq \alpha \}.
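In one dimension, with C the class of closed intervals with data-point endpoints, the empirical excess-mass maximizer can be computed by brute force. The sketch below is our illustration under these assumptions; the data and the level α are made up.

```python
import numpy as np

def excess_mass_interval(points, alpha):
    """Empirical generalized alpha-cluster in 1-D, with C the class of
    closed intervals: maximize (P_l - alpha*lambda)(C) by brute force
    over all intervals [points_i, points_j]."""
    xs = np.sort(np.asarray(points, dtype=float))
    l = len(xs)
    best, best_iv = -np.inf, None
    for i in range(l):
        for j in range(i, l):
            mass = (j - i + 1) / l          # empirical measure P_l
            length = xs[j] - xs[i]          # Lebesgue measure lambda
            val = mass - alpha * length
            if val > best:
                best, best_iv = val, (xs[i], xs[j])
    return best_iv, best

rng = np.random.default_rng(2)
cluster = rng.uniform(0.0, 1.0, size=50)   # dense cluster on [0, 1]
stray = rng.uniform(9.0, 10.0, size=5)     # sparse stray points
(lo, hi), _ = excess_mass_interval(np.concatenate([cluster, stray]), alpha=0.5)
```

The maximizing interval hugs the dense cluster and excludes the stray points: including them would add little empirical mass at a large Lebesgue-measure cost.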
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A_ε^{(ℓ)} with respect to p by

A_\epsilon^{(\ell)} = \{ (x_1, \ldots, x_\ell) \in S^\ell : | -\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X) | \leq \epsilon \},

where p(x_1, ..., x_ℓ) = ∏_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log (a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

vol A_\epsilon^{(\ell)} \doteq vol \hat{C}_\ell(1 - \delta) \doteq 2^{\ell h}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
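The identity quoted from Cover and Thomas rests on the asymptotic equipartition property: −(1/ℓ) log₂ p(x₁, ..., x_ℓ) concentrates around h(X). A quick Monte Carlo check for a standard normal density (our example, not the book's) makes this concrete; here h = (1/2) log₂(2πe) ≈ 2.05 bits.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
l = 2000
x = rng.normal(size=l)   # i.i.d. sample from the standard normal

# log2 of the standard normal density at each sample point:
# log2 p(x) = -0.5*log2(2*pi) - x^2 / (2*ln 2)
log2_p = -0.5 * math.log2(2 * math.pi) - x**2 / (2 * math.log(2))

empirical = -log2_p.mean()                    # -(1/l) log2 p(x_1, ..., x_l)
h = 0.5 * math.log2(2 * math.pi * math.e)     # differential entropy in bits
```

With ℓ = 2000 the empirical normalized log-probability already sits within a few hundredths of a bit of h, which is the concentration the typical-set construction exploits.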
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).

Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Scholkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x′_o := x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ′_o > 0; hence we still have α_o = 1/(νℓ). Therefore, α′ = α is still feasible, as is the primal solution (w′, ξ′, ρ′). Here, we use ξ′_i = (1 + δ · α_o)ξ_i for i ≠ o, w′ = (1 + δ · α_o)w, and ρ′ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α′_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, let A be a subset of X, and let ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
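A cover satisfying this definition can always be produced greedily, which also yields a computable upper bound on the covering number. A small sketch on a finite subset of the line (the point set and metric are our illustrative choices):

```python
def greedy_cover(points, eps, d):
    """Construct an eps-cover greedily: scan the points and add one as a
    new center whenever it is not within eps of an existing center.
    The result is a valid (if not necessarily minimal) cover, so its
    size upper-bounds the covering number N(eps, points, d)."""
    centers = []
    for p in points:
        if all(d(p, c) > eps for c in centers):
            centers.append(p)
    return centers

points = list(range(10))                    # the set A = {0, 1, ..., 9}
d = lambda a, b: abs(a - b)                 # metric on the line
cover = greedy_cover(points, eps=1.0, d=d)  # centers 0, 2, 4, 6, 8
```

Note that the "less than or equal to" convention of definition 3 is exactly what makes each point within distance ε of its covering center count as covered.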
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ norm over a finite sample X = (x_1, ..., x_ℓ) for the pseudonorm on F:

\| f \|_{l_\infty^X} = \max_{x \in X} | f(x) |. (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose X is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ X^ℓ} N(ε, F, l_∞^X) (the maximum exists by definition of the covering number and compactness of X).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, ..., x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P\Bigl\{ x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma \Bigr\} \leq \epsilon(\ell, k, \delta) = \frac{2}{\ell}\Bigl( k + \log\frac{\ell}{\delta} \Bigr),

where k = ⌈log N(γ, F, 2ℓ)⌉.
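To get a feeling for the bound, one can plug in numbers; the helper below takes the logarithm as natural, an assumption made for illustration (the base in the theorem follows the covering-number definition used in the proof).

```python
import math

def bound(k, l, delta):
    """Right-hand side epsilon(l, k, delta) = (2/l) * (k + log(l/delta))
    of theorem 2; the log base (natural here) is an assumption."""
    return (2.0 / l) * (k + math.log(l / delta))

# The bound tightens as the sample grows and loosens with capacity k:
b_small = bound(k=50, l=500, delta=0.05)
b_large = bound(k=50, l=5000, delta=0.05)
b_rich  = bound(k=500, l=500, delta=0.05)
```

The qualitative behavior is the familiar one: more samples shrink the failure probability roughly like 1/ℓ, while richer function classes (larger log covering number k) inflate it.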
The proof, which is given in Scholkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i∈[ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.

Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with countable support supp(f); that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

f \cdot g = \sum_{x \in supp(f)} f(x) g(x).

The 1-norm on L(X) is defined by ‖f‖₁ = Σ_{x∈supp(f)} f(x), and we let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

\delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases}

For a function f ∈ F, a set of training examples X, and h ∈ R, we define the function g_f ∈ L(X) by

g_f(y) = g^{X,h}_f(y) = \sum_{x \in X} d(x, f, h)\, \delta_x(y).
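Functions in L(X) with finite support are naturally represented as sparse mappings; the following is a minimal sketch of the inner product, 1-norm, and δ_x of definition 4 (the dictionary representation is our own choice for illustration):

```python
def l_inner(f, g):
    """Inner product on L(X) for finitely supported, nonnegative
    functions represented as {point: value} dictionaries."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    """The 1-norm: the sum of the (nonnegative) function values."""
    return sum(f.values())

def delta(x):
    """delta_x from the embedding tau: x -> (x, delta_x)."""
    return {x: 1.0}

# The inner product with delta_x recovers the value f(x); points
# outside supp(f) contribute 0.
f = {0.0: 2.0, 1.5: 1.0}
```

In particular, f · δ_x = f(x), which is exactly the evaluation property the embedding exploits.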
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any h such that g_f = g^{X,h}_f ∈ L_B(X) (i.e., Σ_{x∈X} d(x, f, h) ≤ B),

P\{ x : f(x) < h - 2\gamma \} \leq \frac{2}{\ell}\Bigl( k + \log\frac{\ell}{\delta} \Bigr), (B.3)

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.

(The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < h − 2γ that has discrete probability.) The proof is given in Scholkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than h − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Scholkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Scholkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnorr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Scholkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Muller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Scholkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Scholkopf, B., Smola, A., & Muller, K.-R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Scholkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Scholkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Scholkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Scholkopf, B., & Muller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Scholkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Scholkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. (In Russian; German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Scholkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Scholkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set

{(x_1, y_1), …, (x_ℓ, y_ℓ)}  (y_i ∈ {±1} for i ∈ [ℓ]),  (5.5)

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/‖w‖ is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set

{y_1 x_1, …, y_ℓ x_ℓ}.  (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds cum grano salis for the training sets with margin errors and outliers, respectively, removed.

The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.
Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x) that does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.
Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.
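The ν-property is easy to observe empirically. As an illustration (our sketch, not part of the original experiments), scikit-learn's OneClassSVM implements this single-class algorithm with the same ν-parameterization; the sample size, kernel width, and value of ν below are arbitrary choices:

```python
# Empirical check of the nu-property (proposition 3) using scikit-learn's
# OneClassSVM: nu upper-bounds the outlier fraction and lower-bounds the
# fraction of support vectors (up to solver tolerance).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # toy sample from a 2D gaussian

nu = 0.2
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)

# predict() returns +1 inside the estimated region, -1 outside (outliers)
frac_outliers = float(np.mean(clf.predict(X) == -1))
frac_svs = len(clf.support_) / len(X)

print(frac_outliers, frac_svs)
```

For data without discrete components, part iii of the proposition says the two printed fractions both approach ν as the sample grows.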
Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.

Note that although the hyperplane does not change, its parameterization in w and ρ does.
We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution
lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ R. For x ∈ X, let d(x, f, θ) = max{0, θ − f(x)}. Similarly, for a training sequence X = (x_1, …, x_ℓ), we define

D(X, f, θ) = Σ_{x∈X} d(x, f, θ).
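In plain code, definition 2 amounts to summing margin shortfalls below the threshold θ (a minimal sketch; the one-dimensional score f below is a hypothetical stand-in for the learned function):

```python
# Margin-slack quantities from definition 2: d(x, f, theta) measures how far
# f(x) falls short of the threshold theta; D sums these shortfalls over X.
def d(x, f, theta):
    return max(0.0, theta - f(x))

def D(X, f, theta):
    return sum(d(x, f, theta) for x in X)

# Toy example with a hypothetical one-dimensional score f(x) = x.
f = lambda x: x
X = [0.5, 1.2, 2.0]
# With theta = 1, only x = 0.5 falls short of the threshold, by 0.5.
print(D(X, f, 1.0))  # 0.5
```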
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ X^ℓ, generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or, equivalently, equation 3.11), and obtain a solution f_w given explicitly by equation 3.10. Let R_{w,ρ} = {x : f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

P{x′ : x′ ∉ R_{w,ρ−γ}} ≤ (2/ℓ)(k + log(ℓ²/(2δ))),  (5.7)

where

k = c₁ log(c₂ γ̂² ℓ) / γ̂² + (2D/γ̂) log(e((2ℓ − 1)γ̂/(2D) + 1)) + 2,  (5.8)

c₁ = 16c², c₂ = ln(2)/(4c²), c = 103, γ̂ = γ/‖w‖, D = D(X, f_{w,0}, ρ) = D(X, f_{w,ρ}, 0), and ρ is given by equation 3.12.
The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}.

The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region to which the bound applies will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂; that is, it increases with w. This justifies measuring the complexity of the estimated region by the size of w and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value ρ − γ (with γ > 0).

We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
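For concreteness, the right-hand side of the bound is straightforward to evaluate numerically. The sketch below (ours, not from the original) simply transcribes equations 5.7 and 5.8, with log to base 2 per the convention above; as the text notes, the constants make the bound loose, so the printed values are illustrative only:

```python
# Numerical evaluation of the right-hand side of equation 5.7, with k from
# equation 5.8. Variable names mirror the theorem's symbols; the constants
# are loose, so the values are illustrative rather than practically useful.
import math

def bound(ell, gamma_hat, D, delta, c=103.0):
    c1 = 16 * c ** 2
    c2 = math.log(2) / (4 * c ** 2)
    k = (c1 * math.log2(c2 * gamma_hat ** 2 * ell) / gamma_hat ** 2
         + (2 * D / gamma_hat)
         * math.log2(math.e * ((2 * ell - 1) * gamma_hat / (2 * D) + 1))
         + 2)
    return (2 / ell) * (k + math.log2(ell ** 2 / (2 * delta)))

# Relaxing the confidence requirement (larger delta) shrinks the bound.
print(bound(10**6, 1.0, 1.0, 0.01), bound(10**6, 1.0, 1.0, 0.1))
```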
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which

³ Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
ν, width c: 0.5, 0.5 | 0.5, 0.5 | 0.1, 0.5 | 0.5, 0.1
frac. SVs/OLs: 0.54, 0.43 | 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38
margin ρ/‖w‖: 0.84 | 0.70 | 0.62 | 0.48

Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0: output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
Figure 4: A subset of 20 examples randomly drawn from the USPS test set, with class labels: 6 9 2 8 1 8 8 6 5 3 / 2 3 8 7 0 3 0 8 2 7.
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers; with the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3 to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
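The label-augmentation step can be sketched as follows (our reconstruction; the array sizes match the USPS setup, but the helper name is hypothetical):

```python
# Sketch of the label-augmentation trick used in the novelty-detection
# experiment: append a one-hot encoding of the (possibly wrong) class label
# to each input pattern, so that "usual pattern, unusual label" becomes
# detectable as an outlier by a single-class method.
import numpy as np

def augment_with_labels(X, y, n_classes=10):
    one_hot = np.zeros((len(y), n_classes))
    one_hot[np.arange(len(y)), y] = 1.0
    return np.hstack([X, one_hot])

X = np.random.rand(5, 256)      # e.g., 16x16 digit images, flattened
y = np.array([3, 0, 7, 7, 9])   # alleged class labels
Xa = augment_with_labels(X, y)
print(Xa.shape)  # (5, 266)
```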
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size that is at most quadratic.
In addition to the experiments reported above, the algorithm has since
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. In reading order, the label/output pairs are: 9: 513, 1: 507, 0: 458, 1: 377, 7: 282, 2: 216, 3: 200, 9: 186, 5: 179, 0: 162, 3: 153, 6: 143, 6: 128, 0: 123, 7: 117, 5: 93, 0: 78, 7: 58, 6: 52, 3: 48. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set, Size ℓ = 2007.

ν [%] | Fraction of OLs [%] | Fraction of SVs [%] | Training Time (CPU sec)
1 | 0.0 | 10.0 | 36
2 | 0.0 | 10.0 | 39
3 | 0.1 | 10.0 | 31
4 | 0.6 | 10.1 | 40
5 | 1.4 | 10.6 | 36
6 | 1.8 | 11.2 | 33
7 | 2.6 | 11.5 | 42
8 | 4.1 | 12.0 | 53
9 | 5.4 | 12.9 | 76
10 | 6.2 | 13.7 | 65
20 | 16.9 | 22.6 | 193
30 | 27.5 | 31.8 | 269
40 | 37.1 | 41.7 | 685
50 | 47.4 | 51.2 | 1284
60 | 58.2 | 61.0 | 1150
70 | 68.3 | 70.7 | 1512
80 | 78.5 | 80.5 | 2206
90 | 89.4 | 90.1 | 2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is the ν = 5 row (shown in boldface in the original).
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
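A rough analogue of this scaling experiment can be run with any single-class SVM implementation; the sketch below (ours) uses scikit-learn's OneClassSVM as a stand-in for the SMO solver of section 4. Subset sizes and kernel parameters are arbitrary, and wall-clock times are machine-dependent:

```python
# Train on nested subsets and examine how fit time grows with sample size.
# The log-log slope estimates the scaling exponent (roughly linear in the
# sample size for small nu, per the discussion in the text).
import time
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(2000, 16)

sizes = [250, 500, 1000]
times = []
for n in sizes:
    t0 = time.perf_counter()
    OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X[:n])
    times.append(time.perf_counter() - t0)

slope = np.polyfit(np.log(sizes), np.log(times), 1)[0]
print(times, slope)
```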
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
Σ_{i,j} α_i α_j k(x_i, x_j) = Σ_{i,j} α_i α_j (k(x_i, ·) · δ_{x_j}(·))
= Σ_{i,j} α_i α_j (k(x_i, ·) · (P*P k)(x_j, ·))
= Σ_{i,j} α_i α_j ((P k)(x_i, ·) · (P k)(x_j, ·))
= ((P Σ_i α_i k(x_i, ·)) · (P Σ_j α_j k(x_j, ·)))
= ‖P f‖²,
using f(x) = Σ_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^{−‖Pf‖²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined, as well as the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply
Ĉ_ℓ = ∪_{i=1}^ℓ B(x_i, ε_ℓ),  (A.1)
where B(x, ε) is the l₂(X) ball of radius ε centered at x, and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of equation A.1 with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1): Ĉ_plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ_plug-in, they analyzed Ĉ_plug-in^β = {x : p̂_ℓ(x) > β_ℓ}, where (β_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to β, but this connection cannot be readily exploited by this type of estimator.
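A minimal sketch of the estimator in equation A.1 (our illustration, with arbitrary toy data): a query point belongs to the estimate iff it lies within ε_ℓ of some training point:

```python
# Union-of-balls support estimator (equation A.1): membership test for a
# query point x against the l2 balls of radius eps around the sample.
import numpy as np

def in_support_estimate(x, X_train, eps):
    dists = np.linalg.norm(X_train - x, axis=1)
    return bool((dists <= eps).any())

X_train = np.array([[0.0, 0.0], [1.0, 0.0]])
print(in_support_estimate(np.array([0.1, 0.0]), X_train, eps=0.5))  # True
print(in_support_estimate(np.array([3.0, 3.0]), X_train, eps=0.5))  # False
```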
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters," related to C(α).
Define the excess mass over C at level α as

E_C(α) = sup{H_α(C) : C ∈ C},

where H_α(·) = (P − αλ)(·), and again λ denotes Lebesgue measure. Any set Γ_C(α) ∈ C such that

E_C(α) = H_α(Γ_C(α))

is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E_C^ℓ(α) and Γ_C^ℓ(α). In other words, his estimator is

Γ_C^ℓ(α) = arg max{(P_ℓ − αλ)(C) : C ∈ C},
where the max is not necessarily unique. Now, while Γ_C^ℓ(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, Γ_C^ℓ(α) is more closely related to the determination of density contour clusters at level α:

c_p(α) = {x : p(x) ≥ α}.
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A_ε^(ℓ) with respect to p by

A_ε^(ℓ) = {(x_1, …, x_ℓ) ∈ S^ℓ : |−(1/ℓ) log p(x_1, …, x_ℓ) − h(X)| ≤ ε},

where p(x_1, …, x_ℓ) = ∏_{i=1}^ℓ p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

vol A_ε^(ℓ) ≐ vol C_ℓ(1 − δ) ≐ 2^{ℓh}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
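As a worked special case (our illustration, not from Cover and Thomas): for the uniform density on S = [0, a],

```latex
h(X) = -\int_0^a \tfrac{1}{a} \log_2 \tfrac{1}{a} \, dx = \log_2 a ,
\qquad
2^{\ell h} = 2^{\ell \log_2 a} = a^{\ell} = \operatorname{vol} S^{\ell},
```

so the smallest set containing most of the probability has essentially the volume of the full support, with side length 2^h = a, exactly as the interpretation suggests.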
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore, the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).

Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x_o′ = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ_o′ > 0; hence, we still have α_o = 1/(νℓ). Therefore, α′ = α is still feasible, as is the primal solution (w′, ξ′, ρ′). Here, we use ξ_i′ = (1 + δ · α_o)ξ_i for i ≠ o, w′ = (1 + δ · α_o)w, and ρ′ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α_o′ = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.

Definition 3. Let (X, d) be a pseudometric space, let A be a subset of X, and let ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l_∞ norm over a finite sample X = (x_1, …, x_ℓ) for the pseudonorm on F:

‖f‖_{l_∞^X} = max_{x∈X} |f(x)|.  (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose X is a compact subset of X. Let N(ε, F, ℓ) = max_{X∈X^ℓ} N(ε, F, l_∞^X) (the maximum exists by definition of the covering number and compactness of X).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P{x : f(x) < min_{i∈[ℓ]} f(x_i) − 2γ} ≤ ε(ℓ, k, δ) = (2/ℓ)(k + log(ℓ/δ)),

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i∈[ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Estimating the Support of a High-Dimensional Distribution 1467
Let $\mathcal{X}$ be an inner product space. The following definition introduces a new inner product space based on $\mathcal{X}$.
Definition 4. Let $L(\mathcal{X})$ be the set of real-valued, nonnegative functions $f$ on $\mathcal{X}$ with support $\mathrm{supp}(f)$ countable; that is, functions in $L(\mathcal{X})$ are nonzero for at most countably many points. We define the inner product of two functions $f, g \in L(\mathcal{X})$ by

\[ f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x) g(x). \]

The 1-norm on $L(\mathcal{X})$ is defined by $\|f\|_1 = \sum_{x \in \mathrm{supp}(f)} f(x)$, and let $L_B(\mathcal{X}) = \{ f \in L(\mathcal{X}) : \|f\|_1 \leq B \}$. Now we define an embedding of $\mathcal{X}$ into the inner product space $\mathcal{X} \times L(\mathcal{X})$ via $\tau : \mathcal{X} \to \mathcal{X} \times L(\mathcal{X})$, $\tau : x \mapsto (x, \delta_x)$, where $\delta_x \in L(\mathcal{X})$ is defined by

\[ \delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases} \]
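Because functions in $L(\mathcal{X})$ have countable support, they can be represented concretely as sparse dictionaries mapping points to their nonzero values. The following sketch (our own illustration, not part of the paper) shows the inner product, the 1-norm, and the fact that pairing with $\delta_x$ evaluates a function at $x$:

```python
def inner(f, g):
    """Inner product on L(X): sum of f(x)*g(x) over supp(f).
    f, g are dicts {point: value} with nonnegative values (sparse support)."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    """1-norm on L(X): sum of f(x) over supp(f)."""
    return sum(f.values())

def delta(x):
    """delta_x: value 1 at x, zero elsewhere, so (f . delta_x) = f(x)."""
    return {x: 1.0}

f = {"a": 0.5, "b": 2.0}
print(inner(f, delta("b")))  # recovers f("b")
print(one_norm(f))
```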
For a function $f \in \mathcal{F}$, a set of training examples $X$, and $\theta \in \mathbb{R}$, we define the function $g_f \in L(\mathcal{X})$ by

\[ g_f(y) = g_f^{X,\theta}(y) = \sum_{x \in X} d(x, f, \theta)\, \delta_x(y). \]
Theorem 3. Fix $B > 0$. Consider a fixed but unknown probability distribution $P$, which has no atomic components, on the input space $\mathcal{X}$, and a sturdy class of real-valued functions $\mathcal{F}$. Then with probability $1 - \delta$ over randomly drawn training sequences $X$ of size $\ell$, for all $\gamma > 0$ and any $f \in \mathcal{F}$ and any $\theta$ such that $g_f = g_f^{X,\theta} \in L_B(\mathcal{X})$ (i.e., $\sum_{x \in X} d(x, f, \theta) \leq B$),

\[ P\big\{ x : f(x) < \theta - 2\gamma \big\} \;\leq\; \frac{2}{\ell}\Big(k + \log\frac{\ell}{\delta}\Big), \tag{B.3} \]

where $k = \big\lceil \log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell) + \log \mathcal{N}(\gamma/2, L_B(\mathcal{X}), 2\ell) \big\rceil$.
(The assumption on $P$ can in fact be weakened to require only that there is no point $x \in \mathcal{X}$ satisfying $f(x) < \theta - 2\gamma$ that has discrete probability.) The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which $f(x)$ has value less than $\theta - 2\gamma$, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by $2\gamma$ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to $\gamma$) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997); their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when $\gamma$ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity $k$ in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound $\log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell)$ and a result of Shawe-Taylor and Cristianini (2000) to bound $\log \mathcal{N}(\gamma/2, L_B(\mathcal{X}), 2\ell)$. We apply theorem 3, stratifying over possible values of $B$. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Müller, D. W. (1992). The excess mass approach in statistics (Beiträge zur Statistik). Heidelberg: Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (Eds.). (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition [in Russian]. Moscow: Nauka. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.
Definition 2. Let $f$ be a real-valued function on a space $\mathcal{X}$. Fix $\theta \in \mathbb{R}$. For $x \in \mathcal{X}$, let $d(x, f, \theta) = \max\{0, \theta - f(x)\}$. Similarly, for a training sequence $X = (x_1, \dots, x_\ell)$, we define

\[ D(X, f, \theta) = \sum_{x \in X} d(x, f, \theta). \]
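In code, the slack $d(x, f, \theta)$ and its sum $D$ are one-liners; a minimal sketch of ours, using a made-up scoring function $f$ and threshold:

```python
def d(x, f, theta):
    """Slack of x: how far f(x) falls below the threshold theta (0 if above)."""
    return max(0.0, theta - f(x))

def D(X, f, theta):
    """Total slack over a training sequence X."""
    return sum(d(x, f, theta) for x in X)

# Hypothetical example: f scores a point by its value; threshold theta = 1.0.
f = lambda x: x
X = [0.2, 0.9, 1.5, 3.0]
print([d(x, f, 1.0) for x in X])  # only points below the threshold contribute
print(D(X, f, 1.0))
```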
In the following, log denotes logarithms to base 2, and ln denotes natural logarithms.
Theorem 1 (generalization error bound). Suppose we are given a set of $\ell$ examples $X \in \mathcal{X}^\ell$, generated i.i.d. from an unknown distribution $P$, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution $f_w$ given explicitly by equation 3.10. Let $R_{w,\rho} = \{x : f_w(x) \geq \rho\}$ denote the induced decision region. With probability $1 - \delta$ over the draw of the random sample $X \in \mathcal{X}^\ell$, for any $\gamma > 0$,

\[ P\big\{ x' : x' \notin R_{w,\rho-\gamma} \big\} \;\leq\; \frac{2}{\ell} \Big( k + \log\frac{\ell^2}{2\delta} \Big), \tag{5.7} \]

where

\[ k = \frac{c_1 \log\big(c_2 \hat{\gamma}^2 \ell\big)}{\hat{\gamma}^2} + \frac{2D}{\hat{\gamma}} \log\Big( e \Big( \frac{(2\ell - 1)\hat{\gamma}}{2D} + 1 \Big) \Big) + 2, \tag{5.8} \]

\[ c_1 = 16 c^2, \qquad c_2 = \ln(2) / (4 c^2), \]

$c = 103$, $\hat{\gamma} = \gamma / \|w\|$, $D = D(X, f_{w,0}, \rho) = D(X, f_{w,\rho}, 0)$, and $\rho$ is given by equation 3.12.
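As a concrete illustration of equations 5.7 and 5.8, one can evaluate $k$ and the resulting bound numerically. All inputs below ($\ell$, $\delta$, $\hat{\gamma}$, $D$) are hypothetical choices of ours; the theorem supplies only the constants. As the discussion below notes, the bound is loose, and with these constants it only becomes nontrivial ($< 1$) for very large samples:

```python
import math

def capacity_k(l, gamma_hat, D, c=103.0):
    """k from equation 5.8; gamma_hat = gamma/||w||, D = total slack.
    log is base 2, per the conventions stated in the text."""
    c1 = 16.0 * c * c
    c2 = math.log(2.0) / (4.0 * c * c)
    term1 = c1 * math.log2(c2 * gamma_hat ** 2 * l) / gamma_hat ** 2
    term2 = (2.0 * D / gamma_hat) * math.log2(
        math.e * ((2 * l - 1) * gamma_hat / (2.0 * D) + 1))
    return term1 + term2 + 2.0

def bound(l, delta, gamma_hat, D):
    """Right-hand side of equation 5.7."""
    return (2.0 / l) * (capacity_k(l, gamma_hat, D) + math.log2(l ** 2 / (2 * delta)))

# Hypothetical values: gamma_hat = 0.5, slack D = 1, delta = 0.05.
print(bound(10 ** 6, 0.05, 0.5, 1.0))
print(bound(10 ** 7, 0.05, 0.5, 1.0))
```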
The training sample $X$ defines (via the algorithm) the decision region $R_{w,\rho}$. We expect that new points generated according to $P$ will lie in $R_{w,\rho}$. The theorem gives a probabilistic guarantee that new points lie in the larger region $R_{w,\rho-\gamma}$.
The parameter $\nu$ can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of $R_{w,\rho}$. Adjusting $\nu$ will change the value of $D$. Note that since $D$ is measured with respect to $\rho$ while the bound applies to $\rho - \gamma$, any point that is outside the region that the bound applies to will make a contribution to $D$ that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
The parameter $\gamma$ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region $R_{w,\rho-\gamma}$: one can see from equation 5.8 that $k$, and hence the right-hand side of equation 5.7, scales inversely with $\gamma$. In fact, it scales inversely with $\hat{\gamma}$; that is, it increases with $\|w\|$. This justifies measuring the complexity of the estimated region by the size of $w$, and minimizing $\|w\|^2$ in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset $\rho$ returned by the algorithm, which would correspond to $\gamma = 0$, but a smaller value $\rho - \gamma$ (with $\gamma > 0$).
We do not claim that using theorem 1 directly is a practical means to determine the parameters $\nu$ and $\gamma$ explicitly. It is loose in several ways. We suspect $c$ is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing $\gamma$. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that $\nu$ and $\gamma$ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs $(w \cdot \Phi(x))$ on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size $16 \times 16 = 256$; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter $c$, we used $0.5 \cdot 256$. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of $\nu$, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used $\nu = 50\%$, thus aiming for a description of "0-ness" which
³ Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of $c$. For small $c$, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As $c$ increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of $c$ and increase it until the number of SVs does not decrease any further.
ν, width c        0.5, 0.5    0.5, 0.5    0.1, 0.5    0.5, 0.1
frac. SVs/OLs     0.54, 0.43  0.59, 0.47  0.24, 0.03  0.65, 0.38
margin ρ/‖w‖      0.84        0.70        0.62        0.48
Figure 1 (First two pictures) A single-class SVM applied to two toy problems; $\nu = c = 0.5$, domain: $[-1, 1]^2$. In both cases, at least a fraction $\nu$ of all examples is in the estimated region (cf. Table 1). The large value of $\nu$ causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of $\nu$, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3): in the fourth picture, using $c = 0.1$, $\nu = 0.5$, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2 A single-class SVM applied to a toy problem; $c = 0.5$, domain: $[-1, 1]^2$, for various settings of the offset $\rho$. As discussed in section 3, $\nu = 1$ yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction $\nu' = 0.1$ of patterns would lie outside. From left to right, we show the results for $\nu \in \{0.1, 0.2, 0.4, 1\}$. The right-most picture corresponds to the Parzen estimator; it uses all kernels; the other estimators use roughly a fraction $\nu$ of the kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
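The Parzen-windows endpoint of Figure 2 can be sketched in a few lines: estimate the (unnormalized) density with all gaussian kernels, then choose the offset as the empirical $\nu'$-quantile of the training outputs, so that a fraction $\nu'$ of the training points lies outside. This is our own minimal 1D reimplementation of that baseline, not the paper's code:

```python
import math
import random

def parzen(train, x, width):
    """Unnormalized Parzen windows estimate: all kernels, equal weights."""
    return sum(math.exp(-(x - xi) ** 2 / width) for xi in train) / len(train)

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(500)]

# Offset adjusted so that a fraction nu_prime = 0.1 of the training points
# falls outside the estimated support (output below the offset).
nu_prime = 0.1
outputs = sorted(parzen(train, x, width=0.5) for x in train)
offset = outputs[int(nu_prime * len(train))]

outside = sum(parzen(train, x, 0.5) < offset for x in train) / len(train)
print(outside)  # close to nu_prime by construction
```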
Figure 3 Experiments on the US Postal Service OCR data set. Recognizer for digit 0: output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For $\nu = 50\%$ (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For $\nu = 5\%$ (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset $\rho$ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
[Figure 4 shows 20 digit images; their class labels, row by row: 6 9 2 8 1 8 8 6 5 3; 2 3 8 7 0 3 0 8 2 7.]
Figure 4 Subset of 20 examples randomly drawn from the USPS test set, with class labels.
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of $\nu$: for $\nu = 5\%$, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a $\nu$ value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3% to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that $\nu$ lets us control the fraction of outliers.
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of $\nu$ used. For the small values of $\nu$ that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
In addition to the experiments reported above, the algorithm has since
[Figure 5 images; beneath each, the (alleged) class label and the SVM output in units of 10⁻⁵: 9 −513, 1 −507, 0 −458, 1 −377, 7 −282, 2 −216, 3 −200, 9 −186, 5 −179, 0 −162, 3 −153, 6 −143, 6 −128, 0 −123, 7 −117, 5 −93, 0 −78, 7 −58, 6 −52, 3 −48.]
Figure 5 Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set (Size ℓ = 2007).

 ν (%)   Fraction of OLs (%)   Fraction of SVs (%)   Training Time (CPU sec)
   1            0.0                  10.0                    36
   2            0.0                  10.0                    39
   3            0.1                  10.0                    31
   4            0.6                  10.1                    40
   5            1.4                  10.6                    36
   6            1.8                  11.2                    33
   7            2.6                  11.5                    42
   8            4.1                  12.0                    53
   9            5.4                  12.9                    76
  10            6.2                  13.7                    65
  20           16.9                  22.6                   193
  30           27.5                  31.8                   269
  40           37.1                  41.7                   685
  50           47.4                  51.2                  1284
  60           58.2                  61.0                  1150
  70           68.3                  70.7                  1512
  80           78.5                  80.5                  2206
  90           89.4                  90.1                  2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is the ν = 5 row (shown in boldface in the original table).
Figure 6 Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); $c = 0.5 \cdot 256$. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often, there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let $p$ denote a density on $\mathcal{X}$. If we perform a (nonlinear) coordinate transformation in the input domain $\mathcal{X}$, then the density values will change; loosely speaking, what remains constant is $p(x) \cdot dx$, while $dx$ is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that $k$ is Green's function of $P^* P$ for an operator $P$ mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:

\[
\sum_{i,j} \alpha_i \alpha_j k\big(x_i, x_j\big)
= \sum_{i,j} \alpha_i \alpha_j \big( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \big)
= \sum_{i,j} \alpha_i \alpha_j \big( k(x_i, \cdot) \cdot (P^* P\, k)(x_j, \cdot) \big)
\]
\[
= \sum_{i,j} \alpha_i \alpha_j \big( (P k)(x_i, \cdot) \cdot (P k)(x_j, \cdot) \big)
= \Big( \Big( P \sum_i \alpha_i k \Big)(x_i, \cdot) \cdot \Big( P \sum_j \alpha_j k \Big)(x_j, \cdot) \Big)
= \| P f \|^2,
\]
using $f(x) = \sum_i \alpha_i k(x_i, x)$. The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function $f$ (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior $p(f) \sim e^{-\|Pf\|^2}$ on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
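The chain of identities above says that the dual objective $\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$ equals the squared norm $\|Pf\|^2$ of the expansion $f = \sum_i \alpha_i k(x_i, \cdot)$. One can at least verify numerically the reproducing-property half of this, $\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) = \sum_i \alpha_i f(x_i)$, with a gaussian kernel; a sanity-check sketch of ours, not from the paper:

```python
import math
import random

def k(x, y, width=0.5):
    """Gaussian kernel."""
    return math.exp(-(x - y) ** 2 / width)

random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(20)]
alphas = [random.uniform(0, 1) for _ in range(20)]

def f(x):
    """Kernel expansion f = sum_i alpha_i k(x_i, .)."""
    return sum(a * k(xi, x) for a, xi in zip(alphas, xs))

# Dual objective as a double sum ...
dual = sum(alphas[i] * alphas[j] * k(xs[i], xs[j])
           for i in range(20) for j in range(20))
# ... equals sum_i alpha_i f(x_i), the squared RKHS norm of the expansion.
norm_sq = sum(a * f(xi) for a, xi in zip(alphas, xs))
print(dual, norm_sq)
```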
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined, and the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, alongside with algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered $\mathcal{X} = \mathbb{R}^2$ with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply
\[ \hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), \tag{A.1} \]
where $B(x, \epsilon)$ is the $l_2(\mathcal{X})$ ball of radius $\epsilon$ centered at $x$, and $(\epsilon_\ell)_\ell$ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between C(1) and $\hat{C}_\ell$. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1): $\hat{C}^{\text{plug-in}} = \{ x : \hat{p}_\ell(x) > 0 \}$, where $\hat{p}_\ell$ is a kernel density estimator. In order to avoid problems with $\hat{C}^{\text{plug-in}}$, they analyzed $\hat{C}^{\text{plug-in}}_b = \{ x : \hat{p}_\ell(x) > b_\ell \}$, where $(b_\ell)_\ell$ is an appropriately chosen sequence. Clearly, for a given distribution, $\alpha$ is related to $b$, but this connection cannot be readily exploited by this type of estimator.
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1). (See also Korostelev & Tsybakov, 1993, chap. 8.)
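The two estimators discussed above are easy to sketch numerically. The following is a minimal illustration (our own, not from the paper; the gaussian kernel, bandwidth, and threshold are arbitrary choices) of membership tests for the union-of-balls estimator (A.1) and for the thresholded plug-in estimator:

```python
import numpy as np

def in_ball_union(x, sample, eps):
    # x lies in the estimator (A.1) iff it is within eps of some sample point
    return bool((np.linalg.norm(sample - x, axis=1) <= eps).any())

def in_plugin_estimate(x, sample, bandwidth, beta):
    # thresholded plug-in estimator {x : p_hat(x) > beta}, with p_hat a
    # gaussian kernel density estimate (one illustrative choice of kernel)
    d2 = ((sample - x) ** 2).sum(axis=1)
    dim = sample.shape[1]
    p_hat = np.exp(-d2 / (2 * bandwidth**2)).sum() / (
        len(sample) * (np.sqrt(2 * np.pi) * bandwidth) ** dim)
    return bool(p_hat > beta)

rng = np.random.default_rng(0)
sample = rng.standard_normal((200, 2))
print(in_ball_union(sample[0], sample, 0.5))                # inside
print(in_ball_union(np.array([50.0, 50.0]), sample, 0.5))   # far outside
```

Neither estimator exposes the mass parameter α directly; as noted above, for the plug-in version the threshold β_ℓ only relates to α implicitly.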
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Muller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).
Define the excess mass over 𝒞 at level α as

E_𝒞(α) = sup {H_α(C) : C ∈ 𝒞},

where H_α(·) = (P − αλ)(·) and again λ denotes Lebesgue measure. Any set Γ_𝒞(α) ∈ 𝒞 such that

E_𝒞(α) = H_α(Γ_𝒞(α))

is called a generalized α-cluster in 𝒞. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E^ℓ_𝒞(α) and Γ^ℓ_𝒞(α). In other words, his estimator is

Γ^ℓ_𝒞(α) = argmax {(P_ℓ − αλ)(C) : C ∈ 𝒞},
where the max is not necessarily unique. Now, while Γ^ℓ_𝒞(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, Γ^ℓ_𝒞(α) is more closely related to the determination of density contour clusters at level α:

c_p(α) = {x : p(x) ≥ α}.
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A^{(ℓ)}_ε with respect to p by

A^{(ℓ)}_ε = {(x₁, …, x_ℓ) ∈ S^ℓ : |−(1/ℓ) log p(x₁, …, x_ℓ) − h(X)| ≤ ε},

where p(x₁, …, x_ℓ) = Π_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

vol A^{(ℓ)}_ε ≐ vol C_ℓ(1−δ) ≐ 2^{ℓh}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
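The statement rests on the asymptotic equipartition property, which is easy to check numerically. A small sketch (our own; the standard normal is chosen for convenience, entropies in bits): for a large i.i.d. sample, −(1/ℓ) log₂ p(x₁, …, x_ℓ) concentrates around h(X), so the drawn sample is typical with high probability.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 100_000
x = rng.standard_normal(l)

# -(1/l) * log2 p(x_1, ..., x_l) for the standard normal density p
neg_log2_p = 0.5 * np.log2(2 * np.pi) + x**2 / (2 * np.log(2))
empirical = neg_log2_p.mean()

# differential entropy of N(0, 1) in bits: h = 0.5 * log2(2 * pi * e)
h = 0.5 * np.log2(2 * np.pi * np.e)
print(empirical, h)   # close: the sample lies in the typical set
```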
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints; that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).
Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Scholkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x′_o = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ′_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α′ = α is still feasible, as is the primal solution (w′, ξ′, ρ′). Here, we use ξ′_i = (1 + δ · α_o)ξ_i for i ≠ o, w′ = (1 + δ · α_o)w, and ρ′ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as we still have α′_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R
Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
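For finite point sets, covering numbers are easy to estimate from above. A minimal sketch (our own): greedily picking still-uncovered points as centers yields an ε-cover, so its cardinality upper-bounds N(ε, A, d); since the chosen centers are pairwise more than ε apart, it is also at most N(ε/2, A, d).

```python
import numpy as np

def greedy_cover(points, eps):
    """Greedily pick uncovered points as centers; the result is an eps-cover,
    so its size upper-bounds N(eps, A, d)."""
    pts = np.asarray(points, dtype=float)
    uncovered = np.ones(len(pts), dtype=bool)
    centers = []
    while uncovered.any():
        c = pts[np.argmax(uncovered)]        # first still-uncovered point
        centers.append(c)
        # mark all points within eps as covered ("<= eps", as in definition 3)
        uncovered &= np.linalg.norm(pts - c, axis=1) > eps
    return centers

# ten collinear points spaced one apart, eps = 1: centers at 0, 2, 4, 6, 8
print(len(greedy_cover([[float(i), 0.0] for i in range(10)], 1.0)))
```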
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the ℓ∞ norm over a finite sample X = (x₁, …, x_ℓ) for the pseudonorm on F:

‖f‖_{ℓ∞(X)} = max_{x∈X} |f(x)|.   (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose 𝒳 is a compact subset of X. Let N(ε, F, ℓ) = max_{X∈𝒳^ℓ} N(ε, F, ℓ∞(X)) (the maximum exists by definition of the covering number and compactness of 𝒳).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x₁, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P{x : f(x) < min_{i∈[ℓ]} f(x_i) − 2γ} ≤ ε(ℓ, k, δ) = (2/ℓ)(k + log(ℓ/δ)),

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Scholkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i∈[ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.

Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with countable support supp(f); that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

f · g = Σ_{x∈supp(f)} f(x) g(x).

The 1-norm on L(X) is defined by ‖f‖₁ = Σ_{x∈supp(f)} f(x), and we let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

δ_x(y) = 1 if y = x, and 0 otherwise.
For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by

g_f(y) = g^{X,θ}_f(y) = Σ_{x∈X} d(x, f, θ) δ_x(y).
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g^{X,θ}_f ∈ L_B(X) (i.e., Σ_{x∈X} d(x, f, θ) ≤ B),

P{x : f(x) < θ − 2γ} ≤ (2/ℓ)(k + log(ℓ/δ)),   (B.3)

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.
(The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability.) The proof is given in Scholkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997); their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Scholkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Scholkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnorr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Scholkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Muller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Scholkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Scholkopf, B., Smola, A., & Muller, K.-R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Scholkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Scholkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Scholkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Scholkopf, B., & Muller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Scholkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Scholkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Scholkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Scholkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R^{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂; that is, it increases with w. This justifies measuring the complexity of the estimated region by the size of w and minimizing ‖w‖² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value, ρ − γ (with γ > 0).
We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions, as in Williamson, Smola, and Scholkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.
6 Experiments
We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one.
Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the US Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We used a gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Scholkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness" which
³ Hayton, Scholkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
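This footnote's heuristic is straightforward to automate. A sketch (our own; scikit-learn's OneClassSVM is used, whose gamma corresponds to 1/c for the kernel of equation 3.3, and the candidate grid, ν, and stopping rule are assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

def choose_width(X, cs, nu=0.05):
    """Increase the gaussian width c (k(x, y) = exp(-||x - y||^2 / c), so
    gamma = 1/c) until the SV count stops decreasing; return that c."""
    n_sv_prev = None
    for c in cs:                                   # cs sorted ascending
        n_sv = len(OneClassSVM(kernel="rbf", gamma=1.0 / c, nu=nu).fit(X).support_)
        if n_sv_prev is not None and n_sv >= n_sv_prev:
            return c                               # no further decrease
        n_sv_prev = n_sv
    return cs[-1]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
c_star = choose_width(X, [0.01, 0.1, 1.0, 10.0, 100.0])
print(c_star)
```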
ν, width c:     0.5, 0.5    0.5, 0.5    0.1, 0.5    0.5, 0.1
frac. SVs/OLs:  0.54, 0.43  0.59, 0.47  0.24, 0.03  0.65, 0.38
margin ρ/‖w‖:   0.84        0.70        0.62        0.48
Figure 1 (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3). In the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2 A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν′ = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction ν of kernels. Note that, as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
Figure 3 Experiments on the US Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
6 9 2 8 1 8 8 6 5 3
2 3 8 7 0 3 0 8 2 7
Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such) while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope of identifying mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3% to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
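The label-augmentation step can be sketched as follows (our own illustration on synthetic data, not USPS; scikit-learn's OneClassSVM stands in for the paper's implementation, and the one-hot encoding, ν, and γ are assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

def rank_label_outliers(X, y, n_classes, nu=0.05, gamma=0.25):
    # append a one-hot encoding of the (possibly wrong) label to each pattern,
    # then rank (pattern, label) pairs by the SVM output: lowest = worst
    Z = np.hstack([X, np.eye(n_classes)[y]])
    clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(Z)
    return np.argsort(clf.decision_function(Z))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
y[0] = 1                      # deliberately mislabel one point of class 0
ranks = rank_label_outliers(X, y, 2)
print(ranks[:5])              # the mislabeled index should rank among the worst
```

A mismatched label adds a constant to the squared distance to every same-cluster point, which damps all of the point's kernel similarities and pushes its output down, exactly the "usual pattern with unusual label" case described above.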
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
In addition to the experiments reported above, the algorithm has since
9: −513, 1: −507, 0: −458, 1: −377, 7: −282, 2: −216, 3: −200, 9: −186, 5: −179, 0: −162
3: −153, 6: −143, 6: −128, 0: −123, 7: −117, 5: −93, 0: −78, 7: −58, 6: −52, 3: −48
Figure 5 Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult," in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν; USPS Test Set, Size ℓ = 2007.

ν [%]   Fraction of OLs   Fraction of SVs   Training Time (CPU sec)
  1          0.0%             10.0%                  36
  2          0.0%             10.0%                  39
  3          0.1%             10.0%                  31
  4          0.6%             10.1%                  40
  5          1.4%             10.6%                  36
  6          1.8%             11.2%                  33
  7          2.6%             11.5%                  42
  8          4.1%             12.0%                  53
  9          5.4%             12.9%                  76
 10          6.2%             13.7%                  65
 20         16.9%             22.6%                 193
 30         27.5%             31.8%                 269
 40         37.1%             41.7%                 685
 50         47.4%             51.2%                1284
 60         58.2%             61.0%                1150
 70         68.3%             70.7%                1512
 80         78.5%             80.5%                2206
 90         89.4%             90.1%                2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. Proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is the ν = 5 row.
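The two-sided control visible in the table (fraction of OLs ≤ ν ≤ fraction of SVs, Proposition 3) can be reproduced in a few lines. A sketch with scikit-learn's OneClassSVM on synthetic data (our own choices of data, γ, and tolerance slack):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))

for nu in [0.1, 0.3, 0.5, 0.7, 0.9]:
    clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
    frac_sv = len(clf.support_) / len(X)
    frac_ol = (clf.decision_function(X) < -1e-8).mean()
    # cf. Proposition 3 / Table 1: outliers <= nu <= SVs (small numerical slack)
    assert frac_ol <= nu + 0.01
    assert frac_sv >= nu - 0.01
```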
Figure 6 Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Scholkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
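The empirical measure just described admits a one-line implementation; a minimal sketch (the sample and the region below are made up for illustration):

```python
# Empirical measure of a region C: the fraction of sample points falling in C.
def empirical_measure(points, in_region):
    """Fraction of sample points for which the indicator in_region is true."""
    return sum(1 for x in points if in_region(x)) / len(points)

sample = [0.1, 0.4, 0.45, 0.7, 0.9]
# Illustrative region of interest: the interval [0, 0.5].
print(empirical_measure(sample, lambda x: 0.0 <= x <= 0.5))  # -> 0.6
```

Our algorithm inverts this: it fixes the desired measure and searches for a (regularized) region attaining it.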
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Scholkopf, & Muller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
Σ_{i,j} α_i α_j k(x_i, x_j)
  = Σ_{i,j} α_i α_j (k(x_i, ·) · δ_{x_j}(·))
  = Σ_{i,j} α_i α_j (k(x_i, ·) · (P*P k)(x_j, ·))
  = Σ_{i,j} α_i α_j ((P k)(x_i, ·) · (P k)(x_j, ·))
  = ((P Σ_i α_i k(x_i, ·)) · (P Σ_j α_j k(x_j, ·)))
  = ‖P f‖²,
1462 Bernhard Scholkopf et al
using f(x) = Σ_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^{−‖Pf‖²} on the function space.
Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
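The dual objective above is simply a quadratic form in the expansion coefficients; the following sketch evaluates it for a Gaussian kernel. The points, width, and coefficients are illustrative stand-ins, not the output of the actual optimization:

```python
import math

# Sketch: the dual objective sum_{i,j} a_i a_j k(x_i, x_j) is a quadratic form
# in the expansion coefficients; for kernels that are Green's functions of P*P,
# it equals ||Pf||^2, a roughness penalty on f.
def gaussian_kernel(x, y, c=0.5):
    return math.exp(-c * sum((u - v) ** 2 for u, v in zip(x, y)))

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # made-up points
a = [0.25] * 4                                        # uniform coefficients, sum 1
penalty = sum(a[i] * a[j] * gaussian_kernel(X[i], X[j])
              for i in range(4) for j in range(4))
print(penalty > 0)   # -> True: the Gaussian Gram matrix is positive definite
```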
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Scholkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
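The resulting decision rule, a kernel expansion compared against the offset ρ (cf. equation 3.10), can be sketched as follows. The support vectors, multipliers, and offset below are hypothetical stand-ins for what the quadratic program would actually return:

```python
import math

def gaussian_kernel(x, y, c=0.5):
    return math.exp(-c * sum((u - v) ** 2 for u, v in zip(x, y)))

sv = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # hypothetical support vectors
alpha = [0.4, 0.3, 0.3]                     # hypothetical Lagrange multipliers
rho = 0.3                                   # hypothetical offset

def decide(x):
    """Sign of the kernel expansion minus the offset, as in equation 3.10."""
    value = sum(a * gaussian_kernel(s, x) for a, s in zip(alpha, sv)) - rho
    return 1 if value >= 0 else -1

print(decide((0.2, 0.2)))   # -> 1: near the data, inside the estimated region
print(decide((5.0, 5.0)))   # -> -1: far away, flagged as novel
```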
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined and the one of Vapnik and Chapelle (2000) will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Scholkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997 for further references). The nonparametric estimator is simply

Ĉ_ℓ = ⋃_{i=1}^{ℓ} B(x_i, ε_ℓ),   (A.1)
where B(x, ε) is the l₂(X) ball of radius ε centered at x, and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), Ĉ_plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ_plug-in, they analyzed Ĉ_plug-in^β = {x : p̂_ℓ(x) > β_ℓ}, where (β_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to β_ℓ, but this connection cannot be readily exploited by this type of estimator.
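The union-of-balls estimator in equation A.1 is easy to sketch: a point belongs to the estimate iff it lies within ε of some sample point. The sample and radius below are illustrative:

```python
import math

# Minimal sketch of the nonparametric estimator (A.1): membership in a union
# of l2 balls of radius eps centered at the sample points.
def in_support_estimate(x, sample, eps):
    return any(math.dist(x, xi) <= eps for xi in sample)

sample = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5)]
eps = 0.6
print(in_support_estimate((0.4, 0.1), sample, eps))  # -> True, near (0, 0)
print(in_support_estimate((3.0, 3.0), sample, eps))  # -> False, outside all balls
```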
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) or the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Muller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).
Define the excess mass over C at level α as

E_C(α) = sup {H_α(C) : C ∈ C},

where H_α(·) = (P − αλ)(·) and again λ denotes Lebesgue measure. Any set Γ_C(α) ∈ C such that

E_C(α) = H_α(Γ_C(α))

is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E_{C,ℓ}(α) and Γ_{C,ℓ}(α). In other words, his estimator is

Γ_{C,ℓ}(α) = arg max {(P_ℓ − αλ)(C) : C ∈ C},
where the max is not necessarily unique. Now, while Γ_{C,ℓ}(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, Γ_{C,ℓ}(α) is more closely related to the determination of density contour clusters at level α:

c_p(α) = {x : p(x) ≥ α}.
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes, but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A_ε^{(ℓ)} with respect to p by

A_ε^{(ℓ)} = {(x_1, …, x_ℓ) ∈ S^ℓ : |−(1/ℓ) log p(x_1, …, x_ℓ) − h(X)| ≤ ε},

where p(x_1, …, x_ℓ) = ∏_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log (a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that, for all ε, δ < 1/2,

vol A_ε^{(ℓ)} ≐ vol C_ℓ(1 − δ) ≐ 2^{ℓh}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
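As a quick numeric illustration of this interpretation (our own, not from the article): for X uniform on [0, b], the differential entropy in bits is h = log₂ b, so the "side length" 2^h recovers b, the actual length of the support:

```python
import math

# For X ~ Uniform[0, b]: p(x) = 1/b on [0, b], so
# h(X) = -integral (1/b) * log2(1/b) dx = log2(b) bits, and 2**h = b.
b = 4.0
h = math.log2(b)          # differential entropy of Uniform[0, b], in bits
print(2 ** h)             # -> 4.0, the length of the support
```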
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).
Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for, if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).
Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin, and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Scholkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x'_o = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ'_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α' = α is still feasible, as is the primal solution (w', ξ', ρ'). Here, we use ξ'_i = (1 + δ · α_o)ξ_i for i ≠ o, w' = (1 + δ · α_o)w, and ρ' as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α'_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, let A be a subset of X, and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
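Covering numbers of finite sets can be upper-bounded by a greedy construction; a small sketch of definition 3 in that spirit (the point set and metric below are made up, and a greedy cover is generally not minimal, so its size only upper-bounds N(ε, A, d)):

```python
# Greedy epsilon-cover: add a point to the cover only if no existing cover
# element is within eps of it. Every point of A then has a cover element
# within eps (itself, or the element that blocked it).
def greedy_cover(A, eps, d):
    cover = []
    for a in A:
        if all(d(a, u) > eps for u in cover):
            cover.append(a)
    return cover

A = [0.0, 0.1, 0.2, 1.0, 1.05, 2.0]
cover = greedy_cover(A, 0.25, lambda x, y: abs(x - y))
print(cover)   # -> [0.0, 1.0, 2.0]
```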
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ norm over a finite sample X = (x_1, …, x_ℓ) for the pseudonorm on F:

‖f‖_{l∞(X)} = max_{x ∈ X} |f(x)|.   (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose X is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ X^ℓ} N(ε, F, l∞(X)) (the maximum exists by definition of the covering number and compactness of X).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P{x : f(x) < min_{i ∈ [ℓ]} f(x_i) − 2γ} ≤ ε(k, ℓ, δ) = (2/ℓ)(k + log(ℓ/δ)),

where k = ⌈log N(γ, F, 2ℓ)⌉.
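To get a sense of the scale of this bound, one can plug in illustrative numbers (the values of k, ℓ, and δ below are made up, and the natural logarithm is used here):

```python
import math

# eps(k, l, delta) = (2/l) * (k + log(l/delta)), the bound of theorem 2.
def error_bound(k, l, delta):
    return (2.0 / l) * (k + math.log(l / delta))

print(error_bound(k=50, l=2000, delta=0.05))  # roughly 0.06
```

As expected, the bound shrinks as the sample size ℓ grows and loosens as the covering-number exponent k grows.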
The proof, which is given in Scholkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i ∈ [ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.
Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with support supp(f) countable, that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

f · g = Σ_{x ∈ supp(f)} f(x) g(x).

The 1-norm on L(X) is defined by ‖f‖₁ = Σ_{x ∈ supp(f)} f(x), and let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

δ_x(y) = 1 if y = x, and 0 otherwise.
For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by

g_f(y) = g^{X,θ}_f(y) = Σ_{x ∈ X} d(x, f, θ) δ_x(y).
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g^{X,θ}_f ∈ L_B(X) (i.e., Σ_{x ∈ X} d(x, f, θ) ≤ B),

P{x : f(x) < θ − 2γ} ≤ (2/ℓ)(k + log(ℓ/δ)),   (B.3)

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.
The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability. The proof is given in Scholkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Scholkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Scholkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnorr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functional of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Scholkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Muller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR ID=MSR-TR-99-87.
Scholkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Scholkopf, B., Smola, A., & Muller, K.-R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Scholkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Scholkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Scholkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Scholkopf, B., & Muller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Scholkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Scholkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Diettrich, & P. Anuzis (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian.] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Scholkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Scholkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
ν, width c      | 0.5, 0.5 | 0.5, 0.5 | 0.1, 0.5 | 0.5, 0.1
frac. SVs/OLs   | 0.54, 0.43 | 0.59, 0.47 | 0.24, 0.03 | 0.65, 0.38
margin ρ/‖w‖    | 0.84 | 0.70 | 0.62 | 0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². In both cases, at least a fraction of ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3). In the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]², for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution's support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν' = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that, as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fractions of outliers in the training set appear slightly larger than they should be.
6 9 2 8 1 8 8 6 5 3
2 3 8 7 0 3 0 8 2 7
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels.
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.
Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope to identify mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3% to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size which is at most quadratic.
In addition to the experiments reported above the algorithm has since
[Figure 5 shows 20 images; beneath each, the alleged class label and the output (in units of 10^−5): 9/−513, 1/−507, 0/−458, 1/−377, 7/−282, 2/−216, 3/−200, 9/−186, 5/−179, 0/−162, 3/−153, 6/−143, 6/−128, 0/−123, 7/−117, 5/−93, 0/−78, 7/−58, 6/−52, 3/−48.]
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10^−5) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν. USPS Test Set, Size ℓ = 2007.
 ν [%]   Fraction of OLs [%]   Fraction of SVs [%]   Training Time (CPU sec)
   1            0.0                  10.0                     36
   2            0.0                  10.0                     39
   3            0.1                  10.0                     31
   4            0.6                  10.1                     40
   5            1.4                  10.6                     36
   6            1.8                  11.2                     33
   7            2.6                  11.5                     42
   8            4.1                  12.0                     53
   9            5.4                  12.9                     76
  10            6.2                  13.7                     65
  20           16.9                  22.6                    193
  30           27.5                  31.8                    269
  40           37.1                  41.7                    685
  50           47.4                  51.2                   1284
  60           58.2                  61.0                   1150
  70           68.3                  70.7                   1512
  80           78.5                  80.5                   2206
  90           89.4                  90.1                   2349
Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. Proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments corresponds to ν = 5%.
1460 Bernhard Scholkopf et al
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Scholkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) dx, while dx is transformed too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is the Green's function of P*P for an operator P mapping into some inner product space (Smola, Scholkopf, & Muller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
$$
\sum_{i,j} a_i a_j k(x_i, x_j)
= \sum_{i,j} a_i a_j \left( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \right)
= \sum_{i,j} a_i a_j \left( k(x_i, \cdot) \cdot (P^* P k)(x_j, \cdot) \right)
$$
$$
= \sum_{i,j} a_i a_j \left( (P k)(x_i, \cdot) \cdot (P k)(x_j, \cdot) \right)
= \left( \left( P \sum_i a_i k(x_i, \cdot) \right) \cdot \left( P \sum_j a_j k(x_j, \cdot) \right) \right)
= \| P f \|^2,
$$
using f(x) = Σ_i a_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^{−‖Pf‖²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
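To make the smoothness interpretation concrete for one common choice: for a gaussian kernel k(x, y) = e^{−‖x−y‖²/(2σ²)}, the associated regularizer penalizes derivatives of all orders (following Smola, Scholkopf, & Muller, 1998; the normalization constants below are quoted up to convention):

```latex
\|Pf\|^2 \;=\; \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n!\,2^{n}}
\int \left( \hat{O}^{n} f(x) \right)^{2} \, dx,
\qquad
\hat{O}^{2n} = \Delta^{n}, \quad \hat{O}^{2n+1} = \nabla \Delta^{n},
```

so that minimizing the dual objective favors functions all of whose derivatives are small, which is the sense in which the estimated f is maximally smooth.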
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Scholkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined and the one of Vapnik and Chapelle (2000) will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Scholkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply

$$
\hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), \tag{A.1}
$$
where B(x, ε) is the l₂(X) ball of radius ε centered at x, and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of equation A.1 with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), Ĉ_plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ_plug-in, they analyzed Ĉ_plug-in,β = {x : p̂_ℓ(x) > β_ℓ}, where (β_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to β, but this connection cannot be readily exploited by this type of estimator.
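The estimator of equation A.1 amounts to a union-of-balls membership test: a point belongs to the estimate iff it lies within distance ε of some training point. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def in_support_estimate(x, sample, eps):
    # x lies in the estimate C_hat (a union of l2 balls, equation A.1)
    # iff it is within distance eps of at least one training point
    d = np.linalg.norm(np.asarray(sample) - np.asarray(x), axis=1)
    return bool((d <= eps).any())

sample = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
inside = in_support_estimate([0.9, 0.1], sample, eps=0.5)    # near (1, 0)
outside = in_support_estimate([3.0, 3.0], sample, eps=0.5)   # far from all points
```

Consistency then hinges entirely on letting ε_ℓ shrink at the right rate as the sample grows.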
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1). (See also Korostelev & Tsybakov, 1993, chap. 8.)
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Muller, 1992) to construct an estimator of "generalized α-clusters," related to C(α).

Define the excess mass over 𝒞 at level α as

$$
E_{\mathcal{C}}(\alpha) = \sup \{ H_\alpha(C) : C \in \mathcal{C} \},
$$

where H_α(·) = (P − αλ)(·) and, again, λ denotes Lebesgue measure. Any set C_𝒞(α) ∈ 𝒞 such that

$$
E_{\mathcal{C}}(\alpha) = H_\alpha( C_{\mathcal{C}}(\alpha) )
$$

is called a generalized α-cluster in 𝒞. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E_{𝒞,ℓ}(α) and C_{𝒞,ℓ}(α). In other words, his estimator is

$$
C_{\mathcal{C},\ell}(\alpha) = \arg\max \{ (P_\ell - \alpha\lambda)(C) : C \in \mathcal{C} \},
$$
where the max is not necessarily unique. Now, while C_{𝒞,ℓ}(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, C_{𝒞,ℓ}(α) is more closely related to the determination of density contour clusters at level α:

$$
c_p(\alpha) = \{ x : p(x) \ge \alpha \}.
$$
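In one dimension, with 𝒞 the class of closed intervals, the empirical excess-mass maximizer can be computed exactly, since an optimal interval may be assumed to have sample points as endpoints (shrinking an endpoint toward the nearest sample point leaves P_ℓ unchanged while decreasing Lebesgue measure). A small sketch, with names of our choosing:

```python
import numpy as np

def empirical_excess_mass_interval(xs, alpha):
    # maximize (P_l - alpha * lambda)([a, b]) over intervals whose
    # endpoints are sample points; P_l is the empirical measure,
    # lambda the length of the interval
    xs = np.sort(np.asarray(xs, dtype=float))
    l = len(xs)
    best_val, best_ab = float("-inf"), None
    for i in range(l):
        for j in range(i, l):
            val = (j - i + 1) / l - alpha * (xs[j] - xs[i])
            if val > best_val:
                best_val, best_ab = val, (xs[i], xs[j])
    return best_val, best_ab

# three clustered points and one distant point: the cluster wins
val, (a, b) = empirical_excess_mass_interval([0.0, 0.1, 0.2, 5.0], alpha=1.0)
```

Here the cluster [0, 0.2] attains excess mass 3/4 − 0.2 = 0.55, whereas stretching the interval to include the point at 5 costs far more Lebesgue measure than it gains in empirical mass.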
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ ℕ, define the typical set A_ε^{(ℓ)} with respect to p by

$$
A_\epsilon^{(\ell)} = \left\{ (x_1, \ldots, x_\ell) \in S^\ell :
\left| -\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X) \right| \le \epsilon \right\},
$$

where p(x_1, …, x_ℓ) = ∏_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log (a_ℓ / b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

$$
\operatorname{vol} A_\epsilon^{(\ell)} \doteq \operatorname{vol} C_\ell(1 - \delta) \doteq 2^{\ell h}.
$$

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
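As a minimal one-dimensional sanity check of that interpretation (our example, not from the text): for X uniform on [0, a], the density is 1/a, so h(X) = log₂ a bits, and the smallest set containing all of the probability is [0, a] itself, whose volume is exactly 2^h:

```python
import math

a = 3.0
# differential entropy (in bits) of Uniform(0, a):
# h = -∫ (1/a) * log2(1/a) dx over [0, a] = log2(a)
h = math.log2(a)
side = 2 ** h   # predicted side length / volume of the minimal support set
```

For this distribution the asymptotic statement is exact at every ℓ, since the product density is constant on its support.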
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore, the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i (w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Scholkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x'_o = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ'_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α' = α is still feasible, as is the primal solution (w', ξ', ρ'). Here, we use ξ'_i = (1 + δ · α_o) ξ_i for i ≠ o, w' = (1 + δ · α_o) w, and ρ' as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as we still have α'_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, let A be a subset of X, and let ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ norm over a finite sample X = (x_1, …, x_ℓ) for the pseudonorm on F:

$$
\| f \|_{l^X_\infty} = \max_{x \in X} | f(x) |. \tag{B.2}
$$

(The (pseudo)norm induces a (pseudo)metric in the usual way.) Suppose 𝒳 is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ 𝒳^ℓ} N(ε, F, l^X_∞) (the maximum exists by definition of the covering number and compactness of 𝒳).
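Definition 3, together with the l∞(X) pseudometric of equation B.2, can be made concrete with a greedy construction, which yields an upper bound on the covering number. A toy sketch using a grid of linear functions (all names are ours):

```python
def linf_dist(f, g, X):
    # pseudometric induced by (B.2): maximum disagreement on the sample X
    return max(abs(f(x) - g(x)) for x in X)

def greedy_cover_size(fs, X, eps):
    # size of a greedily built eps-cover of fs;
    # an upper bound on N(eps, fs, l_inf(X))
    uncovered = list(fs)
    n = 0
    while uncovered:
        center = uncovered[0]
        uncovered = [f for f in uncovered if linf_dist(center, f, X) > eps]
        n += 1
    return n

# function class: x -> w * x for w on a grid; sample X = [1.0],
# so the pseudometric reduces to |w - w'|
fs = [lambda x, w=w: w * x for w in [i / 10 for i in range(11)]]
n_cover = greedy_cover_size(fs, [1.0], eps=0.25)
```

Note the `w=w` default argument, which freezes each grid value into its lambda; on a one-point sample the covering number is just that of the parameter interval [0, 1] at scale 0.25.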
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

$$
P \left\{ x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma \right\}
\le \epsilon(k, \ell, \delta) = \frac{2}{\ell} \left( k + \log \frac{\ell}{\delta} \right),
$$

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Scholkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
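To get a feel for the magnitude of such a bound, one can evaluate ε(k, ℓ, δ) directly (our helper; we assume natural logarithms, matching the form of the bound as stated):

```python
import math

def theorem2_bound(l, k, delta):
    # epsilon(k, l, delta) = (2 / l) * (k + log(l / delta))
    return (2.0 / l) * (k + math.log(l / delta))

# e.g. l = 2007 (the USPS test set size), a covering-number exponent
# k = 100 (illustrative only), and confidence delta = 0.05
eps = theorem2_bound(l=2007, k=100, delta=0.05)
```

The bound decays essentially as k/ℓ, so it is only informative once the sample size comfortably exceeds the log covering number.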
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i∈[ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.
Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with countable support supp(f); that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

$$
f \cdot g = \sum_{x \in \operatorname{supp}(f)} f(x) g(x).
$$

The 1-norm on L(X) is defined by ‖f‖₁ = Σ_{x∈supp(f)} f(x), and let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

$$
\delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases}
$$

For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by

$$
g_f(y) = g^{X,\theta}_f(y) = \sum_{x \in X} d(x, f, \theta) \, \delta_x(y).
$$
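The countably supported functions of Definition 4 admit a direct rendering as dictionaries keyed by the support points (a toy sketch; all names are ours):

```python
def sparse_inner(f, g):
    # inner product on L(X): sum of f(x) * g(x) over the support of f
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def delta(x):
    # the embedding component x -> delta_x of Definition 4
    return {x: 1.0}

ip = sparse_inner(delta("a"), delta("a"))     # delta_x is a unit vector
orth = sparse_inner(delta("a"), delta("b"))   # distinct deltas are orthogonal
norm1 = sum({"a": 2.0, "b": 1.0}.values())    # the 1-norm of Definition 4
```

The dictionary representation makes the countable-support restriction automatic: only finitely many points are ever stored, yet the index set X may be arbitrary.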
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g^{X,θ}_f ∈ L_B(X) (i.e., Σ_{x∈X} d(x, f, θ) ≤ B),

$$
P \left\{ x : f(x) < \theta - 2\gamma \right\}
\le \frac{2}{\ell} \left( k + \log \frac{\ell}{\delta} \right), \tag{B.3}
$$

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.

(The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability.) The proof is given in Scholkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Scholkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Scholkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 3799-1 and Sm 621-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnorr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Scholkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Muller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Scholkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Scholkopf, B., Smola, A., & Muller, K.-R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Scholkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Scholkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Scholkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Scholkopf, B., & Muller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Scholkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Scholkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition [in Russian]. Moscow: Nauka. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Scholkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Scholkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
Figure 3: Experiments on the US Postal Service OCR data set. Recognizer for digit 0: output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fraction of outliers in the training set appears slightly larger than it should be.
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels (top row: 6 9 2 8 1 8 8 6 5 3; bottom row: 2 3 8 7 0 3 0 8 2 7).
captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7
Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size that is at most quadratic.
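The at-most-quadratic claim amounts to fitting a line to training time versus sample size on log-log axes and reading off the slope. A minimal sketch of that fit (the timing data here is synthetic, purely for illustration):

```python
import numpy as np

def scaling_exponent(sizes, times):
    # Fit log2(time) = p * log2(size) + const by least squares;
    # the slope p is the empirical scaling exponent.
    p, _ = np.polyfit(np.log2(sizes), np.log2(times), 1)
    return p

# Synthetic timings growing like l**2 (the worst case quoted above):
sizes = np.array([250.0, 500.0, 1000.0, 2000.0])
times = 1e-5 * sizes ** 2
print(round(float(scaling_exponent(sizes, times)), 2))  # -> 2.0
```

Applied to measured CPU times, a slope near 1 indicates linear scaling and a slope near 2.5 the behavior reported below for large ν.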
In addition to the experiments reported above, the algorithm has since
Estimating the Support of a High-Dimensional Distribution 1459
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience in units of 10^-5) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult," in that they are either atypical or even mislabeled. (Class label and output, in rank order: 9: 513, 1: 507, 0: 458, 1: 377, 7: 282, 2: 216, 3: 200, 9: 186, 5: 179, 0: 162; 3: 153, 6: 143, 6: 128, 0: 123, 7: 117, 5: 93, 0: 78, 7: 58, 6: 52, 3: 48.)
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set, Size ℓ = 2007.

ν (%)   Fraction of OLs (%)   Fraction of SVs (%)   Training Time (CPU sec)
  1            0.0                  10.0                     36
  2            0.0                  10.0                     39
  3            0.1                  10.0                     31
  4            0.6                  10.1                     40
  5            1.4                  10.6                     36
  6            1.8                  11.2                     33
  7            2.6                  11.5                     42
  8            4.1                  12.0                     53
  9            5.4                  12.9                     76
 10            6.2                  13.7                     65
 20           16.9                  22.6                    193
 30           27.5                  31.8                    269
 40           37.1                  41.7                    685
 50           47.4                  51.2                   1284
 60           58.2                  61.0                   1150
 70           68.3                  70.7                   1512
 80           78.5                  80.5                   2206
 90           89.4                  90.1                   2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. Proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments is the ν = 5 row (shown in boldface in the original).
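For concreteness, the quadratic program behind these numbers — minimize (1/2) Σ_{ij} α_i α_j k(x_i, x_j) subject to 0 ≤ α_i ≤ 1/(νℓ) and Σ_i α_i = 1 — can be sketched with a plain projected-gradient solver in place of the SMO solver used in the experiments. Everything here (data, kernel width, step size, iteration count) is an illustrative assumption, not the article's setup:

```python
import numpy as np

def gaussian_gram(X, c):
    # K[i, j] = exp(-||x_i - x_j||^2 / c), cf. equation 3.3
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def project_capped_simplex(v, C):
    # Euclidean projection of v onto {a : 0 <= a_i <= C, sum_i a_i = 1},
    # by bisecting on the shift tau in clip(v - tau, 0, C).
    lo, hi = v.min() - C, v.max()
    for _ in range(100):
        tau = 0.5 * (lo + hi)
        if np.clip(v - tau, 0.0, C).sum() > 1.0:
            lo = tau
        else:
            hi = tau
    return np.clip(v - hi, 0.0, C)

def one_class_svm_dual(K, nu, iters=2000):
    # min_a 0.5 a^T K a  s.t.  0 <= a_i <= 1/(nu*l), sum_i a_i = 1
    l = K.shape[0]
    C = 1.0 / (nu * l)
    a = np.full(l, 1.0 / l)   # feasible starting point
    eta = 1.0 / l             # safe step: lambda_max(K) <= trace(K) = l
    for _ in range(iters):
        a = project_capped_simplex(a - eta * (K @ a), C)
    return a

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
K = gaussian_gram(X, c=2.0)
alpha = one_class_svm_dual(K, nu=0.2)

# SVs have alpha_i > 0; outliers sit at the upper bound C = 1/(nu*l).
C = 1.0 / (0.2 * 100)
print("fraction of SVs:", float((alpha > 1e-6).mean()))
print("fraction at bound:", float((alpha > C - 1e-6).mean()))
```

Proposition 3 says these two fractions should bracket ν = 0.2 (up to convergence tolerance). The offset ρ can then be recovered as (Kα)_i for any i with 0 < α_i < C, giving the decision function f(x) = Σ_j α_j k(x_j, x) − ρ.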
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998) and take a look at the dual objective function that we minimize:

\[
\sum_{i,j} \alpha_i \alpha_j \, k(x_i, x_j)
  = \sum_{i,j} \alpha_i \alpha_j \, \bigl( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \bigr)
  = \sum_{i,j} \alpha_i \alpha_j \, \bigl( k(x_i, \cdot) \cdot (P^{*}P \, k)(x_j, \cdot) \bigr)
\]
\[
  = \sum_{i,j} \alpha_i \alpha_j \, \bigl( (P k)(x_i, \cdot) \cdot (P k)(x_j, \cdot) \bigr)
  = \Bigl( \Bigl( P \sum_i \alpha_i k(x_i, \cdot) \Bigr) \cdot \Bigl( P \sum_j \alpha_j k(x_j, \cdot) \Bigr) \Bigr)
  = \| P f \|^2,
\]
using f(x) = Σ_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^{−‖Pf‖²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
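The reason a gaussian kernel always permits separation from the origin is that all Gram entries k(x_i, x_j) = exp(−‖x_i − x_j‖²/c) are strictly positive, so taking, for instance, w = Φ(x_1) gives (w · Φ(x_i)) = k(x_1, x_i) > 0 for every i. A quick numeric check (data and kernel width are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                      # arbitrary data
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 10.0)                             # Gaussian Gram matrix

# All pairwise kernel values are strictly positive, and k(x, x) = 1:
assert np.all(K > 0) and np.allclose(np.diag(K), 1.0)

# With w = Phi(x_1), every mapped point has (w . Phi(x_i)) = K[0, i] > 0,
# so the hyperplane {z : (w . z) = rho} with rho = K[0].min() / 2 > 0
# separates the mapped data from the origin.
print(K[0].min() > 0)  # -> True
```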
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined and the one of Vapnik and Chapelle (2000) will provide a solid foundation for this formidable task. This, along with algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply

\[
\hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), \tag{A.1}
\]

where B(x, ε) is the l₂(X) ball of radius ε centered at x and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of equation A.1 with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), Ĉ_plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ_plug-in, they analyzed Ĉ^β_plug-in = {x : p̂_ℓ(x) > β_ℓ}, where (β_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to β, but this connection cannot be readily exploited by this type of estimator.
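The union-of-balls estimator A.1 is simple to implement: a point belongs to the support estimate iff it lies within ε_ℓ of some training point. A sketch (the radius here is a fixed illustrative value; the theory requires the sequence (ε_ℓ)_ℓ to decrease at a suitable rate as ℓ grows):

```python
import numpy as np

def in_support(x, sample, eps):
    # Membership in the estimate: union of l2 balls B(x_i, eps) over the sample.
    d = np.sqrt(((sample - x) ** 2).sum(axis=1))
    return bool((d <= eps).any())

rng = np.random.default_rng(1)
sample = rng.uniform(0.0, 1.0, size=(200, 2))  # draws from Unif([0,1]^2)

print(in_support(np.array([0.5, 0.5]), sample, eps=0.15))  # center of the square: covered
print(in_support(np.array([3.0, 3.0]), sample, eps=0.15))  # -> False
```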
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).

Define the excess mass over C at level α as

\[
E_{\mathcal{C}}(\alpha) = \sup \{ H_\alpha(C) : C \in \mathcal{C} \},
\]

where H_α(·) = (P − αλ)(·) and again λ denotes Lebesgue measure. Any set Γ_C(α) ∈ C such that

\[
E_{\mathcal{C}}(\alpha) = H_\alpha(\Gamma_{\mathcal{C}}(\alpha))
\]

is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E^ℓ_C(α) and Γ^ℓ_C(α). In other words, his estimator is

\[
\Gamma^{\ell}_{\mathcal{C}}(\alpha) = \arg\max \{ (P_\ell - \alpha\lambda)(C) : C \in \mathcal{C} \},
\]

where the max is not necessarily unique. Now, while Γ^ℓ_C(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, Γ^ℓ_C(α) is more closely related to the determination of density contour clusters at level α:

\[
c_p(\alpha) = \{ x : p(x) \geq \alpha \}.
\]
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p and define the differential entropy of X by

\[
h(X) = - \int_S p(x) \log p(x) \, dx.
\]

For ε > 0 and ℓ ∈ N, define the typical set A_ε^{(ℓ)} with respect to p by

\[
A_\epsilon^{(\ell)} = \Bigl\{ (x_1, \ldots, x_\ell) \in S^\ell :
  \Bigl| -\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X) \Bigr| \le \epsilon \Bigr\},
\]

where p(x₁, …, x_ℓ) = ∏_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

\[
\mathrm{vol}\, A_\epsilon^{(\ell)} \doteq \mathrm{vol}\, C_\ell(1 - \delta) \doteq 2^{\ell h}.
\]

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
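This interpretation can be checked numerically for a standard gaussian, whose differential entropy in bits is h = ½ log₂(2πe): the "side length" 2^h ≈ 4.13 is indeed comparable to the length 2 · 1.96 ≈ 3.92 of the smallest interval containing 95% of the mass. A small sketch:

```python
import math

# Differential entropy (base 2) of N(0, 1): h = 0.5 * log2(2 * pi * e)
h = 0.5 * math.log2(2 * math.pi * math.e)
side = 2.0 ** h                 # the Cover-Thomas "side length" 2^h

interval_95 = 2 * 1.959964      # length of the central 95% interval of N(0, 1)
print(round(side, 2), round(interval_95, 2))  # -> 4.13 3.92
```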
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore, the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ and a separating hyperplane (w, 0).

Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x'_o := x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ'_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α' = α is still feasible, as is the primal solution (w', ξ', ρ'). Here, we use ξ'_i = (1 + δ · α_o)ξ_i for i ≠ o, w' = (1 + δ · α_o)w, and ρ' as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α'_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.

Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).

Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional, but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ norm over a finite sample X = (x₁, …, x_ℓ) for the pseudonorm on F:

\[
\| f \|_{l_\infty(X)} = \max_{x \in X} | f(x) |. \tag{B.2}
\]

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose X is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ X^ℓ} N(ε, F, l∞(X)) (the maximum exists by definition of the covering number and compactness of X).
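For a finite function class, a covering number under the l∞ pseudometric of equation B.2 can be upper-bounded constructively by a greedy net: repeatedly pick an uncovered function as a center and discard everything within ε of it. The toy class below (scaled linear functions evaluated on a fixed sample) is purely illustrative:

```python
import numpy as np

def greedy_cover_size(values, eps):
    # values[i, j] = f_i(x_j): each row is one function restricted to the
    # sample X; distance is the l_infinity pseudometric max_j |f(x_j) - g(x_j)|.
    # Greedily chosen centers form an eps-cover, so the count returned
    # upper-bounds the covering number N(eps, F, l_inf(X)).
    uncovered = list(range(len(values)))
    centers = []
    while uncovered:
        c = uncovered[0]
        centers.append(c)
        uncovered = [i for i in uncovered
                     if np.abs(values[i] - values[c]).max() > eps]
    return len(centers)

# Toy class: f_t(x) = t * x on a fixed sample, t in {0, 0.1, ..., 1}.
x = np.linspace(0.0, 1.0, 50)
F = np.array([t * x for t in np.linspace(0.0, 1.0, 11)])
print(greedy_cover_size(F, eps=0.25))  # -> 4
```

Here the l∞ distance between f_t and f_s on the sample is |t − s| (the sample includes x = 1), so four centers suffice at ε = 0.25.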
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).

Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x₁, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

\[
P \Bigl\{ x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma \Bigr\}
  \le \epsilon(\ell, k, \delta) = \frac{2}{\ell} \Bigl( k + \log \frac{\ell}{\delta} \Bigr),
\]

where k = ⌈log N(γ, F, 2ℓ)⌉.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i∈[ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.

Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with support supp(f) countable; that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

\[
f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x) g(x).
\]

The 1-norm on L(X) is defined by ‖f‖₁ = Σ_{x∈supp(f)} f(x), and let L_B(X) := {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

\[
\delta_x(y) =
\begin{cases}
1 & \text{if } y = x; \\
0 & \text{otherwise.}
\end{cases}
\]

For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by

\[
g_f(y) = g^{X,\theta}_f(y) = \sum_{x \in X} d(x, f, \theta) \, \delta_x(y).
\]
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g^{X,θ}_f ∈ L_B(X) (i.e., Σ_{x∈X} d(x, f, θ) ≤ B),

\[
P \bigl\{ x : f(x) < \theta - 2\gamma \bigr\}
  \le \frac{2}{\ell} \Bigl( k + \log \frac{\ell}{\delta} \Bigr), \tag{B.3}
\]

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.

The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997); their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.

In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
1458 Bernhard Scholkopf et al
6 9 2 8 1 8 8 6 5 3
2 3 8 7 0 3 0 8 2 7
Figure 4 Subset of 20 examples randomly drawn from the USPS test set withclass labels
captures only half of all zeros in the training set As shown in Figure 3 thisleads to zero false positives (although the learning machine has not seenany non-0rsquos during training it correctly identies all non-0rsquos as such) whilestill recognizing 44 of the digits 0 in the test set Higher recognition ratescan be achieved using smaller values of ordm for ordm D 5 we get 91 correctrecognition of digits 0 in the test set with a fairly moderate false-positiverate of 7
Although this experiment leads to encouraging results it did not re-ally address the actual task the algorithm was designed for Therefore wenext focused on a problem of novelty detection Again we used the USPSset however this time we trained the algorithm on the test set and usedit to identify outliers It is folklore in the community that the USPS testset (see Figure 4) contains a number of patterns that are hard or impossi-ble to classify due to segmentation errors or mislabeling (Vapnik 1995) Inthis experiment we augmented the input patterns by 10 extra dimensionscorresponding to the class labels of the digits The rationale is that if we dis-regarded the labels there would be no hope to identify mislabeled patternsas outliers With the labels the algorithm has the chance of identifying bothunusual patterns and usual patterns with unusual labels Figure 5 showsthe 20 worst outliers for the USPS test set respectively Note that the algo-rithm indeed extracts patterns that are very hard to assign to their respectiveclasses In the experiment we used the same kernel width as above and aordm value of 5 The latter was chosen roughly to reect our expectations asto how many ldquobadrdquo patterns there are in the test set Most good learningalgorithms achieve error rates of 3 to 5 on the USPS benchmark (for a listof results cf Vapnik 1995) Table 1 shows that ordm lets us control the fractionof outliers
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets: the dependency of training time on the sample size is at most quadratic.
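The way such scaling exponents are read off as slopes in log-log coordinates can be sketched as follows (synthetic timings under an assumed power law, not the measured USPS times):

```python
import numpy as np

# Synthetic stand-in for measured training times: t(ell) = c * ell**p.
# In base-2 log-log coordinates, the slope of the fitted line is the exponent p.
ells = np.array([2**7, 2**8, 2**9, 2**10, 2**11])
p_true = 2.5                      # e.g. the large-nu regime reported in the text
times = 3e-7 * ells**p_true

# Fit a line to (log2 ell, log2 t); its slope estimates the scaling exponent.
slope, intercept = np.polyfit(np.log2(ells), np.log2(times), 1)
print(round(slope, 3))
```

Applied to real measurements, the fitted slope plays the role of the exponents quoted for Figure 6 (about 1 for small ν, about 2.5 for large ν).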
In addition to the experiments reported above, the algorithm has since
Estimating the Support of a High-Dimensional Distribution 1459
Class label and SVM output for the 20 outliers: 9/513, 1/507, 0/458, 1/377, 7/282, 2/216, 3/200, 9/186, 5/179, 0/162, 3/153, 6/143, 6/128, 0/123, 7/117, 5/93, 0/78, 7/58, 6/52, 3/48.
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience, in units of 10^{-5}) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled.
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν; USPS Test Set (Size ℓ = 2007).

 ν [%]   Fraction of OLs [%]   Fraction of SVs [%]   Training Time [CPU sec]
   1            0.0                  10.0                     36
   2            0.0                  10.0                     39
   3            0.1                  10.0                     31
   4            0.6                  10.1                     40
   5            1.4                  10.6                     36
   6            1.8                  11.2                     33
   7            2.6                  11.5                     42
   8            4.1                  12.0                     53
   9            5.4                  12.9                     76
  10            6.2                  13.7                     65
  20           16.9                  22.6                    193
  30           27.5                  31.8                    269
  40           37.1                  41.7                    685
  50           47.4                  51.2                   1284
  60           58.2                  61.0                   1150
  70           68.3                  70.7                   1512
  80           78.5                  80.5                   2206
  90           89.4                  90.1                   2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. Proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The system used in the outlier detection experiments (ν = 5%) is shown in boldface.
1460 Bernhard Scholkopf et al
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz; training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:

\begin{aligned}
\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)
&= \sum_{i,j} \alpha_i \alpha_j \bigl(k(x_i, \cdot) \cdot \delta_{x_j}(\cdot)\bigr) \\
&= \sum_{i,j} \alpha_i \alpha_j \bigl(k(x_i, \cdot) \cdot (P^* P\, k)(x_j, \cdot)\bigr) \\
&= \sum_{i,j} \alpha_i \alpha_j \bigl((P k)(x_i, \cdot) \cdot (P k)(x_j, \cdot)\bigr) \\
&= \biggl(\Bigl(P \sum_i \alpha_i k(x_i, \cdot)\Bigr) \cdot \Bigl(P \sum_j \alpha_j k(x_j, \cdot)\Bigr)\biggr) \\
&= \|P f\|^2,
\end{aligned}
using f(x) = \sum_i \alpha_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) \propto e^{-\|Pf\|^2} on the function space.
Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
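For a kernel with an explicit finite-dimensional feature map, the identity "dual objective = squared feature-space norm" derived above can be verified numerically. The sketch below uses a homogeneous quadratic kernel (an illustrative choice; the experiments in this article use a gaussian kernel, whose feature map is not finite-dimensional):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))       # five 2-d points
alpha = rng.normal(size=5)        # expansion coefficients

# Homogeneous quadratic kernel k(x, y) = (x . y)^2 with explicit feature map
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so that k(x, y) = phi(x) . phi(y).
def k(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

# Dual objective: the quadratic form alpha^T K alpha ...
K = np.array([[k(xi, xj) for xj in X] for xi in X])
dual = float(alpha @ K @ alpha)

# ... equals the squared feature-space norm of w = sum_i alpha_i phi(x_i),
# i.e. the squared norm of f = sum_i alpha_i k(x_i, .).
w = sum(a * phi(x) for a, x in zip(alpha, X))
norm_sq = float(np.dot(w, w))
print(np.isclose(dual, norm_sq))
```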
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space, and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
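The separability claim for gaussian kernels rests on two facts that are easy to check numerically: k(x, x) = 1, so all mapped points lie on the unit sphere in feature space, and k(x, y) > 0, so they all lie within a common cap of that sphere, which excludes the origin. A quick sketch with assumed random data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))     # arbitrary data, assumed for illustration

# Gaussian kernel k(x, y) = exp(-||x - y||^2 / c), with an assumed width c.
c = 10.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / c)

# All mapped points lie on the unit sphere in feature space ...
on_sphere = bool(np.allclose(np.diag(K), 1.0))
# ... and all pairwise inner products are strictly positive, so the images
# lie in one cap of the sphere and can be separated from the origin.
same_cap = bool((K > 0).all())
print(on_sphere, same_cap)
```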
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined, and the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R^2 with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply

\hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), \tag{A.1}

where B(x, \epsilon) is the l_2(X) ball of radius \epsilon centered at x and (\epsilon_\ell)_\ell is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between C(1) and \hat{C}_\ell. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), \hat{C}_{\text{plug-in}} = \{x : \hat{p}_\ell(x) > 0\}, where \hat{p}_\ell is a kernel density estimator. In order to avoid problems with \hat{C}_{\text{plug-in}}, they analyzed \hat{C}^{\beta}_{\text{plug-in}} = \{x : \hat{p}_\ell(x) > \beta_\ell\}, where (\beta_\ell)_\ell is an appropriately chosen sequence. Clearly, for a given distribution, \alpha is related to \beta, but this connection cannot be readily exploited by this type of estimator.
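For concreteness, the nonparametric estimator of equation A.1 can be sketched as a membership test (a minimal sketch; a fixed radius ε stands in for the decreasing sequence (ε_ℓ)_ℓ, and the uniform sample is an assumption of the example):

```python
import numpy as np

# A query point belongs to the support estimate (A.1) iff it lies within
# eps of some training point, i.e. inside the union of balls B(x_i, eps).
def in_support_estimate(query, train, eps):
    d = np.linalg.norm(train - query, axis=1)
    return bool((d <= eps).any())

rng = np.random.default_rng(3)
train = rng.uniform(0.0, 1.0, size=(200, 2))   # sample from support [0, 1]^2

# A point well inside the support is covered; a far-away point is not.
print(in_support_estimate(np.array([0.5, 0.5]), train, 0.2))
print(in_support_estimate(np.array([5.0, 5.0]), train, 0.2))
```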
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) or the center of C(1). (See also Korostelev & Tsybakov, 1993, chap. 8.)
A.2 Estimating High Probability Regions (\alpha \neq 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized \alpha-clusters" related to C(\alpha).

Define the excess mass over \mathcal{C} at level \alpha as

E_{\mathcal{C}}(\alpha) = \sup\{H_\alpha(C) : C \in \mathcal{C}\},

where H_\alpha(\cdot) = (P - \alpha\lambda)(\cdot), and again \lambda denotes Lebesgue measure. Any set C_{\mathcal{C}}(\alpha) \in \mathcal{C} such that

E_{\mathcal{C}}(\alpha) = H_\alpha(C_{\mathcal{C}}(\alpha))

is called a generalized \alpha-cluster in \mathcal{C}. Replace P by P_\ell in these definitions to obtain their empirical counterparts E_{\ell,\mathcal{C}}(\alpha) and C_{\ell,\mathcal{C}}(\alpha). In other words, his estimator is

C_{\ell,\mathcal{C}}(\alpha) = \arg\max\{(P_\ell - \alpha\lambda)(C) : C \in \mathcal{C}\},

where the max is not necessarily unique. Now while C_{\ell,\mathcal{C}}(\alpha) is clearly different from C_\ell(\alpha), it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, C_{\ell,\mathcal{C}}(\alpha) is more closely related to the determination of density contour clusters at level \alpha:

c_p(\alpha) = \{x : p(x) \geq \alpha\}.
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(\alpha). They too made use of VC classes, but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = -\int_S p(x) \log_2 p(x)\, dx. For \epsilon > 0 and \ell \in \mathbb{N}, define the typical set A^{(\ell)}_\epsilon with respect to p by

A^{(\ell)}_\epsilon = \bigl\{(x_1, \ldots, x_\ell) \in S^\ell : \bigl| -\tfrac{1}{\ell} \log_2 p(x_1, \ldots, x_\ell) - h(X) \bigr| \leq \epsilon \bigr\},

where p(x_1, \ldots, x_\ell) = \prod_{i=1}^{\ell} p(x_i). If (a_\ell)_\ell and (b_\ell)_\ell are sequences, the notation a_\ell \doteq b_\ell means \lim_{\ell \to \infty} \tfrac{1}{\ell} \log_2 \tfrac{a_\ell}{b_\ell} = 0. Cover and Thomas (1991) show that for all \epsilon, \delta < \tfrac{1}{2},

\mathrm{vol}\, A^{(\ell)}_\epsilon \doteq \mathrm{vol}\, C_\ell(1 - \delta) \doteq 2^{\ell h}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{\ell h}. This is an \ell-dimensional volume, so the corresponding side length is (2^{\ell h})^{1/\ell} = 2^h. This provides an interpretation of differential entropy" (p. 227).
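The 2^h interpretation can be checked in the simplest case, a uniform density on [0, a], where h(X) = \log_2 a and 2^h recovers exactly the length a of the support (a small numeric sketch with an assumed value of a):

```python
import numpy as np

# For X uniform on [0, a], the density is p = 1/a on the support, so the
# differential entropy (in bits) is h(X) = log2(a), and 2**h recovers the
# volume (length) a of the support, matching the 2^h side-length reading.
a = 8.0
xs = np.linspace(0.0, a, 100001)
p = np.full_like(xs, 1.0 / a)

# Riemann-sum version of h = -integral of p log2 p over the support
# (exact here, since the integrand is constant).
h = float(-np.sum(p[:-1] * np.log2(p[:-1]) * np.diff(xs)))
print(2.0 ** h)
```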
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).
Moreover, separability implies that there actually exists some \rho > 0 and w \in F such that (w \cdot x_i) \geq \rho for i \in [\ell] (by rescaling w, this can be seen to work for arbitrarily large \rho). The distance of the hyperplane \{z \in F : (w \cdot z) = \rho\} to the origin is \rho / \|w\|. Therefore, the optimal hyperplane is obtained by minimizing \|w\| subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).
Next, observe that (-w, \rho) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, \rho). This provides an optimal separation of the two sets, with distance 2\rho, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i (w \cdot x_i) \geq \rho (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w \cdot y_i x_i) \geq \rho for i \in [\ell].
Proof Sketch (Proposition 3). When changing \rho, the term \sum_i \xi_i in equation 3.4 will change proportionally to the number of points that have a nonzero \xi_i (the outliers), plus, when changing \rho in the positive direction, the number of points that are just about to get a nonzero \xi_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, \xi_o > 0; hence, by the KKT conditions (Bertsekas, 1995), \alpha_o = 1/(\nu\ell). Transforming it into x'_o = x_o + \delta \cdot w, where |\delta| < \xi_o / \|w\|, leads to a slack that is still nonzero, that is, \xi'_o > 0; hence, we still have \alpha_o = 1/(\nu\ell). Therefore, \alpha' = \alpha is still feasible, as is the primal solution (w', \xi', \rho'). Here, we use \xi'_i = (1 + \delta \cdot \alpha_o)\xi_i for i \neq o, w' = (1 + \delta \cdot \alpha_o)w, and \rho' as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still \alpha'_o = 1/(\nu\ell). Thus (Bertsekas, 1995), \alpha is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class \mathcal{F} of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and \epsilon > 0. A set U \subseteq X is an \epsilon-cover for A if, for every a \in A, there exists u \in U such that d(a, u) \leq \epsilon. The \epsilon-covering number of A, \mathcal{N}(\epsilon, A, d), is the minimal cardinality of an \epsilon-cover for A (if there is no such finite cover, then it is defined to be \infty).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional, but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l_\infty norm over a finite sample X = (x_1, \ldots, x_\ell) for the pseudonorm on \mathcal{F}:

\|f\|_{l_\infty^X} = \max_{x \in X} |f(x)|. \tag{B.2}

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose \mathcal{X} is a compact subset of X. Let \mathcal{N}(\epsilon, \mathcal{F}, \ell) = \max_{X \in \mathcal{X}^\ell} \mathcal{N}(\epsilon, \mathcal{F}, l_\infty^X) (the maximum exists by definition of the covering number and compactness of \mathcal{X}).
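A greedy construction illustrates Definition 3: its output is an ε-cover of A, so its size upper-bounds N(ε, A, d). The sketch below works in the metric space (R, |·|) for simplicity; the point set and radius are illustrative assumptions.

```python
# Greedy eps-cover: repeatedly pick an uncovered point as a new center until
# every point lies within eps of some center. The cover produced may not be
# minimal, so its size only upper-bounds the covering number N(eps, A, d).
def greedy_cover(points, eps):
    centers = []
    uncovered = list(points)
    while uncovered:
        c = uncovered[0]
        centers.append(c)
        uncovered = [p for p in uncovered if abs(p - c) > eps]
    return centers

A = [0.25 * i for i in range(21)]   # 21 points spaced 0.25 apart in [0, 5]
eps = 0.5
cover = greedy_cover(A, eps)

# Verify the cover property: every point of A is within eps of some center.
covered = all(any(abs(p - c) <= eps for c in cover) for p in A)
print(len(cover), covered)
```

Here each greedily chosen center covers itself and the two points to its right, so seven centers suffice for the 21 points.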
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation \lceil t \rceil denotes the smallest integer \geq t. Several of the results stated claim that some event occurs with probability 1 - \delta. In all cases, 0 < \delta < 1 is understood, and \delta is presumed in fact to be small.
Theorem 2. Consider any distribution P on \mathcal{X} and a sturdy real-valued function class \mathcal{F} on \mathcal{X}. Suppose x_1, \ldots, x_\ell are generated i.i.d. from P. Then with probability 1 - \delta over such an \ell-sample, for any f \in \mathcal{F} and for any \gamma > 0,

P\Bigl\{x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma\Bigr\} \leq \epsilon(\ell, k, \delta) = \frac{2}{\ell}\Bigl(k + \log\frac{\ell}{\delta}\Bigr),

where k = \lceil \log \mathcal{N}(\gamma, \mathcal{F}, 2\ell) \rceil.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum \min_{i \in [\ell]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set \{f(x_i) : x_i \in X\}, at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable \xi_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.
Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with support supp(f) countable; that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g \in L(X) by

f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x) g(x).

The 1-norm on L(X) is defined by \|f\|_1 = \sum_{x \in \mathrm{supp}(f)} f(x), and let L_B(X) = \{f \in L(X) : \|f\|_1 \leq B\}. Now we define an embedding of X into the inner product space X \times L(X) via \tau : X \to X \times L(X), \tau : x \mapsto (x, \delta_x), where \delta_x \in L(X) is defined by

\delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases}
For a function f \in \mathcal{F}, a set of training examples X, and \theta \in \mathbb{R}, we define the function g_f \in L(X) by

g_f(y) = g_f^{X,\theta}(y) = \sum_{x \in X} d(x, f, \theta)\, \delta_x(y).
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space \mathcal{X}, and a sturdy class of real-valued functions \mathcal{F}. Then with probability 1 - \delta over randomly drawn training sequences X of size \ell, for all \gamma > 0 and any f \in \mathcal{F} and any \theta such that g_f = g_f^{X,\theta} \in L_B(\mathcal{X}) (i.e., \sum_{x \in X} d(x, f, \theta) \leq B),

P\bigl\{x : f(x) < \theta - 2\gamma\bigr\} \leq \frac{2}{\ell}\Bigl(k + \log\frac{\ell}{\delta}\Bigr), \tag{B.3}

where k = \bigl\lceil \log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell) + \log \mathcal{N}(\gamma/2, L_B(\mathcal{X}), 2\ell) \bigr\rceil.
The assumption on P can in fact be weakened to require only that there is no point x \in \mathcal{X} satisfying f(x) < \theta - 2\gamma that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than \theta - 2\gamma, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2\gamma toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to \gamma) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when \gamma tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound \log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell) and a result of Shawe-Taylor and Cristianini (2000) to bound \log \mathcal{N}(\gamma/2, L_B(\mathcal{X}), 2\ell). We apply theorem 3, stratifying over possible values of B. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functional of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Diettrich, & P. Anuzis (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
Estimating the Support of a High-Dimensional Distribution 1459
9shy 513 1shy 507 0shy 458 1shy 377 7shy 282 2shy 216 3shy 200 9shy 186 5shy 179 0shy 162
3shy 153 6shy 143 6shy 128 0shy 123 7shy 117 5shy 93 0shy 78 7shy 58 6shy 52 3shy 48
Figure 5 Outliers identied by the proposed algorithm ranked by the negativeoutput of the SVM (the argument of equation 310) The outputs (for conveniencein units of 10iexcl5) are written underneath each image in italics the (alleged) classlabels are given in boldface Note that most of the examples are ldquodifcultrdquo inthat they are either atypical or even mislabeled
Table 1 Experimental Results for Various Values of the Outlier Control Constantordm USPS Test Set Size ` D 2007
Training Timeordm Fraction of OLs Fraction of SVs (CPU sec)
1 00 100 362 00 100 393 01 100 314 06 101 405 14 106 366 18 112 337 26 115 428 41 120 539 54 129 76
10 62 137 6520 169 226 19330 275 318 26940 371 417 68550 474 512 128460 582 610 115070 683 707 151280 785 805 220690 894 901 2349
Notes ordm bounds the fractions of outliers and support vec-tors from above and below respectively (cf Proposition 3)As we are not in the asymptotic regime there is some slackin the bounds nevertheless ordm can be used to control theabove fractions Note moreover that training times (CPUtime in seconds on a Pentium II running at 450 MHz) in-crease as ordm approaches 1 This is related to the fact that al-most all Lagrange multipliers will be at the upper boundin that case (cf section 4) The system used in the outlierdetection experiments is shown in boldface
1460 Bernhard Scholkopf et al
Figure 6: Training times versus data set sizes ℓ (both axes depict logs at base 2; CPU time in seconds on a Pentium II running at 450 MHz; training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be directly read off from the slope of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
The method has also been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).
7 Discussion
One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.
Mathematically speaking, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore, we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel.
Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is the Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:

\sum_{i,j} \alpha_i \alpha_j \, k(x_i, x_j)
  = \sum_{i,j} \alpha_i \alpha_j \left( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \right)
  = \sum_{i,j} \alpha_i \alpha_j \left( k(x_i, \cdot) \cdot (P^* P \, k)(x_j, \cdot) \right)
  = \sum_{i,j} \alpha_i \alpha_j \left( (P k)(x_i, \cdot) \cdot (P k)(x_j, \cdot) \right)
  = \left( \Big( P \sum_i \alpha_i k(x_i, \cdot) \Big) \cdot \Big( P \sum_j \alpha_j k(x_j, \cdot) \Big) \right)
  = \| P f \|^2,
using f(x) = \sum_i \alpha_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ~ e^{-\|Pf\|^2} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
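To make the smoothness interpretation concrete for the gaussian kernel of width σ, one can quote the regularizer derived in Smola, Schölkopf, and Müller (1998); the following form is cited here as an illustration and is not derived in this article:

\|Pf\|^2 = \int dx \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n! \, 2^n} \left( \hat{O}^n f(x) \right)^2, \qquad \hat{O}^{2n} = \Delta^n, \quad \hat{O}^{2n+1} = \nabla \Delta^n,

that is, a penalty on derivatives of all orders, so the estimated function is smooth in a correspondingly strong sense.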
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
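The geometric reason a gaussian kernel always permits separation from the origin can be checked directly: the Gram matrix has unit diagonal and strictly positive entries, so all mapped points lie on the unit sphere within a single orthant of feature space. A small numerical sketch (illustrative only; the data and the width are arbitrary choices):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(40, 3)                      # any data set
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                     # gaussian kernel Gram matrix

# every mapped point has unit length in feature space ...
unit_norms = bool(np.allclose(np.diag(K), 1.0))
# ... and all pairwise inner products are positive, so the images lie in
# one orthant of the unit sphere and are separable from the origin
positive_angles = bool((K > 0).all())
print(unit_norms, positive_angles)
```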
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined and the one of Vapnik and Chapelle (2000) will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R^2 with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply

\hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell),   (A.1)

where B(x, \epsilon) is the l_2(X) ball of radius \epsilon centered at x and (\epsilon_\ell)_\ell is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of equation A.1 with respect to the symmetric difference between C(1) and \hat{C}_\ell. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1), \hat{C}_{plug-in} = \{x : \hat{p}_\ell(x) > 0\}, where \hat{p}_\ell is a kernel density estimator. In order to avoid problems with \hat{C}_{plug-in}, they analyzed \hat{C}^{\beta}_{plug-in} = \{x : \hat{p}_\ell(x) > \beta_\ell\}, where (\beta_\ell)_\ell is an appropriately chosen sequence. Clearly, for a given distribution, \alpha is related to \beta, but this connection cannot be readily exploited by this type of estimator.
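The estimator in equation A.1 is straightforward to implement; a minimal sketch (the helper name is ours, and a fixed ε stands in for the sequence ε_ℓ):

```python
import numpy as np

def in_support_estimate(x, sample, eps):
    """Union-of-balls estimator (A.1): x belongs to the support estimate
    iff it lies within l2 distance eps of some training point."""
    return bool((np.linalg.norm(sample - x, axis=1) <= eps).any())

sample = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
inside = in_support_estimate(np.array([0.1, 0.0]), sample, eps=0.5)
outside = in_support_estimate(np.array([3.0, 3.0]), sample, eps=0.5)
print(inside, outside)
```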
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1) (see also Korostelev & Tsybakov, 1993, chap. 8).
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).

Define the excess mass over \mathcal{C} at level α as

E_{\mathcal{C}}(\alpha) = \sup \{ H_\alpha(C) : C \in \mathcal{C} \},

where H_\alpha(\cdot) = (P - \alpha \lambda)(\cdot), and again \lambda denotes Lebesgue measure. Any set C_{\mathcal{C}}(\alpha) \in \mathcal{C} such that

E_{\mathcal{C}}(\alpha) = H_\alpha( C_{\mathcal{C}}(\alpha) )

is called a generalized α-cluster in \mathcal{C}. Replace P by the empirical measure P_\ell in these definitions to obtain their empirical counterparts \hat{E}_{\mathcal{C}}(\alpha) and \hat{C}_{\mathcal{C}}(\alpha). In other words, his estimator is

\hat{C}_{\mathcal{C}}(\alpha) = \arg\max \{ (P_\ell - \alpha \lambda)(C) : C \in \mathcal{C} \},

where the argmax is not necessarily unique. Now, while \hat{C}_{\mathcal{C}}(\alpha) is clearly different from \hat{C}_\ell(\alpha), it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, \hat{C}_{\mathcal{C}}(\alpha) is more closely related to the determination of density contour clusters at level α:

c_p(\alpha) = \{ x : p(x) \geq \alpha \}.
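For the class 𝒞 of intervals on the real line, the empirical generalized α-cluster can be computed exactly by scanning pairs of sample points, since an optimal interval can always be chosen with endpoints at sample points. A brute-force sketch (the function name is ours):

```python
import numpy as np

def empirical_alpha_cluster(sample, alpha):
    """Maximize (P_ell - alpha * lambda)([a, b]) over intervals [a, b]
    with endpoints at sample points; lambda is Lebesgue measure."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = len(xs)
    best, best_iv = -np.inf, None
    for i in range(n):
        for j in range(i, n):
            # empirical mass of [xs[i], xs[j]] minus alpha times its length
            excess = (j - i + 1) / n - alpha * (xs[j] - xs[i])
            if excess > best:
                best, best_iv = excess, (xs[i], xs[j])
    return best, best_iv

best, iv = empirical_alpha_cluster([0.0, 0.01, 0.02, 5.0], alpha=1.0)
print(best, iv)  # the tight cluster {0, 0.01, 0.02} wins over the full range
```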
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(\alpha). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = -\int_S p(x) \log p(x)\, dx. For \epsilon > 0 and \ell \in \mathbb{N}, define the typical set A_\epsilon^{(\ell)} with respect to p by

A_\epsilon^{(\ell)} = \{ (x_1, \ldots, x_\ell) \in S^\ell : | -\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X) | \leq \epsilon \},

where p(x_1, \ldots, x_\ell) = \prod_{i=1}^{\ell} p(x_i). If (a_\ell)_\ell and (b_\ell)_\ell are sequences, the notation a_\ell \doteq b_\ell means \lim_{\ell \to \infty} \tfrac{1}{\ell} \log \tfrac{a_\ell}{b_\ell} = 0. Cover and Thomas (1991) show that for all \epsilon, \delta < \tfrac{1}{2},

\mathrm{vol}\, A_\epsilon^{(\ell)} \doteq \mathrm{vol}\, \hat{C}_\ell(1 - \delta) \doteq 2^{\ell h}.

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{\ell h}. This is an \ell-dimensional volume, so the corresponding side length is (2^{\ell h})^{1/\ell} = 2^h. This provides an interpretation of differential entropy" (p. 227).
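The typical-set statement rests on the asymptotic equipartition property: -(1/ℓ) log p(x_1, …, x_ℓ) converges to h(X). A quick Monte Carlo check for a standard gaussian (natural logarithms, so h is in nats; illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
ell, sigma = 100_000, 1.0
x = rng.normal(0.0, sigma, ell)

# log-density of each draw under N(0, sigma^2)
logp = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
empirical = -logp.mean()                       # -(1/ell) log p(x_1, ..., x_ell)
h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # differential entropy in nats
print(empirical, h)
```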
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some \rho > 0 and w \in F such that (w \cdot x_i) \geq \rho for i \in [\ell] (by rescaling w, this can be seen to work for arbitrarily large \rho). The distance of the hyperplane \{z \in F : (w \cdot z) = \rho\} to the origin is \rho / \|w\|. Therefore, the optimal hyperplane is obtained by minimizing \|w\| subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next observe that (-w, \rho) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, \rho). This provides an optimal separation of the two sets, with distance 2\rho, and a separating hyperplane (w, 0).

Ad (ii). By assumption, w is the shortest vector satisfying y_i (w \cdot x_i) \geq \rho (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w \cdot y_i x_i) \geq \rho for i \in [\ell].
Proof Sketch (Proposition 3). When changing \rho, the term \sum_i \xi_i in equation 3.4 will change proportionally to the number of points that have a nonzero \xi_i (the outliers), plus, when changing \rho in the positive direction, the number of points that are just about to get a nonzero \xi_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, \xi_o > 0; hence, by the KKT conditions (Bertsekas, 1995), \alpha_o = 1/(\nu\ell). Transforming it into x'_o = x_o + \delta \cdot w, where |\delta| < \xi_o / \|w\|, leads to a slack that is still nonzero, that is, \xi'_o > 0; hence we still have \alpha_o = 1/(\nu\ell). Therefore, \alpha' = \alpha is still feasible, as is the primal solution (w', \xi', \rho'). Here, we use \xi'_i = (1 + \delta \cdot \alpha_o) \xi_i for i \neq o, w' = (1 + \delta \cdot \alpha_o) w, and \rho' as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still \alpha'_o = 1/(\nu\ell). Thus (Bertsekas, 1995), \alpha is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.

Definition 3. Let (X, d) be a pseudometric space, let A be a subset of X, and let \epsilon > 0. A set U \subseteq X is an \epsilon-cover for A if, for every a \in A, there exists u \in U such that d(a, u) \leq \epsilon. The \epsilon-covering number of A, N(\epsilon, A, d), is the minimal cardinality of an \epsilon-cover for A (if there is no such finite cover, then it is defined to be \infty).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
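For a finite subset of the real line with the usual metric, the covering number of definition 3 can be computed exactly by a greedy sweep (the function name is ours; the interval-stabbing argument makes the greedy choice optimal in one dimension):

```python
def covering_number_1d(points, eps):
    """Minimal number of closed balls (intervals) of radius eps needed to
    cover the finite set `points` -- N(eps, A, d) for A in (R, |.|)."""
    pts = sorted(points)
    count, i = 0, 0
    while i < len(pts):
        count += 1
        reach = pts[i] + 2 * eps  # ball centered at pts[i] + eps
        while i < len(pts) and pts[i] <= reach:
            i += 1
    return count

print(covering_number_1d([0.0, 1.0, 2.0, 10.0], 1.0))  # -> 2
```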
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l_\infty norm over a finite sample X = (x_1, \ldots, x_\ell) for the pseudonorm on F:

\| f \|_{l_\infty^X} = \max_{x \in X} | f(x) |.   (B.2)

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose \mathcal{X} is a compact subset of X. Let N(\epsilon, F, \ell) = \max_{X \in \mathcal{X}^\ell} N(\epsilon, F, l_\infty^X) (the maximum exists by definition of the covering number and compactness of \mathcal{X}).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation \lceil t \rceil denotes the smallest integer \geq t. Several of the results stated claim that some event occurs with probability 1 - \delta. In all cases, 0 < \delta < 1 is understood, and \delta is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, \ldots, x_\ell are generated i.i.d. from P. Then with probability 1 - \delta over such an \ell-sample, for any f \in F and for any \gamma > 0,

P\{ x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma \} \leq \epsilon(\ell, k, \delta) = \frac{2}{\ell} \left( k + \log \frac{\ell}{\delta} \right),

where k = \lceil \log N(\gamma, F, 2\ell) \rceil.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum \min_{i \in [\ell]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set \{ f(x_i) : x_i \in X \}, at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.

We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable \xi_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.

Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with support supp(f) countable; that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g \in L(X) by

f \cdot g = \sum_{x \in supp(f)} f(x) g(x).

The 1-norm on L(X) is defined by \|f\|_1 = \sum_{x \in supp(f)} f(x), and we let L_B(X) = \{ f \in L(X) : \|f\|_1 \leq B \}. Now we define an embedding of X into the inner product space X \times L(X) via \tau : X \to X \times L(X), \tau : x \mapsto (x, \delta_x), where \delta_x \in L(X) is defined by

\delta_x(y) = 1 if y = x, and \delta_x(y) = 0 otherwise.
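Since elements of L(X) have countable (here, finite) support, they can be represented as sparse maps from points to values, and the inner product and 1-norm of definition 4 become finite sums. A minimal dict-based sketch (helper names are ours):

```python
def sparse_inner(f, g):
    """Inner product on L(X): sum of f(x) * g(x) over supp(f)."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    """1-norm on L(X): sum of the (nonnegative) values of f."""
    return sum(f.values())

# delta functions are the singleton maps used in the embedding tau
delta = lambda x: {x: 1.0}

f = {"a": 2.0, "b": 1.0}
print(sparse_inner(f, delta("b")))  # -> 1.0
print(one_norm(f))                  # -> 3.0
```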
For a function f \in F, a set of training examples X, and \theta \in \mathbb{R}, we define the function g_f \in L(X) by

g_f(y) = g_f^{X, \theta}(y) = \sum_{x \in X} d(x, f, \theta) \, \delta_x(y).
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 - \delta over randomly drawn training sequences X of size \ell, for all \gamma > 0 and any f \in F and any \theta such that g_f = g_f^{X, \theta} \in L_B(X) (i.e., \sum_{x \in X} d(x, f, \theta) \leq B),

P\{ x : f(x) < \theta - 2\gamma \} \leq \frac{2}{\ell} \left( k + \log \frac{\ell}{\delta} \right),   (B.3)

where k = \left\lceil \log N(\gamma/2, F, 2\ell) + \log N(\gamma/2, L_B(X), 2\ell) \right\rceil.

The assumption on P can in fact be weakened to require only that there is no point x \in X satisfying f(x) < \theta - 2\gamma that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than \theta - 2\gamma, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2\gamma toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.

The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to \gamma) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when \gamma tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound \log N(\gamma/2, F, 2\ell) and a result of Shawe-Taylor and Cristianini (2000) to bound \log N(\gamma/2, L_B(X), 2\ell). We apply theorem 3, stratifying over possible values of B. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 3799-1 and Sm 621-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters: An excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. (In Russian; German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
1460 Bernhard Scholkopf et al
Figure 6 Training times versus data set sizes ` (both axes depict logs at base 2CPU time in seconds on a Pentium II running at 450 MHz training on subsetsof the USPS test set) c D 05 cent 256 As in Table 1 it can be seen that larger valuesof ordm generally lead to longer training times (note that the plots use different y-axis ranges) However they also differ in their scaling with the sample size Theexponents can bedirectly read off from the slope of the graphs as they are plottedin log scale with equal axis spacing For small values of ordm (middot 5) the trainingtimes were approximately linear in the training set size The scaling gets worseas ordm increases For large values of ordm training times are roughly proportionalto the sample size raised to the power of 25 (right plot) The results shouldbe taken only as an indication of what is going on they were obtained usingfairly small training sets the largest being 2007 the size of the USPS test setAs a consequence they are fairly noisy and refer only to the examined regimeEncouragingly the scaling is better than the cubic one that one would expectwhen solving the optimization problem using all patterns at once (cf section 4)Moreover for small values of ordm that is those typically used in outlier detection(in Figure 5 we used ordm D 5) our algorithm was particularly efcient
been applied in several other domains such as the modeling of parameterregimes for the control of walking robots (Still amp Scholkopf in press) andcondition monitoring of jet engines (Hayton et al in press)
7 Discussion
One could view this work as an attempt to provide a new algorithm inline with Vapnikrsquos principle never to solve a problem that is more general
Estimating the Support of a High-Dimensional Distribution 1461
than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects
Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel
Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize:
\[
\sum_{ij} \alpha_i \alpha_j \, k(x_i, x_j)
= \sum_{ij} \alpha_i \alpha_j \left( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \right)
= \sum_{ij} \alpha_i \alpha_j \left( k(x_i, \cdot) \cdot (P^* P\, k)(x_j, \cdot) \right)
\]
\[
= \sum_{ij} \alpha_i \alpha_j \left( (P k)(x_i, \cdot) \cdot (P k)(x_j, \cdot) \right)
= \left( \left( P \sum_i \alpha_i k(x_i, \cdot) \right) \cdot \left( P \sum_j \alpha_j k(x_j, \cdot) \right) \right)
= \| P f \|^2,
\]
using f(x) = ∑_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^{−‖Pf‖²}
on the function space.
Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem): trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.
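The quadratic form at the heart of this derivation is straightforward to evaluate numerically. Below is a minimal NumPy sketch (data, kernel width, and coefficients are illustrative assumptions, not taken from the paper's experiments): for a Gaussian kernel, the dual objective ∑_{ij} α_i α_j k(x_i, x_j) is the quadratic form αᵀKα of the Gram matrix, the quantity interpreted above as ‖Pf‖².

```python
import numpy as np

def gaussian_gram(X, c):
    # Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / c)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / c)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))      # illustrative training points
alpha = rng.uniform(size=20)
alpha /= alpha.sum()              # dual constraint: sum_i alpha_i = 1

K = gaussian_gram(X, c=2.0)
reg = alpha @ K @ alpha           # dual objective = ||Pf||^2 in feature space
print(reg)
```

Because every Gram entry is at most 1 and α lies on the simplex, αᵀKα never exceeds 1 here; smaller values correspond to a smoother f in the ‖Pf‖ sense.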
The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.
From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic data set need not be separable from the origin by a hyperplane in input space.
The two modifications that we have incorporated dispose of these shortcomings. First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a Gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.
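For concreteness, here is a small self-contained sketch of these two ingredients. It is not the paper's SMO solver: it approximately minimizes the dual ½αᵀKα subject to 0 ≤ α_i ≤ 1/(νℓ) and ∑_i α_i = 1 with a simple Frank–Wolfe loop, a deliberate simplification; the Gaussian kernel width and the data are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(X, c):
    sq = np.sum(X**2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / c)

def one_class_dual(K, nu, steps=2000):
    """Approximately minimize 0.5 a^T K a  s.t.  0 <= a_i <= 1/(nu*l), sum_i a_i = 1,
    by Frank-Wolfe (conditional gradient)."""
    l = K.shape[0]
    C = 1.0 / (nu * l)
    a = np.full(l, 1.0 / l)              # feasible starting point
    for t in range(steps):
        g = K @ a                        # gradient of the objective
        s = np.zeros(l)
        # linear minimization oracle over the box-constrained simplex:
        # put capacity C on the coordinates with the smallest gradients
        mass = 1.0
        for i in np.argsort(g):
            s[i] = min(C, mass)
            mass -= s[i]
            if mass <= 0:
                break
        a += (2.0 / (t + 2)) * (s - a)   # standard Frank-Wolfe step size
    return a

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))             # unlabeled sample (illustrative)
K = gaussian_gram(X, c=2.0)
nu = 0.2
alpha = one_class_dual(K, nu)
sv = alpha > 1e-6                        # support vectors
print(sv.sum(), "support vectors out of", len(X))
```

By feasibility alone, ∑_i α_i = 1 with each α_i ≤ 1/(νℓ) forces at least νℓ coefficients to be nonzero, which is exactly the lower bound on the fraction of SVs in proposition 3.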
We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a Gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined and the one of Vapnik and Chapelle (2000) will provide a solid foundation for this formidable task. This, alongside algorithmic extensions such as the possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is the subject of current research.
Appendix A Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply
\[
\hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), \tag{A.1}
\]
where B(x, ε) is the l₂(X) ball of radius ε centered at x and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1): Ĉ^plug-in = {x : p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ^plug-in, they analyzed Ĉ_b^plug-in = {x : p̂_ℓ(x) > b_ℓ}, where (b_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to b, but this connection cannot be readily exploited by this type of estimator.
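As a concrete illustration of the estimator in equation A.1 (with a fixed radius standing in for the decreasing sequence (ε_ℓ)_ℓ, and illustrative data), membership in Ĉ_ℓ reduces to a nearest-neighbor distance test:

```python
import numpy as np

def in_support_estimate(x, X_train, eps):
    # x lies in the union of balls B(x_i, eps) iff its nearest
    # training point is within distance eps
    d = np.linalg.norm(X_train - x, axis=1)
    return d.min() <= eps

rng = np.random.default_rng(2)
X_train = rng.uniform(-1.0, 1.0, size=(200, 2))  # sample from a known support
eps = 0.3                                        # illustrative fixed radius

inside = in_support_estimate(np.array([0.0, 0.0]), X_train, eps)   # center of support
outside = in_support_estimate(np.array([5.0, 5.0]), X_train, eps)  # far from support
print(inside, outside)
```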
The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1). (See also Korostelev & Tsybakov, 1993, chap. 8.)
A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters" related to C(α).
Define the excess mass over C at level α as
\[
E_{\mathcal{C}}(\alpha) = \sup \{ H_\alpha(C) : C \in \mathcal{C} \},
\]
where H_α(·) = (P − αλ)(·) and, again, λ denotes Lebesgue measure. Any set Γ_C(α) ∈ C such that
\[
E_{\mathcal{C}}(\alpha) = H_\alpha(\Gamma_{\mathcal{C}}(\alpha))
\]
is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E^ℓ_C(α) and Γ^ℓ_C(α). In other words, his estimator is
\[
\Gamma^\ell_{\mathcal{C}}(\alpha) = \operatorname{arg\,max} \{ (P_\ell - \alpha\lambda)(C) : C \in \mathcal{C} \},
\]
where the max is not necessarily unique. Now, while Γ^ℓ_C(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, Γ^ℓ_C(α) is more closely related to the determination of density contour clusters at level α:
\[
c_p(\alpha) = \{ x : p(x) \geq \alpha \}.
\]
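A one-dimensional toy sketch of the empirical excess mass criterion, restricting C to intervals with endpoints on a grid (the grid, sample, and level α are illustrative assumptions): the estimator maximizes the fraction of sample points captured minus α times the interval's Lebesgue measure.

```python
import numpy as np

def empirical_alpha_cluster(x, alpha, grid):
    """Maximize (P_l - alpha * lambda)(C) over intervals C = [a, b] with
    endpoints on a grid; returns the maximizing interval and its excess mass."""
    best, best_iv = -np.inf, None
    for i, a in enumerate(grid):
        for b in grid[i + 1:]:
            p_emp = np.mean((x >= a) & (x <= b))   # empirical measure P_l(C)
            excess = p_emp - alpha * (b - a)       # H_alpha(C) = (P_l - alpha*lambda)(C)
            if excess > best:
                best, best_iv = excess, (a, b)
    return best_iv, best

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=500)                 # standard normal sample
grid = np.linspace(-3, 3, 61)
iv, em = empirical_alpha_cluster(x, alpha=0.2, grid=grid)
print(iv, em)
```

For a standard normal sample and α = 0.2, the maximizer should roughly match the density contour cluster c_p(0.2) ≈ [−1.18, 1.18].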
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy, in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p, and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A^{(ℓ)}_ε with respect to p by
\[
A^{(\ell)}_\epsilon = \left\{ (x_1, \ldots, x_\ell) \in S^\ell : \left| -\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X) \right| \leq \epsilon \right\},
\]
where p(x_1, ..., x_ℓ) = ∏_{i=1}^ℓ p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,
\[
\operatorname{vol} A^{(\ell)}_\epsilon \doteq \operatorname{vol} C_\ell(1 - \delta) \doteq 2^{\ell h}.
\]
They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).
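A quick numerical check of the quantity appearing in the definition of A^{(ℓ)}_ε, assuming p is the standard normal density and using natural logarithms (so h is measured in nats rather than bits): by the law of large numbers, −(1/ℓ) log p(x_1, ..., x_ℓ) concentrates around h(X) = ½ log(2πe) ≈ 1.419.

```python
import numpy as np

rng = np.random.default_rng(4)
l = 100_000
x = rng.normal(size=l)                            # x_1, ..., x_l i.i.d. from p = N(0, 1)

log_p = -0.5 * np.log(2 * np.pi) - 0.5 * x**2     # log p(x_i) for the standard normal
emp = -np.mean(log_p)                             # -(1/l) log p(x_1, ..., x_l)
h = 0.5 * np.log(2 * np.pi * np.e)                # differential entropy of N(0, 1), in nats
print(emp, h)
```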
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).
Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F : (w · z) = ρ} to the origin is ρ/‖w‖. Therefore, the optimal hyperplane is obtained by minimizing ‖w‖ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).
Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).
Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].
Proof Sketch (Proposition 3). When changing ρ, the term ∑_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x'_o = x_o + δ · w, where |δ| < ξ_o/‖w‖, leads to a slack that is still nonzero, that is, ξ'_o > 0; hence we still have α_o = 1/(νℓ). Therefore, α' = α is still feasible, as is the primal solution (w', ξ', ρ'). Here, we use ξ'_i = (1 + δ · α_o)ξ_i for i ≠ o, w' = (1 + δ · α_o)w, and ρ' as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α'_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.
Definition 3. Let (X, d) be a pseudometric space, let A be a subset of X, and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ norm over a finite sample X = (x_1, ..., x_ℓ) for the pseudonorm on F:
\[
\| f \|_{l_\infty^X} = \max_{x \in X} | f(x) |. \tag{B.2}
\]
(The (pseudo)norm induces a (pseudo)metric in the usual way.) Suppose X is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ X^ℓ} N(ε, F, l_∞^X) (the maximum exists by definition of the covering number and compactness of X).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.
Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, ..., x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,
\[
P \left\{ x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma \right\} \leq \epsilon(\ell, k, \delta) = \frac{2}{\ell} \left( k + \log_2 \frac{\ell}{\delta} \right),
\]
where k = ⌈log₂ N(γ, F, 2ℓ)⌉.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
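To get a feel for the bound, one can compute a covering number empirically and plug it in. The sketch below greedily covers a small class of bounded linear functions evaluated on one sample (the class, sample, γ, and δ are illustrative assumptions; the theorem properly requires the maximum of N over samples of size 2ℓ, so this only illustrates the arithmetic):

```python
import numpy as np

def greedy_cover_size(F_vals, eps):
    """Size of a greedy eps-cover of the rows of F_vals in the l_inf metric.
    Greedy covering gives an upper bound on the covering number N."""
    centers = []
    for f in F_vals:
        if not any(np.max(np.abs(f - c)) <= eps for c in centers):
            centers.append(f)
    return len(centers)

rng = np.random.default_rng(5)
l, gamma, delta = 200, 0.1, 0.05
X = rng.uniform(-1, 1, size=(l, 1))
# a small class of linear functions f_w(x) = w * x with bounded weights
ws = np.linspace(-1, 1, 201)
F_vals = ws[:, None] * X[:, 0][None, :]   # each row: one f evaluated on the sample

N = greedy_cover_size(F_vals, gamma)
k = int(np.ceil(np.log2(N)))
bound = (2.0 / l) * (k + np.log2(l / delta))   # epsilon(l, k, delta) from theorem 2
print(N, k, bound)
```

Since greedy covering overestimates the minimal cover size, the resulting k and bound are conservative.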
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum min_{i∈[ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i) : x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
Let X be an inner product space. The following definition introduces a new inner product space based on X.
Definition 4. Let L(X) be the set of real-valued, nonnegative functions f on X with support supp(f) countable, that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by
\[
f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x)\, g(x).
\]
The 1-norm on L(X) is defined by ‖f‖₁ = ∑_{x∈supp(f)} f(x), and let L_B(X) = {f ∈ L(X) : ‖f‖₁ ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by
\[
\delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases}
\]
For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by
\[
g_f(y) = g^{X,\theta}_f(y) = \sum_{x \in X} d(x, f, \theta)\, \delta_x(y).
\]
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P, which has no atomic components, on the input space X, and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g^{X,θ}_f ∈ L_B(X) (i.e., ∑_{x∈X} d(x, f, θ) ≤ B),
\[
P \{ x : f(x) < \theta - 2\gamma \} \leq \frac{2}{\ell} \left( k + \log_2 \frac{\ell}{\delta} \right), \tag{B.3}
\]
where k = ⌈log₂ N(γ/2, F, 2ℓ) + log₂ N(γ/2, L_B(X), 2ℓ)⌉.
The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound log₂ N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log₂ N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B.S. and A.S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functional of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian.] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufman.
Received November 19, 1999; accepted September 22, 2000.
Estimating the Support of a High-Dimensional Distribution 1461
than the one that one is actually interested in For example in situationswhere one is interested only in detecting novelty it is not always necessaryto estimate a full density model of the data Indeed density estimation ismore difcult than what we are doing in several respects
Mathematically a density will exist only if the underlying probabilitymeasure possesses an absolutely continuous distribution function How-ever the general problem of estimating the measure for a large class of setssay the sets measureable in Borelrsquos sense is not solvable (for a discussionsee Vapnik 1998) Therefore we need to restrict ourselves to making a state-ment about the measure of some sets Given a small class of sets the simplestestimator that accomplishes this task is the empirical measure which sim-ply looks at how many training points fall into the region of interest Ouralgorithm does the opposite It starts with the number of training pointsthat are supposed to fall into the region and then estimates a region withthe desired property Often there will be many such regions The solutionbecomes unique only by applying a regularizer which in our case enforcesthat the region be small in a feature space associated with the kernel
Therefore we must keep in mind that the measure of smallness in thissense depends on the kernel used in a way that is no different from anyother method that regularizes in a feature space A similar problem how-ever appears in density estimation already when done in input space Letp denote a density on X If we perform a (nonlinear) coordinate transfor-mation in the input domain X then the density values will change looselyspeaking what remains constant is p(x) cent dx while dx is transformed tooWhen directly estimating the probability measure of regions we are notfaced with this problem as the regions automatically change accordingly
An attractive property of the measure of smallness that we chose to use isthat it can also be placed in the context of regularization theory leading to aninterpretation of the solution as maximally smooth in a sense that dependson the specic kernel used More specically let us assume that k is Greenrsquosfunction of PcurrenP for an operator P mapping into some inner product space(Smola Scholkopf amp Muller 1998 Girosi 1998) and take a look at the dualobjective function that we minimize
X
i j
aiajkiexclxi xj
centD
X
i j
aiajiexclk (xi cent) cent dxj (cent)
cent
DX
i j
aiajiexclk (xi cent) cent
iexclPcurrenPk
cent iexclxj cent
centcent
DX
i j
aiajiexcliexcl
Pkcent
(xi cent) centiexclPk
cent iexclxj cent
centcent
D
0
sup3
PX
iaik
acute(xi cent) cent
0
PX
j
ajk
1
A iexclxj cent
cent1
A
D kPfk2
1462 Bernhard Scholkopf et al
using f (x) DP
i aik(xi x) The regularization operators of common ker-nels can be shown to correspond to derivative operators (Poggio amp Girosi1990) therefore minimizing the dual objective function corresponds to max-imizing the smoothness of the function f (which is up to a thresholdingoperation the function we estimate) This in turn is related to a priorp( f ) raquo eiexclkPf k2
on the function spaceInterestingly as the minimization of the dual objective function also cor-
responds to a maximization of the margin in feature space an equivalentinterpretation is in terms of a prior on the distribution of the unknown otherclass (the ldquonovelrdquo class in a novelty detection problem) trying to separatethe data from the origin amounts to assuming that the novel examples liearound the origin
The main inspiration for our approach stems from the earliest work ofVapnik and collaborators In 1962 they proposed an algorithm for charac-terizing a set of unlabeled data points by separating it from the origin usinga hyperplane (Vapnik amp Lerner 1963 Vapnik amp Chervonenkis 1974) How-ever they quickly moved on to two-class classication problems in termsof both algorithms and the theoretical development of statistical learningtheory which originated in those days
From an algorithmic point of view we can identify two shortcomings ofthe original approach which may have caused research in this direction tostop for more than three decades Firstly the original algorithm in Vapnikand Chervonenkis (1974) was limited to linear decision rules in input spacesecond there was no way of dealing with outliers In conjunction theserestrictions are indeed severe a generic data set need not be separable fromthe origin by a hyperplane in input space
The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers
We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the
Estimating the Support of a High-Dimensional Distribution 1463
possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research
Appendix A Supplementary Material for Section 2
A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply
OC` D[
iD1
B(xi 2 `) (A1)
where $B(x, \epsilon)$ is the $l_2(\mathcal{X})$ ball of radius $\epsilon$ centered at $x$ and $(\epsilon_\ell)_\ell$ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between $C(1)$ and $\hat{C}_\ell$. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of $C(1)$, $\hat{C}_{\text{plug-in}} = \{x : \hat{p}_\ell(x) > 0\}$, where $\hat{p}_\ell$ is a kernel density estimator. In order to avoid problems with $\hat{C}_{\text{plug-in}}$, they analyzed $\hat{C}^\beta_{\text{plug-in}} = \{x : \hat{p}_\ell(x) > \beta_\ell\}$, where $(\beta_\ell)_\ell$ is an appropriately chosen sequence. Clearly, for a given distribution, $\alpha$ is related to $\beta$, but this connection cannot be readily exploited by this type of estimator.
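In one dimension, both estimators can be sketched in a few lines (our own illustration; the function names are hypothetical, and SciPy's `gaussian_kde` stands in for the kernel density estimator):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.RandomState(1)
sample = rng.uniform(0.0, 1.0, size=200)   # x_1, ..., x_l drawn from P

def in_ball_union(x, pts, eps):
    """Membership in the estimator of equation A.1: the union of balls B(x_i, eps)."""
    return bool(np.any(np.abs(pts - x) <= eps))

kde = gaussian_kde(sample)                 # the kernel density estimator

def in_plugin(x, beta):
    """Membership in the thresholded plug-in estimator {x : density(x) > beta}."""
    return bool(kde(np.atleast_1d(x))[0] > beta)

print(in_ball_union(0.5, sample, 0.05), in_ball_union(5.0, sample, 0.05))
print(in_plugin(0.5, beta=0.1), in_plugin(5.0, beta=0.1))
```

A point well inside the support passes both membership tests; a point far outside fails both.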
The most recent work relating to the estimation of $C(1)$ is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of $C(1)$. Two examples are the volume of $C(1)$ and the center of $C(1)$. (See also Korostelev & Tsybakov, 1993, chap. 8.)
A.2 Estimating High Probability Regions ($\alpha \neq 1$). Polonik (1995b) has studied the use of the "excess mass approach" (Muller, 1992) to construct an estimator of "generalized $\alpha$-clusters" related to $C(\alpha)$.
Define the excess mass over $\mathcal{C}$ at level $\alpha$ as

$$E_{\mathcal{C}}(\alpha) = \sup\,\{H_\alpha(C) : C \in \mathcal{C}\},$$

where $H_\alpha(\cdot) = (P - \alpha\lambda)(\cdot)$ and again $\lambda$ denotes Lebesgue measure. Any set $\Gamma_{\mathcal{C}}(\alpha) \in \mathcal{C}$ such that

$$E_{\mathcal{C}}(\alpha) = H_\alpha(\Gamma_{\mathcal{C}}(\alpha))$$

is called a generalized $\alpha$-cluster in $\mathcal{C}$. Replace $P$ by $P_\ell$ in these definitions to obtain their empirical counterparts $E^\ell_{\mathcal{C}}(\alpha)$ and $\Gamma^\ell_{\mathcal{C}}(\alpha)$. In other words, his estimator is

$$\Gamma^\ell_{\mathcal{C}}(\alpha) = \arg\max\,\{(P_\ell - \alpha\lambda)(C) : C \in \mathcal{C}\},$$
where the max is not necessarily unique. Now, while $\Gamma^\ell_{\mathcal{C}}(\alpha)$ is clearly different from $C_\ell(\alpha)$, it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, $\Gamma^\ell_{\mathcal{C}}(\alpha)$ is more closely related to the determination of density contour clusters at level $\alpha$:

$$c_p(\alpha) = \{x : p(x) \geq \alpha\}.$$
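For the class $\mathcal{C}$ of intervals on the line, the empirical excess-mass maximization can be carried out exhaustively over intervals with sample-point endpoints. A small illustration of Polonik's definition (our own construction; the names are ours):

```python
import numpy as np

rng = np.random.RandomState(2)
sample = np.sort(rng.normal(0.0, 1.0, size=80))   # one-dimensional data

def excess_mass_interval(pts, alpha):
    """argmax over intervals C of (P_l - alpha * lambda)(C): empirical mass
    of C minus alpha times its Lebesgue measure (length)."""
    best, best_iv = -np.inf, None
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            mass = (j - i + 1) / len(pts)        # P_l([x_i, x_j])
            length = pts[j] - pts[i]             # lambda([x_i, x_j])
            val = mass - alpha * length
            if val > best:
                best, best_iv = val, (pts[i], pts[j])
    return best, best_iv

E, (a, b) = excess_mass_interval(sample, alpha=0.2)
print(E, a, b)   # the generalized alpha-cluster among intervals
```

For gaussian data and $\alpha$ below the density peak, the maximizing interval roughly recovers the density contour cluster $\{x : p(x) \geq \alpha\}$.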
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating $c_p(\alpha)$. They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that $\mathcal{X}$ is one-dimensional. Suppose $X$ is a one-dimensional random variable with density $p$. Let $S = C(1)$ be the support of $p$ and define the differential entropy of $X$ by $h(X) = -\int_S p(x) \log p(x)\,dx$. For $\epsilon > 0$ and $\ell \in \mathbb{N}$, define the typical set $A^{(\ell)}_\epsilon$ with respect to $p$ by

$$A^{(\ell)}_\epsilon = \left\{(x_1, \ldots, x_\ell) \in S^\ell : \left| -\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X) \right| \leq \epsilon \right\},$$

where $p(x_1, \ldots, x_\ell) = \prod_{i=1}^{\ell} p(x_i)$. If $(a_\ell)_\ell$ and $(b_\ell)_\ell$ are sequences, the notation $a_\ell \doteq b_\ell$ means $\lim_{\ell \to \infty} \tfrac{1}{\ell} \log \tfrac{a_\ell}{b_\ell} = 0$. Cover and Thomas (1991) show that for all $\epsilon, \delta < \tfrac{1}{2}$,

$$\mathrm{vol}\, A^{(\ell)}_\epsilon \doteq \mathrm{vol}\, C_\ell(1 - \delta) \doteq 2^{\ell h}.$$

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately $2^{\ell h}$. This is an $\ell$-dimensional volume, so the corresponding side length is $(2^{\ell h})^{1/\ell} = 2^h$. This provides an interpretation of differential entropy" (p. 227).
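The asymptotic equipartition property behind this statement is easy to verify numerically: for i.i.d. draws, $-\tfrac{1}{\ell}\log p(x_1,\ldots,x_\ell)$ concentrates around $h(X)$. A sketch for a gaussian density, whose differential entropy is $\tfrac{1}{2}\log(2\pi e \sigma^2)$ nats (assuming NumPy is available):

```python
import numpy as np

rng = np.random.RandomState(3)
sigma, ell = 2.0, 100_000

x = rng.normal(0.0, sigma, size=ell)
# log of the gaussian density, evaluated at each sample point
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)

empirical = float(-np.mean(log_p))                # -(1/l) log p(x_1, ..., x_l)
h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # differential entropy h(X)
print(empirical, h)   # the two agree closely for large l
```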
Appendix B: Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some $\rho > 0$ and $w \in F$ such that $(w \cdot x_i) \geq \rho$ for $i \in [\ell]$ (by rescaling $w$, this can be seen to work for arbitrarily large $\rho$). The distance of the hyperplane $\{z \in F : (w \cdot z) = \rho\}$ to the origin is $\rho / \|w\|$. Therefore, the optimal hyperplane is obtained by minimizing $\|w\|$ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that $(-w, \rho)$ parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by $(w, \rho)$. This provides an optimal separation of the two sets, with distance $2\rho$ and a separating hyperplane $(w, 0)$.
Ad (ii). By assumption, $w$ is the shortest vector satisfying $y_i(w \cdot x_i) \geq \rho$ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying $(w \cdot y_i x_i) \geq \rho$ for $i \in [\ell]$.
Proof Sketch (Proposition 3). When changing $\rho$, the term $\sum_i \xi_i$ in equation 3.4 will change proportionally to the number of points that have a nonzero $\xi_i$ (the outliers), plus, when changing $\rho$ in the positive direction, the number of points that are just about to get a nonzero $\xi_i$, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Scholkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose $x_o$ is an outlier, that is, $\xi_o > 0$; hence by the KKT conditions (Bertsekas, 1995), $\alpha_o = 1/(\nu\ell)$. Transforming it into $x'_o = x_o + \delta \cdot w$, where $|\delta| < \xi_o / \|w\|$, leads to a slack that is still nonzero, that is, $\xi'_o > 0$; hence we still have $\alpha_o = 1/(\nu\ell)$. Therefore, $\alpha' = \alpha$ is still feasible, as is the primal solution $(w', \xi', \rho')$. Here, we use $\xi'_i = (1 + \delta \cdot \alpha_o)\xi_i$ for $i \neq o$, $w' = (1 + \delta \cdot \alpha_o)w$, and $\rho'$ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still $\alpha'_o = 1/(\nu\ell)$. Thus (Bertsekas, 1995), $\alpha$ is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class $\mathcal{F}$ of functions that map $\mathcal{X}$ to $\mathbb{R}$.
Definition 3. Let $(\mathcal{X}, d)$ be a pseudometric space, and let $A$ be a subset of $\mathcal{X}$ and $\epsilon > 0$. A set $U \subseteq \mathcal{X}$ is an $\epsilon$-cover for $A$ if, for every $a \in A$, there exists $u \in U$ such that $d(a, u) \leq \epsilon$. The $\epsilon$-covering number of $A$, $\mathcal{N}(\epsilon, A, d)$, is the minimal cardinality of an $\epsilon$-cover for $A$ (if there is no such finite cover, then it is defined to be $\infty$).
Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that $U$ should be finite but approximate all of $A$ with respect to the pseudometric $d$. We will use the $l_\infty$ norm over a finite sample $X = (x_1, \ldots, x_\ell)$ for the pseudonorm on $\mathcal{F}$:

$$\|f\|_{l^X_\infty} = \max_{x \in X} |f(x)|. \tag{B.2}$$

(The (pseudo)norm induces a (pseudo)metric in the usual way.) Suppose $\mathcal{X}$ is compact. Let $\mathcal{N}(\epsilon, \mathcal{F}) = \max_{X \in \mathcal{X}^\ell} \mathcal{N}(\epsilon, \mathcal{F}, l^X_\infty)$ (the maximum exists by definition of the covering number and compactness of $\mathcal{X}$).
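For a finite class of functions evaluated on a sample, an $\epsilon$-cover in the metric induced by equation B.2 can be computed greedily; the size of the resulting cover is an upper bound on the covering number. A sketch (our own construction, assuming NumPy):

```python
import numpy as np

rng = np.random.RandomState(4)
X = np.linspace(0.0, 1.0, 50)                     # the sample x_1, ..., x_l
F = [lambda x, w=w: np.sin(w * x) for w in rng.uniform(0, 5, size=30)]
vals = np.array([f(X) for f in F])                # each row: one f on the sample

def dist(i, j):
    """Empirical pseudometric of equation B.2: max over the sample of |f_i - f_j|."""
    return float(np.max(np.abs(vals[i] - vals[j])))

def greedy_cover(eps):
    """Greedy epsilon-cover: repeatedly pick an uncovered function as a center."""
    centers, uncovered = [], set(range(len(F)))
    while uncovered:
        c = min(uncovered)
        centers.append(c)
        uncovered = {i for i in uncovered if dist(i, c) > eps}
    return centers

cover = greedy_cover(0.5)
print(len(cover))   # an upper bound on the covering number N(0.5, F, l_inf^X)
```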
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).
Below, the notation $\lceil t \rceil$ denotes the smallest integer $\geq t$. Several of the results stated claim that some event occurs with probability $1 - \delta$. In all cases, $0 < \delta < 1$ is understood, and $\delta$ is presumed in fact to be small.
Theorem 2. Consider any distribution $P$ on $\mathcal{X}$ and a sturdy real-valued function class $\mathcal{F}$ on $\mathcal{X}$. Suppose $x_1, \ldots, x_\ell$ are generated i.i.d. from $P$. Then with probability $1 - \delta$ over such an $\ell$-sample, for any $f \in \mathcal{F}$ and for any $\gamma > 0$,

$$P\left\{x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma\right\} \leq \epsilon(\ell, k, \delta) = \frac{2}{\ell}\left(k + \log\frac{\ell}{\delta}\right),$$

where $k = \lceil \log \mathcal{N}(\gamma, \mathcal{F}, 2\ell) \rceil$.
The proof, which is given in Scholkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
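To get a feel for the theorem, the right-hand side can be evaluated directly; the bound is nontrivial only once $\ell$ is large relative to the log covering number $k$. A small calculator ($k$ and $\delta$ are chosen arbitrarily for illustration):

```python
import math

def theorem2_bound(ell, k, delta):
    """Right-hand side of theorem 2: (2/ell) * (k + log(ell/delta))."""
    return (2.0 / ell) * (k + math.log(ell / delta))

for ell in (100, 1000, 10000):
    print(ell, theorem2_bound(ell, k=50, delta=0.01))
```

The bound decreases roughly as $1/\ell$ once $\ell$ dominates $k$.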
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum $\min_{i \in [\ell]} f(x_i)$. We therefore now consider a technique that will enable us to ignore some of the smallest values in the set $\{f(x_i) : x_i \in X\}$ at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable $\xi_i$ in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of $D$ (definition 2).
Let $\mathcal{X}$ be an inner product space. The following definition introduces a new inner product space based on $\mathcal{X}$.
Definition 4. Let $L(\mathcal{X})$ be the set of real-valued, nonnegative functions $f$ on $\mathcal{X}$ with support $\mathrm{supp}(f)$ countable; that is, functions in $L(\mathcal{X})$ are nonzero for at most countably many points. We define the inner product of two functions $f, g \in L(\mathcal{X})$ by

$$f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x) g(x).$$

The 1-norm on $L(\mathcal{X})$ is defined by $\|f\|_1 = \sum_{x \in \mathrm{supp}(f)} f(x)$, and let $L_B(\mathcal{X}) = \{f \in L(\mathcal{X}) : \|f\|_1 \leq B\}$. Now we define an embedding of $\mathcal{X}$ into the inner product space $\mathcal{X} \times L(\mathcal{X})$ via $\tau : \mathcal{X} \to \mathcal{X} \times L(\mathcal{X})$, $\tau : x \mapsto (x, \delta_x)$, where $\delta_x \in L(\mathcal{X})$ is defined by

$$\delta_x(y) = \begin{cases} 1 & \text{if } y = x, \\ 0 & \text{otherwise.} \end{cases}$$
For a function $f \in \mathcal{F}$, a set of training examples $X$, and $\theta \in \mathbb{R}$, we define the function $g_f \in L(\mathcal{X})$ by

$$g_f(y) = g_f^{X,\theta}(y) = \sum_{x \in X} d(x, f, \theta)\,\delta_x(y).$$
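The space $L(\mathcal{X})$, the point masses $\delta_x$, and the function $g_f$ can be realized directly with dictionaries, since all supports involved here are finite. In the sketch below (names are ours), $d(x, f, \theta)$ is taken, following definition 2, to be the shortfall $\max(0, \theta - f(x))$:

```python
from collections import defaultdict

def inner(f, g):
    """Inner product on L(X): sum over the (countable) support of f."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    """1-norm on L(X)."""
    return sum(f.values())

def delta(x):
    """The point mass delta_x in L(X)."""
    return {x: 1.0}

def d(x, f, theta):
    """Shortfall of f at x relative to the threshold theta (definition 2)."""
    return max(0.0, theta - f(x))

def g_f(f, X, theta):
    """g_f = sum over training points x of d(x, f, theta) * delta_x."""
    g = defaultdict(float)
    for x in X:
        for y, v in delta(x).items():
            g[y] += d(x, f, theta) * v
    return dict(g)

f = lambda x: x * x
X = [0.0, 0.5, 1.0, 2.0]
g = g_f(f, X, theta=1.0)
print(g, one_norm(g))   # entries are nonzero only where f(x) < theta
```

The 1-norm of $g_f$ is exactly the total slack $\sum_{x \in X} d(x, f, \theta)$, which is what the condition $g_f \in L_B(\mathcal{X})$ in theorem 3 constrains.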
Theorem 3. Fix $B > 0$. Consider a fixed but unknown probability distribution $P$, which has no atomic components, on the input space $\mathcal{X}$, and a sturdy class of real-valued functions $\mathcal{F}$. Then with probability $1 - \delta$ over randomly drawn training sequences $X$ of size $\ell$, for all $\gamma > 0$ and any $f \in \mathcal{F}$ and any $\theta$ such that $g_f = g_f^{X,\theta} \in L_B(\mathcal{X})$ (i.e., $\sum_{x \in X} d(x, f, \theta) \leq B$),

$$P\left\{x : f(x) < \theta - 2\gamma\right\} \leq \frac{2}{\ell}\left(k + \log\frac{\ell}{\delta}\right), \tag{B.3}$$

where $k = \left\lceil \log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell) + \log \mathcal{N}(\gamma/2, L_B(\mathcal{X}), 2\ell) \right\rceil$.
The assumption on $P$ can in fact be weakened to require only that there is no point $x \in \mathcal{X}$ satisfying $f(x) < \theta - 2\gamma$ that has discrete probability. The proof is given in Scholkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which $f(x)$ has value less than $\theta - 2\gamma$, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by $2\gamma$ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to $\gamma$) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997); their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when $\gamma$ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity $k$ in theorem 3. To this end, we use a result of Williamson, Smola, and Scholkopf (2000) to bound $\log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell)$ and a result of Shawe-Taylor and Cristianini (2000) to bound $\log \mathcal{N}(\gamma/2, L_B(\mathcal{X}), 2\ell)$. We apply theorem 3, stratifying over possible values of $B$. The proof is given in Scholkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments

This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnorr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.

Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Scholkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Muller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Scholkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Scholkopf, B., Smola, A., & Muller, K.-R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Scholkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Scholkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235, pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Scholkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Scholkopf, B., & Muller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Scholkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Scholkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Scholkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Scholkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufman.
Received November 19, 1999; accepted September 22, 2000.
The two modications that we have incorporated dispose of these short-comings First the kernel trick allows for a much larger class of functionsby nonlinearly mapping into a high-dimensional feature space and therebyincreases the chances of a separation from the origin being possible In par-ticular usinga gaussian kernel (seeequation33) sucha separation isalwayspossible as shown in section 5 The second modication directly allows forthe possibility of outliers We have incorporated this softness of the decisionrule using the ordm-trick (Scholkopf Platt amp Smola 2000) and thus obtained adirect handle on the fraction of outliers
We believe that our approach proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for aproblem that so far has mainly been studied from a theoretical point of viewhas abundant practical applications To turn the algorithm into an easy-to-use black-box method for practitioners questions like the selection of kernelparameters (such as the width of a gaussian kernel) have to be tackled It isour hope that theoretical results as the one we have briey outlined and theone of Vapnik and Chapelle (2000) will provide a solid foundation for thisformidable task This alongside with algorithmic extensions such as the
Estimating the Support of a High-Dimensional Distribution 1463
possibility of taking into account information about the ldquoabnormalrdquo class(Scholkopf Platt amp Smola 2000) is subject of current research
Appendix A Supplementary Material for Section 2
A1 Estimating the Support of a Density The problem of estimatingC(1) appears to have rst been studied by Geffroy (1964) who consideredX D R2 with piecewise constant estimators There have been a number ofworks studying a natural nonparametric estimator of C(1) (eg Chevalier1976 Devroye amp Wise 1980 see Gayraud 1997 for further references) Thenonparametric estimator is simply
OC` D[
iD1
B(xi 2 `) (A1)
where B(x 2 ) is the l2( X ) ball of radius 2 centered at x and ( 2 `)` is an appro-priately chosen decreasing sequence Devroye amp Wise (1980) showed theasymptotic consistency of (A1) with respect to the symmetric differencebetween C(1) and OC Cuevas and Fraiman (1997) studied the asymptoticconsistency of a plug-in estimator of C(1) OCplug-in D
copyx Op`(x) gt 0
ordfwhere
Op` is a kernel density estimator In order to avoid problems with OCplug-inthey analyzed
OCplug-inb D
copyx Op`(x) gt b`
ordf where (b`)` is an appropriately chosen se-
quence Clearly for a given distribution a is related to b but this connectioncannot be readily exploited by this type of estimator
The most recent work relating to the estimation of C(1) is by Gayraud(1997) who has made an asymptotic minimax study of estimators of func-tionals of C(1) Two examples are vol C(1) or the center of C(1) (See alsoKorostelev amp Tsybakov 1993 chap 8)
A2 Estimating High Probability Regions (a 6D 1) Polonik (1995b) hasstudied the use of the ldquoexcess mass approachrdquo (Muller 1992) to constructan estimator of ldquogeneralized a-clustersrdquo related to C(a)
Dene the excess mass over C at level a as
E C (a) D sup fHa(C) C 2 C g
where Ha(cent) D (P iexcl al)(cent) and again l denotes Lebesgue measure Any setC C (a) 2 C such that
E C (a) D Ha (C C (a) )
is called a generalized a-cluster in C Replace P by P` in these denitions toobtain their empirical counterparts E C (a) and C C (a) In other words hisestimator is
C C (a) D arg max f(P` iexcl al)(C) C 2 C g
1464 Bernhard Scholkopf et al
where the max is not necessarily unique Now while C C (a) is clearly dif-ferent from C`(a) it is related to it in that it attempts to nd small regionswith as much excess mass (which is similar to nding small regions with agiven amount of probability mass) Actually C C (a) is more closely relatedto the determination of density contour clusters at level a
cp (a) D fx p(x) cedil ag
Simultaneously and independently Ben-David amp Lindenbaum (1997)studied the problem of estimating cp (a) They too made use of VC classesbut stated their results in a stronger form which is meaningful for nitesample sizes
Finally we point out a curious connection between minimum volumesets of a distribution and its differential entropy in the case that X is one-dimensional Suppose X is a one-dimensional random variable with densityp Let S D C(1) be the support of p and dene the differential entropy of Xby h(X) D iexcl
RS p(x) log p(x)dx For 2 gt 0 and ` 2 N dene the typical set
A( )2 with respect to p by
A( )2 D
copy(x1 x`) 2 S` | iexcl 1
`log p(x1 x`) iexcl h(X) | middot 2
ordf
where p(x1 x`) DQ`
iD1 p(xi) If (a`)`and (b`)`are sequences thenotationa`
D b` means lim`11` log a`
b`D 0 Cover and Thomas (1991) show that
for all 2 d lt 12 then
vol A( )2
D vol C`(1 iexcld ) D 2 h
They point out that this result ldquoindicates that the volume of the smallestset that contains most of the probability is approximately 2 h This is a -dimensional volume so the corresponding side length is (2 h )1` D 2h Thisprovides an interpretation of differential entropyrdquo (p 227)
Appendix B Proofs of Section 5
Proof (Proposition 1) Due to the separability the convex hull of thedata does not contain the origin The existence and uniqueness of the hy-perplane then follow from the supporting hyperplane theorem (Bertsekas1995)
Moreover separability implies that there actually exists some r gt 0and w 2 F such that (w cent xi ) cedil r for i 2 [ ] (by rescaling w this canbe seen to work for arbitrarily large r) The distance of the hyperplanefz 2 F (w centz) D rg to the origin is r kwk Therefore the optimal hyperplaneis obtained by minimizing kwk subject to these constraints that is by thesolution of equation 53
Proof (Proposition 2) Ad (i) By construction the separation of equa-tion 56 is a point-symmetric problem Hence the optimal separating hyper-plane passes through the origin for if it did not we could obtain another
Estimating the Support of a High-Dimensional Distribution 1465
optimal separating hyperplane by reecting the rst one with respect tothe origin this would contradict the uniqueness of the optimal separatinghyperplane (Vapnik 1995)
Next observe that (iexclw r ) parameterizes the supporting hyperplane forthe data set reected through the origin and that it is parallel to the onegiven by (w r ) This provides an optimal separation of the two sets withdistance 2r and a separating hyperplane (w 0)
Ad (ii) By assumption w is the shortest vector satisfying yi(w cent xi) cedil r
(note that the offset is 0) Hence equivalently it is also the shortest vectorsatisfying (w cent yixi ) cedil r for i 2 [ ]
Proof Sketch (Proposition 3) When changing r the termP
i ji in equa-tion 34 will change proportionally to the number of points that have anonzero ji (the outliers) plus when changing r in the positive directionthe number of points that are just about to get a nonzero r that is which siton the hyperplane (the SVs) At the optimum of equation 34 we thereforehave parts i and ii Part iii can be proven by a uniform convergence argumentshowing that since the covering numbers of kernel expansions regularizedby a norm in some feature space are well behaved the fraction of pointsthat lie exactly on the hyperplane is asymptotically negligible (ScholkopfSmola et al 2000)
Proof (Proposition 4) Suppose xo is an outlier that is jo gt 0 henceby the KKT conditions (Bertsekas 1995) ao D 1(ordm ) Transforming it intox0
o D xo C d cent w where |d | lt jokwk leads to a slack that is still nonzero thatisj 0
o gt 0 hence we still have ao D 1 (ordm ) Therefore reg0 D reg is still feasibleas is the primal solution (w0 raquo0 r 0 ) Here we use j 0
i D (1 C d cent ao)ji for i 6D ow0 D (1 C d cent ao)w and r 0 as computed from equation 312 Finally the KKTconditions are still satised as still a0
o D 1 (ordm ) Thus (Bertsekas 1995) reg isstill the optimal solution
We now move on to the proof of theorem 1 We start by introducing acommon tool for measuring the capacity of a class F of functions that mapX to R
Denition 3 Let (X d) be a pseudometric space and let A be a subset of X and2 gt 0 A set U micro X is an 2 -cover for A if for every a 2 A there exists u 2 Usuch that d(a u) middot 2 The 2 -covering number of A N ( 2 A d) is the minimalcardinality of an 2 -cover for A (if there is no such nite cover then it is dened tobe 1)
Note that we have used ldquoless than or equal tordquo in the denition of a coverThis is somewhat unconventional but will not change the bounds we useIt is however technically useful in the proofs
1466 Bernhard Scholkopf et al
The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F
k f klX1D max
x2X| f (x) | (B2)
(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )
We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)
Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small
Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0
Praquo
x f (x) lt mini2[ ]
f (xi) iexcl 2c
frac14middot 2 ( k d) D 2
`(k C log `
d)
where k D dlog N (c F 2 )e
The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument
Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here
We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable $\xi_i$ in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of $D$ (definition 2).
Let $\mathcal{X}$ be an inner product space. The following definition introduces a new inner product space based on $\mathcal{X}$.

Definition 4. Let $L(\mathcal{X})$ be the set of real-valued, nonnegative functions $f$ on $\mathcal{X}$ with countable support $\mathrm{supp}(f)$; that is, functions in $L(\mathcal{X})$ are nonzero for at most countably many points. We define the inner product of two functions $f, g \in L(\mathcal{X})$ by

$$f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x) g(x).$$

The 1-norm on $L(\mathcal{X})$ is defined by $\|f\|_1 = \sum_{x \in \mathrm{supp}(f)} f(x)$, and let $L_B(\mathcal{X}) := \{f \in L(\mathcal{X}) : \|f\|_1 \leq B\}$. Now we define an embedding of $\mathcal{X}$ into the inner product space $\mathcal{X} \times L(\mathcal{X})$ via $\tau : \mathcal{X} \to \mathcal{X} \times L(\mathcal{X})$, $\tau : x \mapsto (x, \delta_x)$, where $\delta_x \in L(\mathcal{X})$ is defined by

$$\delta_x(y) = \begin{cases} 1 & \text{if } y = x, \\ 0 & \text{otherwise.} \end{cases}$$

For a function $f \in F$, a set of training examples $X$, and $\theta \in \mathbb{R}$, we define the function $g_f \in L(\mathcal{X})$ by

$$g_f(y) = g_f^{X,\theta}(y) := \sum_{x \in X} d(x, f, \theta)\, \delta_x(y).$$
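Definition 4 is straightforward to mirror computationally: a function in $L(\mathcal{X})$ with finite support is just a sparse map from points to nonnegative values. A minimal sketch (our own illustration; the helper names are hypothetical):

```python
def lx_inner(f, g):
    """Inner product f . g = sum_{x in supp(f)} f(x) g(x), for functions
    in L(X) represented as dicts mapping points to nonnegative values."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def lx_one_norm(f):
    """1-norm ||f||_1 = sum_{x in supp(f)} f(x)."""
    return sum(f.values())

def delta_fn(x):
    """delta_x from the embedding tau: 1 at x, 0 elsewhere."""
    return {x: 1.0}
```

By construction, `lx_inner(f, delta_fn(x))` recovers $f(x)$, mirroring the role $\delta_x$ plays in the embedding.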
Theorem 3. Fix $B > 0$. Consider a fixed but unknown probability distribution $P$, which has no atomic components, on the input space $\mathcal{X}$, and a sturdy class of real-valued functions $F$. Then with probability $1 - \delta$ over randomly drawn training sequences $X$ of size $\ell$, for all $\gamma > 0$ and any $f \in F$ and any $\theta$ such that $g_f = g_f^{X,\theta} \in L_B(\mathcal{X})$ (i.e., $\sum_{x \in X} d(x, f, \theta) \leq B$),

$$P\left\{x : f(x) < \theta - 2\gamma\right\} \leq \frac{2}{\ell}\left(k + \log\frac{\ell}{\delta}\right), \qquad (B.3)$$

where $k = \left\lceil \log N(\gamma/2, F, 2\ell) + \log N(\gamma/2, L_B(\mathcal{X}), 2\ell) \right\rceil$.

(The assumption on $P$ can in fact be weakened to require only that there is no point $x \in \mathcal{X}$ satisfying $f(x) < \theta - 2\gamma$ that has discrete probability.) The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which $f(x)$ has value less than $\theta - 2\gamma$, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by $2\gamma$ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.

The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to $\gamma$) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when $\gamma$ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity $k$ in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound $\log N(\gamma/2, F, 2\ell)$ and a result of Shawe-Taylor and Cristianini (2000) to bound $\log N(\gamma/2, L_B(\mathcal{X}), 2\ell)$. We apply theorem 3, stratifying over possible values of $B$. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B.S. and A.S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D. Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000) is the subject of current research.
Appendix A: Supplementary Material for Section 2
A.1 Estimating the Support of a Density. The problem of estimating $C(1)$ appears to have first been studied by Geffroy (1964), who considered $\mathcal{X} = \mathbb{R}^2$ with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of $C(1)$ (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997, for further references). The nonparametric estimator is simply

$$\hat{C}_\ell = \bigcup_{i=1}^{\ell} B(x_i, \epsilon_\ell), \qquad (A.1)$$

where $B(x, \epsilon)$ is the $l_2(\mathcal{X})$ ball of radius $\epsilon$ centered at $x$ and $(\epsilon_\ell)_\ell$ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of (A.1) with respect to the symmetric difference between $C(1)$ and $\hat{C}_\ell$. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of $C(1)$: $\hat{C}_{\text{plug-in}} = \{x : \hat{p}_\ell(x) > 0\}$, where $\hat{p}_\ell$ is a kernel density estimator. In order to avoid problems with $\hat{C}_{\text{plug-in}}$, they analyzed $\hat{C}^\beta_{\text{plug-in}} = \{x : \hat{p}_\ell(x) > \beta_\ell\}$, where $(\beta_\ell)_\ell$ is an appropriately chosen sequence. Clearly, for a given distribution, $\alpha$ is related to $\beta$, but this connection cannot be readily exploited by this type of estimator.
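The estimator (A.1) amounts to a simple membership test: a point belongs to $\hat{C}_\ell$ iff it is within distance $\epsilon_\ell$ of some sample point. A minimal sketch (our own illustration; the function name is hypothetical):

```python
import math

def in_support_estimate(x, sample, eps):
    """Membership test for the estimator (A.1): x lies in the union of
    l2 balls B(x_i, eps) iff some sample point x_i is within eps of x."""
    return any(math.dist(x, xi) <= eps for xi in sample)
```

For an i.i.d. sample and an appropriately decreasing sequence $(\epsilon_\ell)_\ell$, this union of balls is the set whose asymptotic consistency Devroye and Wise established.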
The most recent work relating to the estimation of $C(1)$ is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of $C(1)$. Two examples are vol $C(1)$ and the center of $C(1)$. (See also Korostelev & Tsybakov, 1993, chap. 8.)
A.2 Estimating High Probability Regions ($\alpha \neq 1$). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized $\alpha$-clusters" related to $C(\alpha)$.

Define the excess mass over $\mathcal{C}$ at level $\alpha$ as

$$E_{\mathcal{C}}(\alpha) = \sup \{H_\alpha(C) : C \in \mathcal{C}\},$$

where $H_\alpha(\cdot) = (P - \alpha\lambda)(\cdot)$ and again $\lambda$ denotes Lebesgue measure. Any set $\Gamma_{\mathcal{C}}(\alpha) \in \mathcal{C}$ such that

$$E_{\mathcal{C}}(\alpha) = H_\alpha(\Gamma_{\mathcal{C}}(\alpha))$$

is called a generalized $\alpha$-cluster in $\mathcal{C}$. Replace $P$ by $P_\ell$ in these definitions to obtain their empirical counterparts $E^\ell_{\mathcal{C}}(\alpha)$ and $\Gamma^\ell_{\mathcal{C}}(\alpha)$. In other words, his estimator is

$$\Gamma^\ell_{\mathcal{C}}(\alpha) = \arg\max \{(P_\ell - \alpha\lambda)(C) : C \in \mathcal{C}\},$$
where the max is not necessarily unique. Now, while $\Gamma^\ell_{\mathcal{C}}(\alpha)$ is clearly different from $C_\ell(\alpha)$, it is related to it in that it attempts to find small regions with as much excess mass as possible (which is similar to finding small regions with a given amount of probability mass). Actually, $\Gamma^\ell_{\mathcal{C}}(\alpha)$ is more closely related to the determination of density contour clusters at level $\alpha$:

$$c_p(\alpha) = \{x : p(x) \geq \alpha\}.$$

Simultaneously, and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating $c_p(\alpha)$. They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy, in the case that $\mathcal{X}$ is one-dimensional. Suppose $X$ is a one-dimensional random variable with density $p$. Let $S = C(1)$ be the support of $p$ and define the differential entropy of $X$ by $h(X) = -\int_S p(x) \log p(x)\, dx$. For $\epsilon > 0$ and $\ell \in \mathbb{N}$, define the typical set $A^{(\ell)}_\epsilon$ with respect to $p$ by

$$A^{(\ell)}_\epsilon = \left\{ (x_1, \ldots, x_\ell) \in S^\ell : \left| -\tfrac{1}{\ell} \log p(x_1, \ldots, x_\ell) - h(X) \right| \leq \epsilon \right\},$$

where $p(x_1, \ldots, x_\ell) = \prod_{i=1}^{\ell} p(x_i)$. If $(a_\ell)_\ell$ and $(b_\ell)_\ell$ are sequences, the notation $a_\ell \doteq b_\ell$ means $\lim_{\ell \to \infty} \frac{1}{\ell} \log \frac{a_\ell}{b_\ell} = 0$. Cover and Thomas (1991) show that for all $\epsilon, \delta < \frac{1}{2}$,

$$\mathrm{vol}\, A^{(\ell)}_\epsilon \doteq \mathrm{vol}\, C_\ell(1 - \delta) \doteq 2^{\ell h}.$$

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately $2^{\ell h}$. This is an $\ell$-dimensional volume, so the corresponding side length is $(2^{\ell h})^{1/\ell} = 2^h$. This provides an interpretation of differential entropy" (p. 227).
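The asymptotic equipartition property behind this result is easy to observe numerically. The sketch below (our own illustration, using a gaussian density rather than one of compact support) checks that $-\frac{1}{\ell}\log_2 p(x_1,\ldots,x_\ell)$ concentrates at $h(X)$, so that a long i.i.d. draw is typical:

```python
import math
import random

random.seed(0)
sigma, ell = 2.0, 100_000
# Differential entropy (in bits) of N(0, sigma^2).
h = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)
xs = [random.gauss(0.0, sigma) for _ in range(ell)]
# -(1/ell) log2 p(x_1, ..., x_ell) for the product density, using
# -log2 p(x) = 0.5 log2(2 pi sigma^2) + x^2/(2 sigma^2) * log2(e).
neg_avg_log_p = sum(
    0.5 * math.log2(2 * math.pi * sigma ** 2)
    + (x * x) / (2 * sigma ** 2) * math.log2(math.e)
    for x in xs
) / ell
# neg_avg_log_p lies within a small epsilon of h: the draw is typical,
# and the typical set has volume about 2 ** (ell * h).
```

With $\ell = 10^5$, the empirical value agrees with $h$ to a few thousandths of a bit, illustrating why the smallest high-probability set has volume close to $2^{\ell h}$.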
Appendix B: Proofs of Section 5

Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some $\rho > 0$ and $w \in F$ such that $(w \cdot x_i) \geq \rho$ for $i \in [\ell]$ (by rescaling $w$, this can be seen to work for arbitrarily large $\rho$). The distance of the hyperplane $\{z \in F : (w \cdot z) = \rho\}$ to the origin is $\rho / \|w\|$. Therefore, the optimal hyperplane is obtained by minimizing $\|w\|$ subject to these constraints, that is, by the solution of equation 5.3.
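In the separable case, this optimization has a well-known geometric solution not spelled out in the proof: if $c$ is the point of the convex hull closest to the origin, then $w = \rho c / \|c\|^2$ is feasible (by the optimality condition $c \cdot x_i \geq \|c\|^2$) and shortest. The sketch below uses Gilbert's minimum-norm-point algorithm (our choice of method, not the paper's SMO-style solver; the names are hypothetical):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_norm_point(points, iters=1000, tol=1e-12):
    """Gilbert's algorithm: approximate the point c of conv(points)
    closest to the origin (assumes the hull does not contain 0)."""
    c = list(points[0])
    for _ in range(iters):
        x = min(points, key=lambda p: dot(c, p))   # support point of the hull
        if dot(c, c) - dot(c, x) < tol:            # optimality: c.x_i >= ||c||^2
            break
        d = [ci - xi for ci, xi in zip(c, x)]
        t = max(0.0, min(1.0, dot(c, d) / dot(d, d)))  # exact line search
        c = [ci - t * di for ci, di in zip(c, d)]
    return c

def optimal_hyperplane(points, rho=1.0):
    """w = rho * c / ||c||^2 satisfies (w . x_i) >= rho for all i and is
    the shortest such w, i.e., it solves the problem of equation 5.3."""
    c = min_norm_point(points)
    s = rho / dot(c, c)
    return [s * ci for ci in c]
```

For points in the positive quadrant, such as $\{(1,1), (2,1), (1,2)\}$, the closest hull point is $(1,1)$ and the resulting $w = (0.5, 0.5)$ meets every constraint with equality at the support vector.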
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that $(-w, \rho)$ parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by $(w, \rho)$. This provides an optimal separation of the two sets, with distance $2\rho$, and a separating hyperplane $(w, 0)$.

Ad (ii). By assumption, $w$ is the shortest vector satisfying $y_i(w \cdot x_i) \geq \rho$ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying $(w \cdot y_i x_i) \geq \rho$ for $i \in [\ell]$.
Proof Sketch (Proposition 3). When changing $\rho$, the term $\sum_i \xi_i$ in equation 3.4 will change proportionally to the number of points that have a nonzero $\xi_i$ (the outliers), plus, when changing $\rho$ in the positive direction, the number of points that are just about to get a nonzero $\xi_i$, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
Proof (Proposition 4). Suppose $x_o$ is an outlier, that is, $\xi_o > 0$; hence, by the KKT conditions (Bertsekas, 1995), $\alpha_o = 1/(\nu\ell)$. Transforming it into $x'_o := x_o + \delta \cdot w$, where $|\delta| < \xi_o / \|w\|^2$, leads to a slack that is still nonzero, that is, $\xi'_o > 0$; hence, we still have $\alpha_o = 1/(\nu\ell)$. Therefore, $\alpha' = \alpha$ is still feasible, as is the primal solution $(w', \xi', \rho')$. Here, we use $\xi'_i = (1 + \delta \cdot \alpha_o)\xi_i$ for $i \neq o$, $w' = (1 + \delta \cdot \alpha_o)w$, and $\rho'$ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still $\alpha'_o = 1/(\nu\ell)$. Thus (Bertsekas, 1995), $\alpha$ is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class $F$ of functions that map $\mathcal{X}$ to $\mathbb{R}$.

Definition 3. Let $(\mathcal{X}, d)$ be a pseudometric space, and let $A$ be a subset of $\mathcal{X}$ and $\epsilon > 0$. A set $U \subseteq \mathcal{X}$ is an $\epsilon$-cover for $A$ if, for every $a \in A$, there exists $u \in U$ such that $d(a, u) \leq \epsilon$. The $\epsilon$-covering number of $A$, $N(\epsilon, A, d)$, is the minimal cardinality of an $\epsilon$-cover for $A$ (if there is no such finite cover, then it is defined to be $\infty$).

Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional but will not change the bounds we use. It is, however, technically useful in the proofs.
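For a finite set, an $\epsilon$-cover in the sense of definition 3 can be built greedily; the size of any such cover upper-bounds the covering number $N(\epsilon, A, d)$. A minimal sketch (our own illustration, not part of the original analysis):

```python
def greedy_eps_cover(points, eps, metric=lambda a, b: abs(a - b)):
    """Greedily build an eps-cover U of the finite set `points`: scan the
    points and keep any point farther than eps from everything kept so far.
    Every point then has a cover element within eps of it (note the 'less
    than or equal to' convention of definition 3), and len(U) is an upper
    bound on the covering number N(eps, A, d)."""
    cover = []
    for a in points:
        if all(metric(a, u) > eps for u in cover):
            cover.append(a)
    return cover
```

For example, covering $A = \{0, 0.1, 0.2, 1.0\}$ on the line with $\epsilon = 0.15$, the greedy scan keeps $0$, $0.2$, and $1.0$.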
1466 Bernhard Scholkopf et al
The idea is that U should be nite but approximate all of A with respectto the pseudometric d We will use the l1 norm over a nite sample X D(x1 x`) for the pseudonorm on F
k f klX1D max
x2X| f (x) | (B2)
(The (pseudo)-norm induces a (pseudo)-metric in the usual way) SupposeX is a compact subset of X Let N ( 2 F ) D maxX2X ` N ( 2 F lX1 ) (themaximum exists by denition of the covering number and compactnessof X )
We require a technical restriction on the function class referred to assturdiness which is satised by standard classes such as kernel-based linearfunction classes and neural networks with bounded weights (see Shawe-Taylor amp Wiliamson 1999 and Scholkopf Platt Shawe-Taylor Smola ampWilliamson 1999 for further details)
Below the notation dte denotes the smallest integer cedil t Several of theresults stated claim that some event occurs with probability 1 iexcl d In allcases 0 lt d lt 1 is understood and d is presumed in fact to be small
Theorem 2 Consider any distribution P on X and a sturdy real-valued functionclass F on X Suppose x1 x` are generated iid from P Then with probability1 iexcl d over such an -sample for any f 2 F and for any c gt 0
Praquo
x f (x) lt mini2[ ]
f (xi) iexcl 2c
frac14middot 2 ( k d) D 2
`(k C log `
d)
where k D dlog N (c F 2 )e
The proof which is given in Scholkopf Platt et al (1999) uses standardtechniques of VC theory symmetrization via a ghost sample application ofthe union bound and a permutation argument
Although theorem 2 provides the basis of our analysis it suffers froma weakness that a single perhaps unusual point may signicantly reducethe minimum mini2[ ] f (xi ) We therefore now consider a technique that willenable us to ignore some of the smallest values in the set f f (xi)xi 2 Xg ata corresponding cost to the complexity of the function class The techniquewas originally considered for analyzing soft margin classication in Shawe-Taylor and Cristianini (1999) but we will adapt it for the unsupervisedproblem we are considering here
We will remove minimal points by increasing their output values Thiscorresponds to having a nonzero slack variable ji in the algorithm wherewe use the class of linear functions in feature space in the application of thetheorem There are well-known bounds for the log covering numbers of thisclass We measure the increase in terms of D (denition 2)
Estimating the Support of a High-Dimensional Distribution 1467
Let X be an inner product space The following denition introduces anew inner product space based on X
Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by
f cent g DX
x2supp( f )
f (x)g(x)
The 1-norm on L( X ) is dened by k fk1 DP
x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby
dx(y) Draquo
1 if y D xI0 otherwise
For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by
gf (y) D gXhf (y) D
X
x2X
d(x f h )dx(y)
Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh
f 2LB ( X ) (ie
Px2X d(x f h ) middot B)
Pcopyx f (x) lt h iexcl 2c
ordfmiddot 2
`(k C log `
d) (B3)
where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )
uml
The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)
The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions
The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the
1468 Bernhard Scholkopf et al
log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples
In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article
Acknowledgments
This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions
References
Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182
Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for
optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press
Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364
Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley
Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997
Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488
Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078
Estimating the Support of a High-Dimensional Distribution 1469
Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46
Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210
Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480
Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270
Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13
Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press
Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore
Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag
Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg
Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371
Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press
Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)
Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81
Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881
Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24
Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339
Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press
Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress
1470 Bernhard Scholkopf et al
Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87
Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research
Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress
Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245
Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany
where the max is not necessarily unique. Now, while $C^{\mathcal{C}}(\alpha)$ is clearly different from $C_\ell(\alpha)$, it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, $C^{\mathcal{C}}(\alpha)$ is more closely related to the determination of density contour clusters at level $\alpha$:

$$c_p(\alpha) = \{x : p(x) \geq \alpha\}.$$
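To make the level-set definition concrete, here is a small sketch (our illustration, not from the paper) that approximates $c_p(\alpha)$ for a standard normal density by thresholding a handwritten kernel density estimate on a grid; the sample size, bandwidth, and grid are arbitrary illustrative choices.

```python
import numpy as np

# Estimate the density contour cluster c_p(alpha) = {x : p(x) >= alpha}
# for p = N(0, 1) by thresholding a Gaussian kernel density estimate.
rng = np.random.default_rng(0)
sample = rng.normal(size=2000)
h = 0.2                                   # KDE bandwidth (assumed, not tuned)
grid = np.linspace(-4, 4, 801)

# p_hat(x) = (1/n) sum_i N(x; x_i, h^2)
diff = (grid[:, None] - sample[None, :]) / h
p_hat = np.exp(-0.5 * diff**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

alpha = 0.1
cluster = grid[p_hat >= alpha]            # grid approximation of c_p(alpha)
# For the true N(0,1) density, {p >= 0.1} is an interval [-t, t], t ~ 1.66
print(cluster.min(), cluster.max())
```

With enough samples, the recovered set is a single interval whose endpoints approach those of the true level set; for multimodal densities the same thresholding yields several disjoint contour clusters.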
Simultaneously and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating $c_p(\alpha)$. They too made use of VC classes, but stated their results in a stronger form, which is meaningful for finite sample sizes.
Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that $X$ is one-dimensional. Suppose $X$ is a one-dimensional random variable with density $p$. Let $S = C(1)$ be the support of $p$ and define the differential entropy of $X$ by $h(X) = -\int_S p(x) \log_2 p(x)\,dx$. For $\epsilon > 0$ and $\ell \in \mathbb{N}$, define the typical set $A_\epsilon^{(\ell)}$ with respect to $p$ by

$$A_\epsilon^{(\ell)} = \left\{(x_1, \ldots, x_\ell) \in S^\ell : \left| -\tfrac{1}{\ell} \log_2 p(x_1, \ldots, x_\ell) - h(X) \right| \leq \epsilon \right\},$$

where $p(x_1, \ldots, x_\ell) = \prod_{i=1}^{\ell} p(x_i)$. If $(a_\ell)_\ell$ and $(b_\ell)_\ell$ are sequences, the notation $a_\ell \doteq b_\ell$ means $\lim_{\ell \to \infty} \tfrac{1}{\ell} \log_2 \tfrac{a_\ell}{b_\ell} = 0$. Cover and Thomas (1991) show that for all $\epsilon, \delta < \tfrac{1}{2}$,

$$\operatorname{vol} A_\epsilon^{(\ell)} \doteq \operatorname{vol} C_\ell(1 - \delta) \doteq 2^{\ell h}.$$

They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately $2^{\ell h}$. This is an $\ell$-dimensional volume, so the corresponding side length is $(2^{\ell h})^{1/\ell} = 2^h$. This provides an interpretation of differential entropy" (p. 227).
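The equivalence can be checked in closed form for a simple case. In the sketch below (our example, not Cover and Thomas's), $p$ is uniform on $[0, a]$, so $h(X) = \log_2 a$ and the smallest subset of $[0, a]^\ell$ holding mass $1 - \delta$ has volume $(1 - \delta) a^\ell$; the normalized log-ratio $\tfrac{1}{\ell} \log_2 (\operatorname{vol} C_\ell(1 - \delta) / 2^{\ell h}) = \log_2(1 - \delta)/\ell$ vanishes as $\ell$ grows, which is exactly what $\doteq$ requires.

```python
import math

# Closed-form check of vol C_l(1-delta) =. 2^{l h} for p uniform on [0, a]:
# the product distribution is uniform on the cube [0, a]^l, so the smallest
# set of mass 1-delta has volume (1 - delta) * a**l, and h(X) = log2(a).
a, delta = 3.0, 0.5
h = math.log2(a)
for ell in (10, 100, 1000):
    log_vol = math.log2(1 - delta) + ell * math.log2(a)   # log2 vol C_l(1-delta)
    normalized_gap = (log_vol - ell * h) / ell            # (1/l) log2(vol / 2^{l h})
    print(ell, normalized_gap)                            # equals log2(1-delta)/l -> 0
```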
Appendix B Proofs of Section 5
Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995).

Moreover, separability implies that there actually exists some $\rho > 0$ and $w \in F$ such that $(w \cdot x_i) \geq \rho$ for $i \in [\ell]$ (by rescaling $w$, this can be seen to work for arbitrarily large $\rho$). The distance of the hyperplane $\{z \in F : (w \cdot z) = \rho\}$ to the origin is $\rho / \|w\|$. Therefore, the optimal hyperplane is obtained by minimizing $\|w\|$ subject to these constraints, that is, by the solution of equation 5.3.
Proof (Proposition 2). Ad (i). By construction, the separation of equation 5.6 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for, if it did not, we could obtain another optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995).

Next, observe that $(-w, \rho)$ parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by $(w, \rho)$. This provides an optimal separation of the two sets, with distance $2\rho$, and a separating hyperplane $(w, 0)$.

Ad (ii). By assumption, $w$ is the shortest vector satisfying $y_i (w \cdot x_i) \geq \rho$ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying $(w \cdot y_i x_i) \geq \rho$ for $i \in [\ell]$.
Proof Sketch (Proposition 3). When changing $\rho$, the term $\sum_i \xi_i$ in equation 3.4 will change proportionally to the number of points that have a nonzero $\xi_i$ (the outliers), plus, when changing $\rho$ in the positive direction, the number of points that are just about to get a nonzero $\xi_i$, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that, since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).
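Parts i and ii can also be checked numerically. The sketch below is not the paper's SMO solver: it solves the dual problem (minimize $\tfrac{1}{2} \alpha^\top K \alpha$ subject to $0 \leq \alpha_i \leq 1/(\nu\ell)$ and $\sum_i \alpha_i = 1$) by plain projected gradient on a toy Gaussian sample, with an arbitrary kernel width, and then counts outliers and SVs.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
nu, ell = 0.2, len(X)
C = 1.0 / (nu * ell)                       # upper bound on each alpha_i

sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                      # Gaussian kernel, width 1 (arbitrary)

def project(a):
    """Euclidean projection onto {0 <= a_i <= C, sum_i a_i = 1} (bisect the shift)."""
    lo, hi = a.min() - C, a.max()
    for _ in range(100):
        t = (lo + hi) / 2.0
        if np.clip(a - t, 0.0, C).sum() > 1.0:
            lo = t
        else:
            hi = t
    return np.clip(a - (lo + hi) / 2.0, 0.0, C)

alpha = project(np.full(ell, 1.0 / ell))
step = 1.0 / np.linalg.eigvalsh(K)[-1]     # safe step size 1 / lambda_max(K)
for _ in range(5000):                      # projected gradient on 1/2 a^T K a
    alpha = project(alpha - step * (K @ alpha))

f = K @ alpha                              # f(x_i) = sum_j alpha_j k(x_j, x_i)
margin = (alpha > 1e-6) & (alpha < C - 1e-6)
rho = f[margin].mean() if margin.any() else f[alpha > 1e-6].max()
outlier_frac = (f < rho - 1e-3).mean()     # points with nonzero slack
sv_frac = (alpha > 1e-6).mean()
print(outlier_frac, sv_frac)               # expect: outlier_frac <= nu <= sv_frac
```

Up to numerical tolerance, the printed fractions bracket $\nu$ from below and above, as parts i and ii assert.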
Proof (Proposition 4). Suppose $x_o$ is an outlier, that is, $\xi_o > 0$; hence, by the KKT conditions (Bertsekas, 1995), $\alpha_o = 1/(\nu\ell)$. Transforming it into $x_o' = x_o + \delta \cdot w$, where $|\delta| < \xi_o / \|w\|$, leads to a slack that is still nonzero, that is, $\xi_o' > 0$; hence, we still have $\alpha_o = 1/(\nu\ell)$. Therefore, $\boldsymbol{\alpha}' = \boldsymbol{\alpha}$ is still feasible, as is the primal solution $(w', \boldsymbol{\xi}', \rho')$. Here, we use $\xi_i' = (1 + \delta \cdot \alpha_o) \xi_i$ for $i \neq o$, $w' = (1 + \delta \cdot \alpha_o) w$, and $\rho'$ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still $\alpha_o' = 1/(\nu\ell)$. Thus (Bertsekas, 1995), $\boldsymbol{\alpha}$ is still the optimal solution.
We now move on to the proof of theorem 1. We start by introducing a common tool for measuring the capacity of a class $\mathcal{F}$ of functions that map $\mathcal{X}$ to $\mathbb{R}$.
Definition 3. Let $(\mathcal{X}, d)$ be a pseudometric space, let $A$ be a subset of $\mathcal{X}$, and let $\epsilon > 0$. A set $U \subseteq \mathcal{X}$ is an $\epsilon$-cover for $A$ if, for every $a \in A$, there exists $u \in U$ such that $d(a, u) \leq \epsilon$. The $\epsilon$-covering number of $A$, $N(\epsilon, A, d)$, is the minimal cardinality of an $\epsilon$-cover for $A$ (if there is no such finite cover, then it is defined to be $\infty$).

Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional, but will not change the bounds we use. It is, however, technically useful in the proofs.
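As a toy illustration of definition 3 (our example), the following brute force finds the smallest $\epsilon$-cover of a four-point subset of the real line when the candidate centers are restricted to the set itself; since the definition allows centers anywhere in $\mathcal{X}$, this restricted search yields an upper bound on $N(\epsilon, A, d)$.

```python
from itertools import combinations

# Smallest epsilon-cover of A with centers drawn from A itself (an upper
# bound on the covering number N(epsilon, A, d) of definition 3).
A = [0.0, 1.0, 2.0, 10.0]
eps = 1.0
d = lambda a, u: abs(a - u)               # metric on the real line

def covering_number_upper_bound(A, eps):
    for k in range(1, len(A) + 1):        # try cover sizes in increasing order
        for U in combinations(A, k):
            if all(any(d(a, u) <= eps for u in U) for a in A):
                return k, U
    # unreachable: A always covers itself for any eps >= 0

k, U = covering_number_upper_bound(A, eps)
print(k, U)   # -> 2: e.g. centers {1.0, 10.0} cover A at scale eps = 1
```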
The idea is that $U$ should be finite but approximate all of $A$ with respect to the pseudometric $d$. We will use the $l_\infty$ norm over a finite sample $X = (x_1, \ldots, x_\ell)$ for the pseudonorm on $\mathcal{F}$:

$$\|f\|_{l_\infty^X} = \max_{x \in X} |f(x)|. \tag{B.2}$$

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose $\mathcal{X}$ is a compact subset of $X$. Let $N(\epsilon, \mathcal{F}, \ell) = \max_{X \in \mathcal{X}^\ell} N(\epsilon, \mathcal{F}, l_\infty^X)$ (the maximum exists by definition of the covering number and compactness of $\mathcal{X}$).
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).

Below, the notation $\lceil t \rceil$ denotes the smallest integer $\geq t$. Several of the results stated claim that some event occurs with probability $1 - \delta$. In all cases, $0 < \delta < 1$ is understood, and $\delta$ is presumed in fact to be small.
Theorem 2. Consider any distribution $P$ on $\mathcal{X}$ and a sturdy real-valued function class $\mathcal{F}$ on $\mathcal{X}$. Suppose $x_1, \ldots, x_\ell$ are generated i.i.d. from $P$. Then with probability $1 - \delta$ over such an $\ell$-sample, for any $f \in \mathcal{F}$ and for any $\gamma > 0$,

$$P\left\{x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma\right\} \leq \epsilon(\ell, k, \delta) = \frac{2}{\ell}\left(k + \log_2 \frac{\ell}{\delta}\right),$$

where $k = \lceil \log_2 N(\gamma, \mathcal{F}, 2\ell) \rceil$.
The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
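To get a feel for how the bound of theorem 2 scales, one can plug in illustrative numbers (the values of $\ell$, $k$, and $\delta$ below are our own arbitrary choices; in practice, $k$ comes from a covering number bound):

```python
import math

# epsilon(l, k, delta) = (2/l) * (k + log2(l/delta)) from theorem 2
def epsilon(ell, k, delta):
    return (2.0 / ell) * (k + math.log2(ell / delta))

print(epsilon(1000, 50, 0.01))    # ~0.133: mass below the minimum minus 2*gamma
print(epsilon(10000, 50, 0.01))   # tighter with more data, roughly O(k/l)
```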
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum $\min_{i \in [\ell]} f(x_i)$. We therefore now consider a technique that will enable us to ignore some of the smallest values in the set $\{f(x_i) : x_i \in X\}$ at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.

We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable $\xi_i$ in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of $D$ (definition 2).
Let $\mathcal{X}$ be an inner product space. The following definition introduces a new inner product space based on $\mathcal{X}$.

Definition 4. Let $L(\mathcal{X})$ be the set of real-valued, nonnegative functions $f$ on $\mathcal{X}$ with support $\operatorname{supp}(f)$ countable; that is, functions in $L(\mathcal{X})$ are nonzero for at most countably many points. We define the inner product of two functions $f, g \in L(\mathcal{X})$ by

$$f \cdot g = \sum_{x \in \operatorname{supp}(f)} f(x) g(x).$$

The 1-norm on $L(\mathcal{X})$ is defined by $\|f\|_1 = \sum_{x \in \operatorname{supp}(f)} f(x)$, and let $L_B(\mathcal{X}) = \{f \in L(\mathcal{X}) : \|f\|_1 \leq B\}$. Now we define an embedding of $\mathcal{X}$ into the inner product space $\mathcal{X} \times L(\mathcal{X})$ via $\tau : \mathcal{X} \to \mathcal{X} \times L(\mathcal{X})$, $\tau : x \mapsto (x, \delta_x)$, where $\delta_x \in L(\mathcal{X})$ is defined by

$$\delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases}$$

For a function $f \in \mathcal{F}$, a set of training examples $X$, and $\theta \in \mathbb{R}$, we define the function $g_f \in L(\mathcal{X})$ by

$$g_f(y) = g_f^{X,\theta}(y) = \sum_{x \in X} d(x, f, \theta)\, \delta_x(y).$$
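Since functions in $L(\mathcal{X})$ have countable (here, finite) support, they can be represented as dictionaries from points to their nonzero values. The sketch below (the helper names are ours, not the paper's) implements the inner product, the 1-norm, and the $\delta_x$ component of the embedding $\tau$:

```python
# Finitely supported nonnegative functions as dicts {point: value}.
def inner(f, g):
    """f . g = sum over supp(f) of f(x) * g(x)."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    """||f||_1 = sum over supp(f) of f(x)."""
    return sum(f.values())

def delta(x):
    """The second component of tau(x): the indicator delta_x."""
    return {x: 1.0}

f = {"a": 2.0, "b": 1.0}
g = {"b": 3.0, "c": 5.0}
print(inner(f, g))           # only the shared support point "b" contributes: 3.0
print(one_norm(f))           # 3.0, so f lies in L_B(X) for any B >= 3
print(inner(delta("b"), f))  # pairing with delta_x evaluates f at x: 1.0
```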
Theorem 3. Fix $B > 0$. Consider a fixed but unknown probability distribution $P$, which has no atomic components, on the input space $\mathcal{X}$, and a sturdy class of real-valued functions $\mathcal{F}$. Then with probability $1 - \delta$ over randomly drawn training sequences $X$ of size $\ell$, for all $\gamma > 0$ and any $f \in \mathcal{F}$ and any $\theta$ such that $g_f = g_f^{X,\theta} \in L_B(\mathcal{X})$ (i.e., $\sum_{x \in X} d(x, f, \theta) \leq B$),

$$P\left\{x : f(x) < \theta - 2\gamma\right\} \leq \frac{2}{\ell}\left(k + \log_2 \frac{\ell}{\delta}\right), \tag{B.3}$$

where $k = \left\lceil \log_2 N(\gamma/2, \mathcal{F}, 2\ell) + \log_2 N(\gamma/2, L_B(\mathcal{X}), 2\ell) \right\rceil$.
The assumption on $P$ can in fact be weakened to require only that there is no point $x \in \mathcal{X}$ satisfying $f(x) < \theta - 2\gamma$ that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which $f(x)$ has value less than $\theta - 2\gamma$, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by $2\gamma$ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.
The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to $\gamma$) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when $\gamma$ tends to 0) and the number of training examples.
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity $k$ in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound $\log_2 N(\gamma/2, \mathcal{F}, 2\ell)$ and a result of Shawe-Taylor and Cristianini (2000) to bound $\log_2 N(\gamma/2, L_B(\mathcal{X}), 2\ell)$. We apply theorem 3, stratifying over possible values of $B$. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1, stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functional of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Diettrich, & P. Anuzis (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.
Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning
Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press
Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76
Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge
Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto
Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969
Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag
Vapnik V (1998) Statistical learning theory New York Wiley
Estimating the Support of a High-Dimensional Distribution 1471
Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press
Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)
Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24
Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)
Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman
Received November 19 1999 accepted September 22 2000
1466 Bernhard Scholkopf et al
The idea is that $U$ should be finite but approximate all of $A$ with respect to the pseudometric $d$. We will use the $l_\infty$ norm over a finite sample $X = (x_1, \ldots, x_\ell)$ for the pseudonorm on $\mathcal{F}$,

$$\| f \|_{l_\infty^X} := \max_{x \in X} | f(x) |. \tag{B.2}$$

(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose $\mathcal{X}$ is a compact subset of $X$. Let $\mathcal{N}(\gamma, \mathcal{F}, \ell) = \max_{X \in \mathcal{X}^\ell} \mathcal{N}(\gamma, \mathcal{F}, l_\infty^X)$ (the maximum exists by definition of the covering number and compactness of $\mathcal{X}$).
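For intuition, the $l_\infty$ pseudometric on a sample and a $\gamma$-cover of a function class under it can be sketched numerically. The greedy cover below merely illustrates the quantity $\mathcal{N}(\gamma, \mathcal{F}, l_\infty^X)$ on a toy class of linear functions; the class, the sample, and the greedy strategy are illustrative assumptions, not part of the analysis.

```python
def linf_sample_dist(f, g, X):
    """Empirical l_infinity pseudometric: max over the sample of |f(x) - g(x)|."""
    return max(abs(f(x) - g(x)) for x in X)

def greedy_cover_size(functions, X, gamma):
    """Size of a greedily built gamma-cover of `functions` under the
    l_infinity pseudometric on the sample X.  This upper-bounds the
    covering number N(gamma, F, l_infinity^X) on that sample."""
    centers = []
    for f in functions:
        # Keep f as a new center only if it is gamma-far from all current centers.
        if all(linf_sample_dist(f, c, X) > gamma for c in centers):
            centers.append(f)
    return len(centers)

# Toy class: linear functions x -> w*x for 21 equally spaced weights.
X = [0.0, 0.5, 1.0]
funcs = [lambda x, w=w: w * x for w in (i / 10 for i in range(-10, 11))]
print(greedy_cover_size(funcs, X, gamma=0.25))
```

As expected, coarser scales $\gamma$ need fewer cover elements, which is what makes the covering-number terms in the bounds below shrink as $\gamma$ grows.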
We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details).

Below, the notation $\lceil t \rceil$ denotes the smallest integer $\geq t$. Several of the results stated claim that some event occurs with probability $1 - \delta$. In all cases, $0 < \delta < 1$ is understood, and $\delta$ is presumed in fact to be small.
Theorem 2. Consider any distribution $P$ on $X$ and a sturdy real-valued function class $\mathcal{F}$ on $X$. Suppose $x_1, \ldots, x_\ell$ are generated i.i.d. from $P$. Then with probability $1 - \delta$ over such an $\ell$-sample, for any $f \in \mathcal{F}$ and for any $\gamma > 0$,

$$P\left\{ x : f(x) < \min_{i \in [\ell]} f(x_i) - 2\gamma \right\} \;\leq\; \epsilon(\ell, k, \delta) = \frac{2}{\ell}\left( k + \log \frac{\ell}{\delta} \right),$$

where $k = \lceil \log \mathcal{N}(\gamma, \mathcal{F}, 2\ell) \rceil$.

The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.
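To see the rate at which the bound shrinks, the right-hand side $\epsilon(\ell, k, \delta) = \frac{2}{\ell}(k + \log\frac{\ell}{\delta})$ can be evaluated directly. The sketch below assumes the natural logarithm and illustrative values of $k$; in the theorem, $k$ comes from a log covering number of the function class.

```python
import math

def theorem2_bound(ell, k, delta):
    """Right-hand side of Theorem 2: (2/ell) * (k + log(ell/delta)).
    The natural logarithm is an assumption made for illustration."""
    return (2.0 / ell) * (k + math.log(ell / delta))

# Illustrative values: k (a ceil'd log covering number) held fixed,
# so the bound decays roughly like 1/ell as the sample grows.
for ell in (100, 1000, 10000):
    print(ell, theorem2_bound(ell, k=25, delta=0.05))
```

The print-out makes the trade-off visible: the bound is dominated by $k/\ell$, so a richer function class (larger $k$) requires proportionally more training examples for the same guarantee.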
Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual, point may significantly reduce the minimum $\min_{i \in [\ell]} f(x_i)$. We therefore now consider a technique that will enable us to ignore some of the smallest values in the set $\{ f(x_i) : x_i \in X \}$ at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here.

We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable $\xi_i$ in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of $D$ (definition 2).
Let $X$ be an inner product space. The following definition introduces a new inner product space based on $X$.
Definition 4. Let $L(X)$ be the set of real-valued, nonnegative functions $f$ on $X$ with support $\mathrm{supp}(f)$ countable; that is, functions in $L(X)$ are nonzero for at most countably many points. We define the inner product of two functions $f, g \in L(X)$ by

$$f \cdot g = \sum_{x \in \mathrm{supp}(f)} f(x) g(x).$$

The 1-norm on $L(X)$ is defined by $\| f \|_1 = \sum_{x \in \mathrm{supp}(f)} f(x)$, and let $L_B(X) := \{ f \in L(X) : \| f \|_1 \leq B \}$. Now we define an embedding of $X$ into the inner product space $X \times L(X)$ via $\tau : X \to X \times L(X)$, $\tau : x \mapsto (x, \delta_x)$, where $\delta_x \in L(X)$ is defined by

$$\delta_x(y) = \begin{cases} 1 & \text{if } y = x; \\ 0 & \text{otherwise.} \end{cases}$$
For a function $f \in \mathcal{F}$, a set of training examples $X$, and $\theta \in \mathbb{R}$, we define the function $g_f \in L(X)$ by

$$g_f(y) = g^{X,\theta}_f(y) := \sum_{x \in X} d(x, f, \theta)\, \delta_x(y).$$
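Because functions in $L(X)$ have countable (here, finite) support, they can be represented as sparse dictionaries mapping points to values. The sketch below implements the inner product, the 1-norm, $\delta_x$, and $g_f$; the particular choice $d(x, f, \theta) = \max(0, \theta - f(x))$ is an assumption standing in for definition 2, chosen to match the slack-variable role described above.

```python
def inner(f, g):
    """Inner product on L(X): sum over the (finite) support of f."""
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    """1-norm on L(X): sum of the values on the support."""
    return sum(f.values())

def delta(x):
    """The point mass delta_x in L(X), as a one-entry dict."""
    return {x: 1.0}

def g_f(f, X, theta, d):
    """g_f(y) = sum_{x in X} d(x, f, theta) * delta_x(y), as a sparse dict."""
    out = {}
    for x in X:
        w = d(x, f, theta)
        if w != 0.0:
            out[x] = out.get(x, 0.0) + w
    return out

# Assumed shortfall d(x, f, theta) = max(0, theta - f(x)) and a toy f.
d = lambda x, f, theta: max(0.0, theta - f(x))
f = lambda x: x
X = [0.2, 0.6, 1.5]
g = g_f(f, X, theta=1.0, d=d)
print(g, one_norm(g))  # support only where f(x) falls below theta
```

The membership condition $g_f \in L_B(X)$ in the next theorem then simply says that the total shortfall `one_norm(g)` is at most $B$.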
Theorem 3. Fix $B > 0$. Consider a fixed but unknown probability distribution $P$ which has no atomic components on the input space $X$ and a sturdy class of real-valued functions $\mathcal{F}$. Then with probability $1 - \delta$ over randomly drawn training sequences $X$ of size $\ell$, for all $\gamma > 0$ and any $f \in \mathcal{F}$ and any $\theta$ such that $g_f = g^{X,\theta}_f \in L_B(X)$ (i.e., $\sum_{x \in X} d(x, f, \theta) \leq B$),

$$P\left\{ x : f(x) < \theta - 2\gamma \right\} \;\leq\; \frac{2}{\ell}\left( k + \log \frac{\ell}{\delta} \right), \tag{B.3}$$

where $k = \left\lceil \log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell) + \log \mathcal{N}(\gamma/2, L_B(X), 2\ell) \right\rceil$.

(The assumption on $P$ can in fact be weakened to require only that there is no point $x \in X$ satisfying $f(x) < \theta - 2\gamma$ that has discrete probability.) The proof is given in Schölkopf, Platt, et al. (1999).
The theorem bounds the probability of a new example falling in the region for which $f(x)$ has value less than $\theta - 2\gamma$, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by $2\gamma$ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions.

The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to $\gamma$) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when $\gamma$ tends to 0) and the number of training examples.
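The shifted region described above can be made concrete: a point counts as inside the support estimate only if the decision value clears the threshold lowered by $2\gamma$. The kernel expansion, coefficients, kernel width, and threshold below are made-up illustrative values, not a trained model.

```python
import math

def decision(x, svs, alphas, width=1.0):
    """Kernel expansion f(x) = sum_i alpha_i k(x_i, x), Gaussian kernel assumed."""
    return sum(a * math.exp(-((x - s) ** 2) / width) for s, a in zip(svs, alphas))

def in_support_estimate(x, svs, alphas, theta, gamma):
    """Region whose complement Theorem 3 bounds: f(x) >= theta - 2*gamma."""
    return decision(x, svs, alphas) >= theta - 2.0 * gamma

# Hypothetical expansion points and coefficients.
svs, alphas = [0.0, 1.0, 2.0], [0.5, 0.3, 0.2]
theta = 0.4
print(in_support_estimate(1.0, svs, alphas, theta, gamma=0.05))  # near the data
print(in_support_estimate(6.0, svs, alphas, theta, gamma=0.05))  # far from the data
```

Lowering the threshold by $2\gamma$ enlarges the accepted region slightly; it is exactly this enlarged region whose complement carries the probability bound of equation B.3.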
In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity $k$ in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound $\log \mathcal{N}(\gamma/2, \mathcal{F}, 2\ell)$ and a result of Shawe-Taylor and Cristianini (2000) to bound $\log \mathcal{N}(\gamma/2, L_B(X), 2\ell)$. We apply theorem 3, stratifying over possible values of $B$. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1 stated in the main part of this article.
Acknowledgments
This work was supported by the ARC, the DFG (Ja 379/9-1 and Sm 62/1-1), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.
References

Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.

Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.

Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.

Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.

Gayraud, G. (1997). Estimation of functionals of density support. Mathematical Methods of Statistics, 6(1), 26–46.

Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.

Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.

Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.

Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.

Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.

Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.

Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.

Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.

Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.

Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.

Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.

Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.

Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.

Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.

Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.

Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.

Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.

Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.

Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.

Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)

Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.

Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).

Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufman.
Received November 19, 1999; accepted September 22, 2000.
Estimating the Support of a High-Dimensional Distribution 1467
Let X be an inner product space The following denition introduces anew inner product space based on X
Denition 4 Let L( X ) be the set of real valued nonnegative functions f on Xwith support supp( f ) countable that is functions in L( X ) are nonzero for at mostcountably many points We dene the inner product of two functions f g 2 L( X )by
f cent g DX
x2supp( f )
f (x)g(x)
The 1-norm on L( X ) is dened by k fk1 DP
x2supp( f ) f (x) and let LB ( X ) D f f 2L( X ) k fk1 middot Bg Now we dene an embedding of X into the inner product spaceX pound L( X ) via t X X pound L( X ) t x 7 (x dx ) where dx 2 L( X ) is denedby
dx(y) Draquo
1 if y D xI0 otherwise
For a function f 2 F a set of training examples X and h 2 R we dene thefunction gf 2 L( X ) by
gf (y) D gXhf (y) D
X
x2X
d(x f h )dx(y)
Theorem 3 Fix B gt 0 Consider a xed but unknown probability distributionP which has no atomic components on the input space X and a sturdy class of real-valued functions F Then with probability 1 iexcl d over randomly drawn trainingsequences X of size for all c gt 0 and any f 2 F and any h such that gf D gXh
f 2LB ( X ) (ie
Px2X d(x f h ) middot B)
Pcopyx f (x) lt h iexcl 2c
ordfmiddot 2
`(k C log `
d) (B3)
where k Dsectlog N (c 2 F 2 ) C log N (c 2 LB ( X ) 2 )
uml
The assumption on P can in fact be weakened to require only that there isno point x 2 X satisfying f (x) lt h iexcl 2c that has discrete probability) Theproof is given in Scholkopf Platt et al (1999)
The theorem bounds theprobability of a new example falling in the regionfor which f (x) has value less than h iexcl 2c this being the complement of theestimate for the support of the distribution In the algorithm described inthis article one would use the hyperplane shifted by 2c toward the originto dene the region Note that there is no restriction placed on the class offunctions though these functions could be probability density functions
The result shows that we can bound the probability of points falling out-side the region of estimated support by a quantity involving the ratio of the
1468 Bernhard Scholkopf et al
log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples
In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article
Acknowledgments
This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions
References
Ben-David S amp Lindenbaum M (1997) Learning distributions by their densitylevels A paradigm for learning without a teacher Journal of Computer andSystem Sciences 55 171ndash182
Bertsekas D P (1995) Nonlinear programming Belmont MA Athena ScienticBoser B E Guyon I M amp Vapnik V N (1992) A training algorithm for
optimal margin classiers In D Haussler (Ed) Proceedings of the 5th AnnualACM Workshop on Computational Learning Theory (pp 144ndash152) PittsburghPA ACM Press
Chevalier J (1976) Estimation du support et du contour du support drsquoune loi deprobabilit Acirce Annales de lrsquoInstitutHenri Poincar Acirce Section B Calcul des Probabilit Acirceset Statistique Nouvelle S Acircerie 12(4) 339ndash364
Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley
Cuevas A amp Fraiman R (1997) A plug-in approach to support estimationAnnals of Statistics 25(6) 2300ndash2312 1997
Devroye L amp Wise G L (1980) Detection of abnormal behaviour via nonpara-metric estimation of the support SIAM Journal on Applied Mathematics 38(3)480ndash488
Einmal J H J amp Mason D M (1992) Generalized quantile processes Annalsof Statistics 20(2) 1062ndash1078
Estimating the Support of a High-Dimensional Distribution 1469
Gayraud G (1997) Estimation of functional of density support MathematicalMethods of Statistics 6(1) 26ndash46
Geffroy J (1964) Sur un probleme drsquoestimation g Acirceom Acircetrique Publications delrsquoInstitut de Statistique de lrsquoUniversitAcirce de Paris 13 191ndash210
Girosi F (1998) An equivalence between sparse approximation and supportvector machines Neural Computation 10(6) 1455ndash1480
Hartigan J A (1987) Estimation of a convex density contour in two dimensionsJournal of the American Statistical Association 82 267ndash270
Hayton P Scholkopf B Tarassenko L amp Anuzis P (in press) Support vectornovelty detection applied to jet engine vibration spectra In T Leen T Di-etterich amp V Tresp (Eds) Advances in neural information processing systems13
Joachims T (1999) Making large-scale SVM learning practical In B ScholkopfC J C Burges amp A J Smola (Eds) Advances in kernel methodsmdashSupport vectorlearning (pp 169ndash184) Cambridge MA MIT Press
Keerthi S S Shevade S K Bhattacharyya C amp Murthy K R K (1999)Improvements to Plattrsquos SMO algorithm for SVM classier design (TechRep No CD-99-14) Singapore Department of Mechanical and ProductionEngineering National University of Singapore
Korostelev A P amp Tsybakov A B (1993) Minimax theory of image reconstructionNew York Springer-Verlag
Muller D W (1992) The excess mass approach in statistics Heidelberg Beitragezur Statistik Universitat Heidelberg
Nolan D (1991) The excess mass ellipsoid Journal of Multivariate Analysis 39348ndash371
Platt J (1999) Fast training of support vector machines using sequential mini-mal optimization In B Scholkopf C J C Burges amp A J Smola (Eds) Ad-vances in kernel methodsmdashSupport vector learning (pp 185ndash208) CambridgeMA MIT Press
Poggio T amp Girosi F (1990) Networks for approximation and learning Pro-ceedings of the IEEE 78(9)
Polonik W (1995a) Density estimation under qualitative assumptions in higherdimensions Journal of Multivariate Analysis 55(1) 61ndash81
Polonik W (1995b) Measuring mass concentrations and estimating densitycontour clustersmdashan excess mass approach Annals of Statistics 23(3) 855ndash881
Polonik W (1997) Minimum volume sets and generalized quantile processesStochastic Processes and their Applications 69 1ndash24
Sager T W (1979) An iterative method for estimating a multivariate mode andisopleth Journal of the American Statistical Association 74(366) 329ndash339
Scholkopf B Burges C J C amp Smola A J (1999) Advances in kernel methodsmdashSupport vector learning Cambridge MA MIT Press
Scholkopf B Burges C amp Vapnik V (1995) Extracting support data for a giventask In U M Fayyad amp R Uthurusamy (Eds) Proceedings First InternationalConference on Knowledge Discovery and Data Mining Menlo Park CA AAAIPress
1470 Bernhard Scholkopf et al
Scholkopf B Platt J Shawe-Taylor J Smola A J amp Williamson R C(1999) Estimating the support of a high-dimensional distribution (TechRep No 87) Redmond WA Microsoft Research Available online athttpwwwresearchmicrosoftcomscriptspubsviewaspTR ID=MSR-TR-99-87
Scholkopf B Platt J amp Smola A J (2000) Kernel method for percentile featureextraction (Tech Rep No 22) Redmond WA Microsoft Research
Scholkopf B Smola A amp Muller K R (1999) Kernel principal componentanalysis In B Scholkopf C J C Burges amp A J Smola (Eds) Advances inkernel methodsmdashSupport vector learning (pp 327ndash352) Cambridge MA MITPress
Scholkopf B Smola A Williamson R C amp Bartlett P L (2000) New supportvector algorithms Neural Computation 12 1207ndash1245
Scholkopf B Williamson R Smola A amp Shawe-Taylor J (1999) Single-classsupport vector machines In J Buhmann W Maass H Ritter amp N Tishbyeditors (Eds) Unsupervised learning (Rep No 235) (pp 19ndash20) DagstuhlGermany
Shawe-Taylor J amp Cristianini N (1999) Margin distribution bounds on gen-eralization In Computational Learning Theory 4th European Conference (pp263ndash273) New York Springer-Verlag
Shawe-Taylor J amp Cristianini N (2000) On the generalisation of soft marginalgorithms IEEE Transactions on Information Theory Submitted
Shawe-Taylor J amp Williamson R C (1999) Generalization performance ofclassiers in terms of observed covering numbers In Computational LearningTheory 4th European Conference (pp 274ndash284) New York Springer-Verlag
Smola A amp Scholkopf B (in press) A tutorial on support vector regressionStatistics and Computing
Smola A Scholkopf B amp Muller K-R (1998) The connection between regu-larization operators and support vector kernels Neural Networks 11 637ndash649
Smola A Mika S Scholkopf B amp Williamson R C (in press) Regularizedprincipal manifolds Machine Learning
Still S amp Scholkopf B (in press) Four-legged walking gait control using a neu-romorphic chip interfaced to a support vector learning algorithm In T LeenT Diettrich amp P Anuzis (Eds) Advances in neural information processing sys-tems 13 Cambridge MA MIT Press
Stoneking D (1999) Improving the manufacturability of electronic designsIEEE Spectrum 36(6) 70ndash76
Tarassenko L Hayton P Cerneaz N amp Brady M (1995) Novelty detectionfor the identication of masses in mammograms In Proceedings Fourth IEE In-ternational Conference on Articial Neural Networks (pp 442ndash447) Cambridge
Tax D M J amp Duin R P W (1999) Data domain description by support vectorsIn M Verleysen (Ed) Proceedings ESANN (pp 251ndash256) Brussels D Facto
Tsybakov A B (1997) On nonparametric estimation of density level sets Annalsof Statistics 25(3) 948ndash969
Vapnik V (1995) The nature of statistical learning theory New York Springer-Verlag
Vapnik V (1998) Statistical learning theory New York Wiley
Estimating the Support of a High-Dimensional Distribution 1471
Vapnik V amp Chapelle O (2000) Bounds on error expectation for SVM In A JSmola P L Bartlett B Scholkopf amp D Schuurmans (Eds) Advances in largemargin classiers (pp 261 ndash 280) Cambridge MA MIT Press
Vapnik V amp Chervonenkis A (1974) Theory of pattern recognition NaukaMoscow [In Russian] (German translation W Wapnik amp A TscherwonenkisTheorie der Zeichenerkennung AkademiendashVerlag Berlin 1979)
Vapnik V amp Lerner A (1963) Pattern recognition using generalized portraitmethod Automation and Remote Control 24
Williamson R C Smola A J amp Scholkopf B (1998) Generalization perfor-mance of regularization networks and support vector machines via entropy num-bers of compact operators (Tech Rep No 19) NeuroCOLT Available onlineat httpwwwneurocoltcom 1998 Also IEEE Transactions on InformationTheory (in press)
Williamson R C Smola A J amp Scholkopf B (2000) Entropy numbers of linearfunction classes In N Cesa-Bianchi amp S Goldman (Eds) Proceedings of the13th Annual Conference on Computational Learning Theory (pp 309ndash319) SanMateo CA Morgan Kaufman
Received November 19 1999 accepted September 22 2000
1468 Bernhard Scholkopf et al
log covering numbers (which can be bounded by the fat-shattering dimen-sion at scale proportional to c ) and the number of training examples plus afactor involving the 1-norm of the slack variables The result is stronger thanrelated results given by Ben-David and Lindenbaum (1997) their bound in-volves the square root of the ratio of thePollarddimension (the fat-shatteringdimension when c tends to 0) and the number of training examples
In the specic situation considered in this article where the choice offunctionclass is theclass of linear functions in a kernel-dened feature spaceone can give a more practical expression for the quantity k in theorem 3To this end we use a result of Williamson Smola and Scholkopf (2000)to bound log N (c 2 F 2 ) and a result of Shawe-Taylor and Cristianini(2000) to bound log N (c 2 LB( X ) 2 ) We apply theorem 3 stratifying overpossible values of B The proof is given in Scholkopf Platt et al (1999) andleads to theorem 1 stated in the main part of this article
Acknowledgments
This work was supported by the ARC the DFG (Ja 3799-1 and Sm 621-1) and by the European Commission under the Working Group Nr 27150(NeuroCOLT2) Parts of it were done while B S and A S were with GMDFIRST Berlin Thanks to S Ben-David C Bishop P Hayton N Oliver CSchnorr J Spanjaard L Tarassenko and M Tipping for helpful discussions
References
Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmahl, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functional of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-99-87.
Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. IEEE Transactions on Information Theory. Submitted.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings, Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.
Received November 19, 1999; accepted September 22, 2000.