
A Quick Introduction to Statistical Learning Theory and Support Vector Machines

Rahul B. Warrier

Abstract— This article gives a brief introduction to Statistical Learning Theory (SLT) and the relatively new class of supervised learning algorithms called Support Vector Machines (SVM). The primary objective of this article is to understand the chosen topic using the concepts of linear systems theory discussed in class. To that end, a thorough mathematical analysis of the topic is presented, along with examples that illustrate the concepts being discussed. Finally, the application of SVM to a simple binary classification problem is discussed.

I. INTRODUCTION

Statistical Learning Theory (SLT) forms the basic framework for machine learning. Of the several categories of learning, SLT deals with supervised learning, which involves learning from a training set of data (examples). The training data consists of input-output pairs, where each input maps to an output. The learning problem then consists of finding the function that maps input to output in a predictive fashion, such that the learned function can be used to predict the output for a new set of input data with minimal mistakes (generalization) [6].

The Support Vector Machine (SVM) is a class of supervised learning machines for two-group classification/regression analysis problems. Conceptually, the SVM implements the following: input vectors are non-linearly mapped to a high-dimensional feature space wherein a linear decision surface is constructed [4].

The goal of this article is to briefly describe the mathematical formulation of SLT and, more specifically, of the SVM (applied to the binary classification problem) through the concepts of systems theory. The major concepts that will be highlighted during the course of this discussion are: inner product spaces and normed spaces, Gram matrices, linear/nonlinear mappings, convex optimization, the projection theorem on Hilbert spaces and the minimax problem.

This subject has gained immense popularity in the last few years and ample literature is available on its various aspects. This article mainly follows the seminal work of Vladimir N. Vapnik [1] and his colleagues at AT&T Bell Laboratories [2],[3]. SLT concepts are also discussed succinctly in [6], and a detailed mathematical formulation of SLT and SVM is available in the dissertation by Schölkopf [4]. Geometric interpretations of the SVM algorithm are presented in [8],[9].

This article is organized as follows: Section 2 deals with the formulation of the general learning problem. The Empirical Risk Minimization Induction (ERM) principle is discussed, leading to the Structural Risk Minimization Induction (SRM) principle. Section 3 discusses the problem of binary classification for linearly separable data and the concept of margin (and the essence of the SVM - margin maximization). Section 4 introduces the Support Vector Learning algorithm for linearly separable data. The methodology of the SVM is then extended to data that is not fully linearly separable. We also look at the kernel trick, which allows us to use the SVM to classify nonlinear data. Finally, in Section 5 we look at an application of SVMs to a simple binary classification problem.

Rahul B. Warrier is a Graduate Student in Mechanical Engineering, University of Washington, Seattle, WA-98105 [email protected]

II. STATISTICAL LEARNING THEORY

Statistical Learning Theory mainly deals with the problem of supervised learning. This model of learning involves learning from examples and can be described as follows [1]:

1) (Training data): An independent and identically distributed set of data, {x_i}, i = 1, . . . , n, drawn from a fixed but unknown distribution P(x);

2) (Labeling of data by a Supervisor): A supervisor labels each input vector x with an output vector y, according to a conditional distribution function P(y|x), also fixed but unknown;

3) (Learning machine): A learning machine that can implement a set of functions f(x, α), α ∈ F (α is a vector of scalar parameters).

The goal of supervised learning is to select from among these functions (defined by the specific parameter vector) the one which predicts the supervisor's response in the best possible way. This selection is based on the training data set:

(x_1, y_1), . . . , (x_n, y_n)                                     (1)

A. General setting of the Learning Problem

In order to choose the best approximation to the supervisor's response, we consider the loss or discrepancy, L(y, f(x, α)), α ∈ F, between the supervisor's response y and the learning machine's prediction f(x, α) for input x. The goal then is to minimize the expected value of the loss, the risk functional

R(α) = ∫ L(y, f(x, α)) dP(x, y),   α ∈ F                           (2)

given the training data (1).

B. Empirical Risk Minimization Induction (ERM) Principle

The problem of minimizing the risk functional (2), given the training data (1), is usually solved using the induction principle¹. The expected risk functional R(α) given by (2) is replaced by the empirical risk functional [1],

R_emp(α) = (1/n) ∑_{i=1}^{n} L(y_i, f(x_i, α))                     (3)

which is constructed using the training data (1). The Empirical Risk Minimization Induction (ERM) principle approximates the function L(y, f(x, α_0)), which minimizes the risk (2), by the function L(y, f(x, α_n)), which minimizes the empirical risk (3).
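As a concrete illustration (added here; not part of the original text), the empirical risk (3) is just the average loss over the n training pairs. A minimal NumPy sketch, assuming labels and predictions in {−1, +1} and the zero-one style loss (1/2)|y − f| used later in (6):

import numpy as np

def empirical_risk(y, y_pred, loss=lambda y, f: 0.5 * np.abs(y - f)):
    # R_emp(alpha) = (1/n) sum_i L(y_i, f(x_i, alpha)), cf. (3); the default
    # loss is the zero-one loss (1/2)|y - f| for labels/predictions in {-1, +1}
    return np.mean(loss(np.asarray(y), np.asarray(y_pred)))

# toy usage with hypothetical labels and predictions
y_true = np.array([+1, -1, +1, +1])
y_hat = np.array([+1, -1, -1, +1])
print(empirical_risk(y_true, y_hat))   # 0.25: one of four points is misclassified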

Now, SLT provides probabilistic bounds on the discrepancy between the empirical and expected risk of any function [4]: if h < n is the VC dimension² of the class of functions that the learning machine can implement, then for all functions of that class, with a probability of at least 1 − η, the bound

R(α) < R_emp(α) + φ(h/n, log(η)/n)                                 (4)

holds, where the confidence term φ is defined as

φ(h/n, log(η)/n) = √[ (h(log(2n/h) + 1) − log(η/4)) / n ].         (5)

Here h is the capacity of the function space f(x, α) and n is the number of training data.
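To make the confidence term concrete, the following sketch (an added illustration, not from the original article) evaluates the right-hand side of (4) using (5) for given values of h, n and η:

import numpy as np

def vc_confidence(h, n, eta):
    # phi(h/n, log(eta)/n) from (5)
    return np.sqrt((h * (np.log(2.0 * n / h) + 1.0) - np.log(eta / 4.0)) / n)

def risk_bound(r_emp, h, n, eta=0.05):
    # right-hand side of (4); holds with probability at least 1 - eta
    return r_emp + vc_confidence(h, n, eta)

# hypothetical numbers: the bound grows with the capacity h and shrinks with n
print(risk_bound(r_emp=0.10, h=10, n=1000))
print(risk_bound(r_emp=0.10, h=100, n=1000))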

The ERM principle cannot, in practice, be applied directly [6]. Firstly, there can be infinitely many functions from the class f(x, α) that minimize the empirical risk. Secondly, it can lead to overfitting: even if the empirical risk R_emp(α) → 0 is achieved using a large VC-dimension h, the expected risk R(α) can still be very large because the confidence term grows monotonically with h. This implies that generalization to new data is not guaranteed.

Further, the space of functions f(x, α) is in practice very large³, so one generally considers smaller hypothesis spaces H [6]. Looking at (4), we find that to minimize R(α), instead of simply minimizing R_emp(α), good generalization is achieved by finding the best trade-off between the empirical risk and the complexity of the function space, given by the second term in the inequality (4) [1]. This leads to the Structural Risk Minimization Induction principle.

C. Structural Risk Minimization Induction (SRM) Principle

The SRM principle involves defining a nested sequence of hypothesis spaces H_1 ⊂ H_2 ⊂ · · · ⊂ H_N such that their corresponding capacities are finite and ordered in increasing fashion as h_1 ≤ h_2 ≤ · · · ≤ h_N < ∞. The idea then is to choose the function f(x, α_0) that minimizes the empirical risk (3) in the hypothesis space H_{n*} for which the bound on the structural risk (4) is minimized.

¹ The transduction principle is also used in many instances to reduce the number of labeled training data required. This scheme is called active learning [10].

² Numerous definitions of capacity quantities exist, the most popular one being the VC-dimension [1]. For the binary classification problem it is defined as the largest number h of points that can be separated into two classes in all possible 2^h ways using the functions of the learning machine.

³ For example, we can have f(x, α) ∈ L_2, the space of square-integrable continuous functions.

Fig. 1. Pictorial representation of the structure of Hypothesis spaces [4]

D. Summary of SLT

The problem of learning from examples is solved in three steps:

1) Define a loss function L(y, f(x, α)) that measures the error of predicting the output f(x, α) for input x when the actual output is y;

2) Define a nested sequence of hypothesis spaces H_n, n = 1, . . . , N, whose capacity h_n increases with n;

3) Minimize the empirical risk R_emp(α) in each H_n and choose, among the solutions, the one that minimizes the right hand side of the inequality (4); a schematic code sketch of this procedure follows the list.
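A schematic sketch of this three-step recipe is given below (added for illustration). The nested spaces are taken to be SVMs with polynomial kernels of increasing degree, and the capacities h_n are hypothetical stand-ins, since the true VC dimensions are not computed here:

import numpy as np
from sklearn.svm import SVC

# toy separable data (hypothetical example)
np.random.seed(0)
X = np.r_[np.random.randn(100, 2) + [2, 2], np.random.randn(100, 2) - [2, 2]]
y = np.array([+1] * 100 + [-1] * 100)

def phi(h, n, eta=0.05):
    # confidence term (5)
    return np.sqrt((h * (np.log(2.0 * n / h) + 1.0) - np.log(eta / 4.0)) / n)

# nested hypothesis spaces: polynomial kernels of increasing degree; the
# capacities h_n are stand-in values that merely increase with n, not true VC dimensions
best = None
for degree, h_n in [(1, 3), (2, 6), (3, 10)]:
    clf = SVC(kernel='poly', degree=degree).fit(X, y)
    r_emp = np.mean(clf.predict(X) != y)        # empirical (training) error
    bound = r_emp + phi(h_n, len(y))            # right-hand side of (4)
    if best is None or bound < best[0]:
        best = (bound, degree)
print("chosen hypothesis space: polynomial kernel of degree", best[1])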

Fig. 2. Graphical representation of (4) for fixed n [4]

The bound given by (4) forms part of the theoretical basis [4] for the Support Vector Learning algorithm described in Section 4.

III. THE BINARY CLASSIFICATION PROBLEM

A. Problem Setup

Given n input vectors x_i ∈ X = R^d, i = 1, . . . , n, and their corresponding labels y_i ∈ Y = {+1, −1}, i = 1, . . . , n, all of which are independently and identically distributed according to some probability distribution P(x, y) = P(x) · P(y|x), the goal is to find a decision function f(x, α_0), α_0 ∈ F : X → {+1, −1} that will predict the correct label

y_t = arg max_{y ∈ {−1,+1}} P(y|x_t)

for a test example x_t.

Fig. 3. Binary Classification with a decision function (potentially non-linear in the input space)

The search for this optimal prediction function f(x, α_0) is carried out in a structured hypothesis space using the SRM and ERM principles discussed in the previous section.

The loss function used for binary classification problems is the zero-one loss function:

L(y, f(x, α)) = |y − f(x, α)|                                      (6)

in which case the expected risk (2) is simply the probability of misclassification,

R(α) = ∫ (1/2)|y − f(x, α)| dP(x, y),   α ∈ F,

and the empirical risk is given by

R_emp(α) = (1/n) ∑_{i=1}^{n} (1/2)|y_i − f(x_i, α)|,   α ∈ F.

B. Linearly Separable Classification

In this section, we assume that X is a Euclidean Hilbert space with an inner product defined⁴.

The data given in (1) with y_i ∈ {+1, −1} is said to be linearly separable if there exists a vector w and a scalar b such that the following inequalities hold for all the elements of the training set X:

⟨w, x_i⟩ + b ≥ +1   if y_i = +1                                    (7a)
⟨w, x_i⟩ + b ≤ −1   if y_i = −1,                                   (7b)

or equivalently,

y_i(⟨w, x_i⟩ + b) ≥ 1.                                             (8)

A hyperplane is given by the following equation:

⟨w, x⟩ + b = 0.                                                    (9)

⁴ In Section 4 we see that even if X is not a Hilbert space, we can replace the inner product with a kernel function according to Mercer's Theorem.

A hyperplane that satisfies (8) is called a decision boundary, and the decision function f : X → Y is simply

f(x, w) = sign(⟨w, x⟩ + b).                                        (10)

Hence, we notice that a decision boundary is an affine subspace, with w the normal vector to the decision boundary and b the bias. From the example shown in Fig. 5(a) we see that there can be infinitely many possible decision boundaries.
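A direct transcription of the decision function (10), added for illustration; the values of w and b below are arbitrary examples:

import numpy as np

def decision(x, w, b):
    # f(x, w) = sign(<w, x> + b), cf. (10); np.sign returns 0 for a point exactly on the boundary
    return np.sign(np.dot(w, x) + b)

w = np.array([1.0, -2.0])   # hypothetical normal vector
b = 0.5                     # hypothetical bias
print(decision(np.array([3.0, 1.0]), w, b))   # +1.0: the point lies on the positive side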

In the case that the training data is non-separable, as shown in Fig. 5(b), it is possible to map the data to a higher-dimensional Hilbert space called a feature space, P, through a non-linear mapping Ψ : X → P such that we can find a linear decision function, as shown in Fig. 5(c). The decision function is then

f(x, w) = sign(⟨w, Ψ(x)⟩ + b).                                     (11)
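For instance, the map Ψ(x) = (x, x²) used in Fig. 5(c) turns one-dimensional data that is separable only by an interval into linearly separable data in two dimensions. A small sketch (an added illustration with a toy dataset):

import numpy as np

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.0, 3.0])   # 1D inputs (toy data)
y = np.where(np.abs(x) < 1.0, +1, -1)                  # +1 inside (-1, 1): not linearly separable in 1D

Psi = np.c_[x, x ** 2]                                 # Psi(x) = (x, x^2): lift to a 2D feature space

# in the feature space the classes are split by the line x^2 = 1,
# i.e. w = (0, -1), b = 1 gives a linear decision function as in (11)
w, b = np.array([0.0, -1.0]), 1.0
print(np.all(np.sign(Psi @ w + b) == y))               # True for this toy set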

IV. SUPPORT VECTOR MACHINES

The distance of a vector to a hyperplane is given by

d(x_i) = |⟨w, x_i⟩ + b| / ‖w‖.

The margin between the two classes is defined as the distance of the hyperplane to the closest vectors from each of the two classes. These vectors are called the support vectors. Thus, we can express the margin as the minimum of this distance over all x_i ∈ X,

M = min_{x_i ∈ X} d(x_i) = min_{x_i ∈ X} |⟨w, x_i⟩ + b| / ‖w‖.     (12)
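The margin (12) can be evaluated directly for a candidate hyperplane; in the sketch below (added for illustration) the points and the pair (w, b) are arbitrary examples:

import numpy as np

def margin(X, w, b):
    # M = min_i |<w, x_i> + b| / ||w||, cf. (12)
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -2.0], [-1.5, -3.0]])   # toy points
w, b = np.array([1.0, 1.0]), 0.0                                     # hypothetical hyperplane
print(margin(X, w, b))   # distance from the hyperplane to the closest point(s)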

Fig. 4. Maximum margin hyperplane using the support vectors (encircled) to define the optimal margin.

A. Primal Form

Of all the possible decision boundaries (hyperplanes), the optimal hyperplane is the one that maximizes the margin between the vectors of the two classes. Thus, the search for the maximal margin hyperplane leads to a minimax optimization problem:

max_{w,b} min_{x_i ∈ X} |⟨w, x_i⟩ + b| / ‖w‖                       (13)
subject to y_i(⟨w, x_i⟩ + b) ≥ 1, ∀(x_i, y_i) ∈ Z = X × Y.

Fig. 5. (a) 2D separable training space with possible decision boundaries. (b) Non-separable training space. (c) 2D feature space with non-linear transformation Ψ(x) = (x, x²), with possible decision boundaries.

We see from Fig. 4 that the optimal margin is given by 2/‖w‖. Thus, (13) can be reformulated as

max_{w,b} 2/‖w‖                                                    (14)
subject to y_i(⟨w, x_i⟩ + b) ≥ 1, ∀(x_i, y_i) ∈ X × Y.

This is equivalent to the problem

min_{w,b} (1/2)‖w‖²                                                (15)
subject to y_i(⟨w, x_i⟩ + b) ≥ 1, ∀(x_i, y_i) ∈ X × Y,

which is a quadratic programming (QP) problem and is called the Primal form. The solution of (15) gives us the optimal parameters of the hyperplane, which give us the optimal decision function:

f(x, w*) = sign(⟨w*, x⟩ + b*).                                     (16)

For non-separable data, if a map Ψ is known that transforms the non-separable training space to a separable feature space, then the QP problem (15) becomes

min_{w,b} (1/2)‖w‖²                                                (17)
subject to y_i(⟨w, Ψ(x_i)⟩ + b) ≥ 1, ∀(x_i, y_i) ∈ X × Y,

and the decision function (16) becomes

f(x, w*) = sign(⟨w*, Ψ(x)⟩ + b*).                                  (18)

This QP problem is known as the Primal form and can be solved efficiently for linearly separable training data or for training spaces with a known map Ψ. An example of a solution on a training dataset of 40 random linearly separable vectors is illustrated in Fig. 4.
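In practice the primal QP (15) is rarely solved by hand-written code. For separable data, a hard-margin solution can be approximated with scikit-learn's SVC using a linear kernel and a very large penalty C; the sketch below (an illustration under that assumption) mirrors the 40-vector example mentioned above:

import numpy as np
from sklearn import svm

# 40 random linearly separable vectors, as in the example of Fig. 4 (hypothetical data)
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) + [2, 2], np.random.randn(20, 2) - [2, 2]]
y = [+1] * 20 + [-1] * 20

clf = svm.SVC(kernel='linear', C=1e6)     # very large C approximates the hard-margin problem (15)
clf.fit(X, y)

w_star, b_star = clf.coef_[0], clf.intercept_[0]   # parameters of the separating hyperplane
print(w_star, b_star)
print(2.0 / np.linalg.norm(w_star))                # optimal margin 2 / ||w*||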

B. Dual Form

The primal form discussed previously is sufficient for linearly separable training data or for data for which a map Ψ is known that yields a separable feature space. For the case of non-separable data for which such a map Ψ is unknown, we need to formulate the Dual form, for which the cost functional does not depend on the hyperplane parameters. Also, in general, the dual form of the QP problem is easier to solve numerically.

The dual form is obtained by constructing the Lagrangian of (15) using non-negative Lagrange multipliers:

L(w, b, λ) = (1/2)‖w‖² − ∑_{i=1}^{n} λ_i (y_i(⟨w, x_i⟩ + b) − 1),   λ_i ≥ 0.   (19)

The QP problem (15) is solved using the saddle point of the Lagrangian - maximizing L(w, b, λ) with respect to λ and minimizing L(w, b, λ) with respect to (w, b). The minimum is given by:

∂L(w, b, λ)/∂w |_{w = w*} = w* − ∑_{i=1}^{n} λ_i y_i x_i = 0       (20a)

∂L(w, b, λ)/∂b |_{b = b*} = ∑_{i=1}^{n} λ_i y_i = 0                (20b)

Thus, from (20a) we get the solution for the optimal hyperplane as a linear combination of the training vectors:

w* = ∑_{i=1}^{n} λ_i y_i x_i.                                      (21)

Substituting (20b) and (21) into (19) and maximizing the Lagrangian, we get the dual form,

max_λ ∑_{i=1}^{n} λ_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} λ_i λ_j y_i y_j ⟨x_i, x_j⟩   (22)
subject to ∑_{i=1}^{n} λ_i y_i = 0,   λ_i ≥ 0.

According to the Kuhn-Tucker theorem of optimization [4], at the saddle point only those Lagrange multipliers λ_i are non-zero which satisfy the equality constraint in (15), i.e.,

λ*_i (y_i(⟨w*, x_i⟩ + b*) − 1) = 0,   i = 1, . . . , n             (23)
⇒ λ*_i ≠ 0 only for y_i(⟨w*, x_i⟩ + b*) = 1.

The vectors corresponding to these λ*_i > 0 are the support vectors, and it can be seen from (23) that they lie exactly on the margin.

Thus, since the optimal hyperplane lies at equal distances from the margins of the two classes, we have

(⟨w*, x_A⟩ + b*) / ‖w*‖ = − (⟨w*, x_B⟩ + b*) / ‖w*‖.

Thus, we have the optimal bias,

b* = −(1/2) ∑_{i=1}^{n} λ_i y_i (⟨x_i, x_A⟩ + ⟨x_i, x_B⟩)          (24)

where x_A and x_B are support vectors from each of the two classes respectively. Thus, the decision function is

f(x, w) = sign(⟨w*, x⟩ + b*)                                       (25)

with w* given by (21) and b* given by (24).
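The dual problem (22) and the recovery equations (21) and (23) can be checked numerically on a tiny dataset. The sketch below (an added illustration, not the solver used in the article) maximizes the dual with SciPy's SLSQP routine; the bias is recovered from the margin conditions (23), which is equivalent to (24):

import numpy as np
from scipy.optimize import minimize

# toy separable data (hypothetical example)
np.random.seed(0)
X = np.r_[np.random.randn(10, 2) + [2, 2], np.random.randn(10, 2) - [2, 2]]
y = np.array([+1.0] * 10 + [-1.0] * 10)
n = len(y)

G = (X @ X.T) * np.outer(y, y)                 # [G]_ij = y_i y_j <x_i, x_j>

def neg_dual(lam):
    # negative of the dual objective (22): sum_i lam_i - (1/2) lam^T G lam
    return -(lam.sum() - 0.5 * lam @ G @ lam)

cons = {'type': 'eq', 'fun': lambda lam: lam @ y}      # sum_i lam_i y_i = 0
res = minimize(neg_dual, np.zeros(n), method='SLSQP',
               bounds=[(0.0, None)] * n, constraints=cons)
lam = res.x

w_star = (lam * y) @ X                         # (21): w* = sum_i lam_i y_i x_i
sv = lam > 1e-6                                # support vectors, cf. (23)
b_star = np.mean(y[sv] - X[sv] @ w_star)       # bias from the margin conditions (23)

# if the solver converged, the training points should be classified correctly
print(np.all(np.sign(X @ w_star + b_star) == y))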

C. The Kernel Trick

We now try to map non-separable data into separable high-dimensional (possibly infinite-dimensional) feature spaces without explicitly relying on a nonlinear map Ψ.

Define a functional K : X × X → R such that it represents an inner product in some arbitrary feature space H:

K(x_i, x_j) = ⟨Ψ(x_i), Ψ(x_j)⟩_H.                                  (26)

We can then represent the dual form (22) in the feature space H by replacing all the inner products with the kernel function:

max_λ ∑_{i=1}^{n} λ_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} λ_i λ_j y_i y_j K(x_i, x_j)   (27)
subject to ∑_{i=1}^{n} λ_i y_i = 0,   λ_i ≥ 0,

or equivalently, in matrix form,

max_λ 1ᵀλ − (1/2) λ_yᵀ Γ λ_y                                       (28)
subject to λᵀy = 0,   λ_i ≥ 0,

where Γ is the Gram matrix defined as

[Γ]_ij = K(x_i, x_j) = ⟨Ψ(x_i), Ψ(x_j)⟩_H                          (29)

and λ_y = [λ_1 y_1, . . . , λ_n y_n]ᵀ.

By Mercer's Theorem⁵, we get the necessary and sufficient conditions to represent the kernel K : X × X → R as an inner product in some feature space H with a map Ψ as in (26).

5Mercer’s Theorem says that a symmetric function K(x1,x2) can beexpressed as an inner product, K(xi,x j) =

⟨Ψ(xi),Ψ(x j)

⟩for some Ψ if

and only if K(x1,x2) is positive semi-definite [2].

This procedure is termed the kernel trick, and it lets us use the SVM methodology on non-separable data as well. However, the resulting decision function obtained is not always robust, and we may have to settle for a soft-margin hyperplane [4], wherein classification of new data is not error-proof.

Thus, using the kernel trick, the optimal decision function (25) becomes

f(x, w) = sign( ∑_{i=1}^{n} λ*_i y_i K(x_i, x) + b* ).

Recall from (23) that λ_i ≠ 0 only for the support vectors. Hence, we can simplify the decision function by considering only x_i ∈ SV, the set of support vectors:

f(x, w) = sign( ∑_{x_i ∈ SV} λ*_i y_i K(x_i, x) + b* )             (30)

where

b* = −(1/2) ∑_{i=1}^{n} λ*_i y_i (K(x_i, x_A) + K(x_i, x_B))

and x_A and x_B are support vectors from each of the two classes respectively.
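With a library SVM, the quantities appearing in (30) are available after fitting. The sketch below (an added illustration) rebuilds the kernel decision function from scikit-learn's stored dual coefficients λ*_i y_i, support vectors and intercept, and compares it with the library's own prediction:

import numpy as np
from sklearn import svm

# toy data (hypothetical example)
np.random.seed(0)
X = np.r_[np.random.randn(50, 2) + [2, 2], np.random.randn(50, 2) - [2, 2]]
y = np.array([+1] * 50 + [-1] * 50)

gamma = 0.5
clf = svm.SVC(kernel='rbf', gamma=gamma).fit(X, y)

def rbf(A, x):
    # RBF kernel K(a, x) = exp(-gamma * ||a - x||^2), evaluated row-wise
    return np.exp(-gamma * np.sum((A - x) ** 2, axis=-1))

x_new = np.array([0.5, 0.5])
# (30): f(x) = sign( sum_{x_i in SV} lam*_i y_i K(x_i, x) + b* );
# scikit-learn stores lam*_i y_i in clf.dual_coef_ and b* in clf.intercept_
score = clf.dual_coef_[0] @ rbf(clf.support_vectors_, x_new) + clf.intercept_[0]
print(int(np.sign(score)), clf.predict([x_new])[0])    # the two labels should agree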

D. Types of Kernel functions

The main types of kernel functions used in practice are the following (a small numerical illustration follows the list):

1) Linear: ⟨x, x′⟩
2) Polynomial: (γ⟨x, x′⟩ + r)^d
3) RBF: exp(−γ‖x − x′‖²)
4) Sigmoid: tanh(γ⟨x, x′⟩ + r)
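These kernels are one-line expressions; the sketch below (added for illustration) evaluates each of them for a pair of example vectors, with hypothetical hyperparameter values γ, r and d:

import numpy as np

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])   # example vectors
gamma, r, d = 0.5, 1.0, 3                             # hypothetical hyperparameters

linear = x @ xp
polynomial = (gamma * (x @ xp) + r) ** d
rbf = np.exp(-gamma * np.linalg.norm(x - xp) ** 2)
sigmoid = np.tanh(gamma * (x @ xp) + r)
print(linear, polynomial, rbf, sigmoid)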

V. APPLICATION OF SVM - AN EXAMPLE

Fig. 6. 2D binary classification using a linear kernel SVM

A two-dimensional binary classification problem for a training dataset containing 200 linearly separable vectors has been considered. Following the methodology described in the previous sections, an optimal hyperplane is constructed using a linear kernel function after finding the support vectors. The result of the zero-error classification is shown in Fig. 6, with the support vectors encircled. The scikit-learn library [5] is used to perform this simulation (the full script is given in Appendix A).

VI. CONCLUSION

In the course of this discussion on Statistical Learning Theory (SLT) and the derived learning algorithm, Support Vector Machines (SVM), we have covered a variety of concepts discussed in class. The main concepts that were used in the mathematical formulation of SLT and SVM were inner products, affine spaces/linear varieties, transformation mappings, convex optimization/quadratic programming, the projection theorem on Hilbert spaces (to find the distance of support vectors to the hyperplane) and the primal/dual formulation of optimization problems, among many others. In totality, this exercise has cemented the concepts discussed in class and has also provided an opportunity to understand a popular topic of interest, namely, support vector classification.

REFERENCES

[1] Vapnik, Vladimir N. "An overview of statistical learning theory." IEEE Transactions on Neural Networks 10.5 (1999): 988-999.

[2] Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. "A training algorithm for optimal margin classifiers." Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992.

[3] Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine Learning 20.3 (1995): 273-297.

[4] Schölkopf, Bernhard. "Support vector learning." (1997).

[5] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." The Journal of Machine Learning Research 12 (2011): 2825-2830.

[6] Evgeniou, Theodoros, Massimiliano Pontil, and Tomaso Poggio. "Statistical learning theory: A primer." International Journal of Computer Vision 38.1 (2000): 9-13.

[7] Schölkopf, Bernhard. "Statistical learning and kernel methods." (2000).

[8] Bennett, Kristin P., and Erin J. Bredensteiner. "Duality and geometry in SVM classifiers." ICML. 2000.

[9] Zhou, Dengyong, et al. "Global geometry of SVM classifiers." Institute of Automation, Chinese Academy of Sciences, Tech. Rep. AI Lab (2002).

[10] Tong, Simon, and Daphne Koller. "Support vector machine active learning with applications to text classification." The Journal of Machine Learning Research 2 (2002): 45-66.

APPENDIX A - PYTHON SOURCE CODE

File: /media/udrive/Acad/Fall 2013/ME510/Project/svm.py

import numpy as np
import pylab as pl
from sklearn import svm

# we create 200 separable points
np.random.seed(0)
X = np.r_[np.random.randn(100, 2) + [2, 2], np.random.randn(100, 2) - [2, 2]]
Y = [0] * 100 + [1] * 100

# fit the model
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)

# get the optimal hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]

# plot the margins that pass through the support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

# plot the line, the points, and the support vectors to the plane
pl.hot()
pl.plot(xx, yy, 'k-')
pl.plot(xx, yy_down, 'k--')
pl.plot(xx, yy_up, 'k--')

pl.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80, facecolors='none')
pl.scatter(X[:, 0], X[:, 1], c=Y)

pl.axis('tight')
pl.xlabel('x_1')
pl.ylabel('x_2')
pl.show()
Page 8: A quick introduction to Statistical Learning Theory and Support Vector Machines

988 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999

An Overview of Statistical Learning TheoryVladimir N. Vapnik

Abstract—Statistical learning theory was introduced in the late1960’s. Until the 1990’s it was a purely theoretical analysis of theproblem of function estimation from a given collection of data.In the middle of the 1990’s new types of learning algorithms(called support vector machines) based on the developed theorywere proposed. This made statistical learning theory not onlya tool for the theoretical analysis but also a tool for creatingpractical algorithms for estimating multidimensional functions.This article presents a very general overview of statistical learningtheory including both theoretical and algorithmic aspects of thetheory. The goal of this overview is to demonstrate how theabstract learning theory established conditions for generalizationwhich are more general than those discussed in classical statis-tical paradigms and how the understanding of these conditionsinspired new algorithmic approaches to function estimation prob-lems. A more detailed overview of the theory (without proofs) canbe found in Vapnik (1995). In Vapnik (1998) one can find detaileddescription of the theory (including proofs).

I. SETTING OF THE LEARNING PROBLEM

I N this section we consider a model of the learning and showthat analysis of this model can be conducted in the general

statistical framework of minimizing expected loss using ob-served data. We show that practical problems such as patternrecognition, regression estimation, and density estimation areparticular case of this general model.

A. Function Estimation Model

The model of learning from examples can be describedusing three components:

1) a generator of random vectors, drawn independentlyfrom a fixed but unknown distribution ;

2) a supervisor that returns an output vectorfor everyinput vector , according to a conditional distributionfunction1 , also fixed but unknown;

3) a learning machine capable of implementing a set offunctions .

The problem of learning is that of choosing from the givenset of functions the one which predicts thesupervisor’s response in the best possible way. The selectionis based on a training set ofrandom independent identicallydistributed (i.i.d.) observations drawn according to

(1)

Manuscript received January 11, 1999; revised May 20, 1999.The author is with AT&T Labs-Research, Red Bank, NJ 07701 USA.Publisher Item Identifier S 1045-9227(99)07267-7.1This is the general case which includes a case where the supervisor uses

a functiony = f(x):

B. Problem of Risk Minimization

In order to choose the best available approximation to thesupervisor’s response, one measures theloss or discrepancy

between the response of the supervisor toa given input and the response provided by thelearning machine. Consider the expected value of the loss,given by therisk functional

(2)

The goal is to find the function which mini-mizes the risk functional (over the class of functions

in the situation where the joint probabil-ity distribution is unknown and the only availableinformation is contained in the training set (1).

C. Three Main Learning Problems

This formulation of the learning problem is rather general.It encompasses many specific problems. Below we considerthe main ones: the problems of pattern recognition, regressionestimation, and density estimation.

The Problem of Pattern Recognition:Let the supervisor’soutput take on only two values and let

be a set ofindicator functions (functionswhich take on only two values zero and one). Consider thefollowing loss-function:

ifif

(3)

For this loss function, the functional (2) provides the proba-bility of classification error (i.e., when the answersgivenby supervisor and the answers given by indicator function

differ). The problem, therefore, is to find the functionwhich minimizes the probability of classification errors whenprobability measure is unknown, but the data (1) aregiven.

The Problem of Regression Estimation:Let the supervi-sor’s answer be a real value, and let be aset of real functions which contains theregression function

It is known that if then the regression functionis the one which minimizes the functional (2) with the thefollowing loss-function:

(4)

Thus the problem of regression estimation is the problemof minimizing the risk functional (2) with the loss function

1045–9227/99$10.00 1999 IEEE

References

UPCLab2013
Typewritten text
REF. [1]
Page 9: A quick introduction to Statistical Learning Theory and Support Vector Machines

VAPNIK: OVERVIEW OF STATISTICAL LEARNING THEORY 989

(4) in the situation where the probability measure isunknown but the data (1) are given.

The Problem of Density Estimation:Finally, consider theproblem of density estimation from the set of densities

For this problem we consider the followingloss-function:

(5)

It is known that desired density minimizes the risk functional(2) with the loss-function (5). Thus, again, to estimate thedensity from the data one has to minimize the risk-functionalunder the condition where the corresponding probability mea-sure is unknown but i.i.d. data

are given.The General Setting of the Learning Problem:The general

setting of the learning problem can be described as follows.Let the probability measure be defined on the spaceConsider the set of functions The goal is: tominimize the risk functional

(6)

if probability measure is unknown but an i.i.d. sample

(7)

is given.The learning problems considered above are particular cases

of this general problem ofminimizing the risk functional (6) onthe basis of empirical data(7), where describes a pairand is the specific loss function [for example, oneof (3), (4), or (5)]. Below we will describe results obtainedfor the general statement of the problem. To apply it forspecific problems one has to substitute the corresponding loss-functions in the formulas obtained.

D. Empirical Risk Minimization Induction Principle

In order to minimize the risk functional (6), for an unknownprobability measure the following induction principle isusually used.

The expected risk functional is replaced by theempirical risk functional

(8)

constructed on the basis of the training set (7).The principle is to approximate the function which

minimizes risk (6) by the function which minimizesempirical risk (8). This principle is called the empirical riskminimization induction principle (ERM principle).

E. Empirical Risk Minimization Principleand the Classical Methods

The ERM principle is quite general. The classical methodsfor solving a specific learning problem, such as the leastsquares method in the problem of regression estimation orthe maximum likelihood method in the problem of densityestimation are realizations of the ERM principle for thespecific loss functions considered above.

Indeed, in order to specify the regression problem oneintroduces an -dimensional variable

and uses loss function (4). Using this lossfunction in the functional (8) yelds the functional

which one needs to minimize in order to find the regressionestimate (i.e., the least square method).

In order to estimate a density function from a given set offunctions one uses the loss function (5). Putting thisloss function into (8) one obtains the maximum likelihoodmethod: the functional

which one needs to minimize in order to find the approxima-tion to the density.

Since the ERM principle is a general formulation of theseclassical estimation problems, any theory concerning the ERMprinciple applies to the classical methods as well.

F. Four Parts of Learning Theory

Learning theory has to address the following four questions.

1) What are the conditions for consistency of the ERMprinciple?

To answer this question one has to specify theneces-sary and sufficientconditions for convergence in proba-bility 2 of the following sequences of the random values.

a) The values of risks converging to theminimal possible value of the risk [where

are the expected risks forfunctions each minimizing the empiricalrisk

(9)

b) The values of obtained empirical risksconverging to the

minimal possible value of the risk

(10)

2Convergence in probability of valuesR(�`) means that for any" > 0and for any� > 0 there exists a number0 = `0("; �) such, that for any` > `0 with probability at least1� � the inequality

R(�`)�R(�0) < "

holds true.

References

Page 10: A quick introduction to Statistical Learning Theory and Support Vector Machines

990 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999

Equation (9) shows that solutions found using ERMconverge to the best possible one. Equation (10) showsthat values of empirical risk converge to the value ofthe smallest risk.

2) How fast does the sequence of smallest empirical riskvalues converge to the smallest actual risk?In otherwords what is the rate of generalization of a learningmachine that implements the empirical risk minimizationprinciple?

3) How can one control the rate of convergence (the rate ofgeneralization) of the learning machine?

4) How can one construct algorithms that can control therate of generalization?

The answers to these questions form the four parts oflearning theory:

1) the theory of consistency of learning processes;2) the nonasymptotic theory of the rate of convergence of

learning processes;3) the theory of controlling the generalization of learning

processes;4) the theory of constructing learning algorithms.

II. THE THEORY OF CONSISTENCY OFLEARNING PROCESSES

The theory of consistency is an asymptotic theory. It de-scribesthe necessary and sufficient conditionsfor convergenceof the solutions obtained using the proposed method to thebest possible as the number of observations is increased. Thequestion arises:

Why do we need a theory of consistency if our goal is toconstruct algorithms for a small (finite) sample size?

The answer is:We need a theory of consistency because it provides not

only sufficient but necessary conditions for convergence ofthe empirical risk minimization inductive principle. Thereforeany theory of the empirical risk minimization principle mustsatisfy the necessary and sufficient conditions.

In this section, we introduce the main capacity concept (theso-called Vapnik–Cervonenkis (VC) entropy which definesthe generalization ability of the ERM principle. In the nextsections we show that the nonasymptotic theory of learning isbased on different types of bounds that evaluate this conceptfor a fixed amount of observations.

A. The Key Theorem of the Learning Theory

The key theorem of the theory concerning the ERM-basedlearning processesis the following [27].

The Key Theorem:Let be a set of functionsthat has a bounded loss for probability measure

Then for the ERM principle to be consistent it is necessary andsufficient that the empirical risk convergeuniformlyto the actual risk over the set as follows:

(11)

This type of convergence is called uniform one-sided conver-gence.

In other words, according to the Key theorem the conditionsfor consistency of the ERM principle are equivalent to theconditions for existence of uniform one-sided convergence(11).

This theorem is called the Key theorem because it assertsthat any analysis of the convergence properties of the ERMprinciple must be aworst case analysis. The necessary condi-tion for consistency (not only the sufficient condition) dependson whether or not the deviation for the worst function overthe given set of of functions

converges in probability to zero.From this theorem it follows that the analysis of the ERM

principle requires an analysis of the properties of uniformconvergence of the expectations to their probabilities over thegiven set of functions.

B. The Necessary and Sufficient Conditionsfor Uniform Convergence

To describe the necessary and sufficient condition for uni-form convergence (11), we introduce a concept calledtheentropy of the set of functions on the sampleof size

We introduce this concept in two steps: first for sets ofindicator functions and then for sets of real-valued functions.

Entropy of the Set of Indicator Functions:Letbe a set of indicator functions, that is the functions

which take on only the values zero or one. Consider a sample

(12)

Let us characterize the diversity of this set of functionson the given sample by a quantity

that represents the number of differentseparations of this sample that can be obtained using functionsfrom the given set of indicator functions.

Let us write this in another form. Consider the set of-dimensional binary vectors

that one obtains when takes various values from Thengeometrically speaking is the number of dif-ferent vertices of the-dimensional cube that can be obtainedon the basis of the sample and the set of functions

Let us call the value

the random entropy. The random entropy describes the diver-sity of the set of functions on the given data.is a random variable since it was constructed using randomi.i.d. data. Now we consider the expectation of the randomentropy over the joint distribution function

References

Page 11: A quick introduction to Statistical Learning Theory and Support Vector Machines

VAPNIK: OVERVIEW OF STATISTICAL LEARNING THEORY 991

We call this quantity the entropy of the set of indicatorfunctions , on samples of size It depends onthe set of functions , , the probability measure

, and the number of observationsThe entropy describesthe expected diversity of the given set of indicator functionson the sample of size

The main result of the theory of consistency for the pat-tern recognition problem (the consistency for indicator lossfunction) is the following theorem [24].

Theorem: For uniform two-sided convergence of the fre-quencies to their probabilities3

(13)

it is necessary and sufficient that the equality

(14)

hold.Slightly modifying the condition (14) one can obtain the

necessary and sufficient condition for one-sided uniform con-vergence (11).

Entropy of the Set of Real Functions:Now we generalizethe concept of entropy for sets of real-valued functions. Let

be a set of bounded loss functions.Using this set of functions and the training set (12) onecan construct the following set of-dimensional real-valuedvectors

(15)

This set of vectors belongs to the-dimensional cube withthe edge and has a finite -net4 in the metric Let

be the number of elements of theminimal -net of the set of vectors

The logarithm of the (random) value

is called the random VC-entropy5 of the set of functionson the sample The expectation

of the random VC-entropy

is called theVC-entropyof the set of functionson the sample of the size Here expectation

3The sets of indicator functionsR(�) defines probability andRemp(�)defines frequency.

4The set of vectorsq(�); � 2 � has minimal"-net q(�1); � � � ; q(�N )if: 1. There existN = N�("; z1; � � � ; z`) vectorsq(�1); � � � ; q(�N); suchthat for any vectorq(��); �� 2 � one can find among theseN vectors oneq(�r) which is"-close to this vector (in a given metric). For aC metric thatmeans

�(q(��); q(�r)) = max1�i�`

jQ(zi��)�Q(zi; �r)j � ":

N is minimal number of vectors which possess this property.5Note that VC-entropy is different from classical metrical"-entropy

H�cl(") = lnN�(")

where N�(") is cardinality of the minimal"-net of the set of functionsQ(z; �); � 2 �:

is taken with respect to product-measure

The main results of the theory of uniform convergence of theempirical risk to actual risk for bounded loss function includesthe following theorem [24].

Theorem: For uniform two-sided convergence of the em-pirical risks to the actual risks

(16)

it is necessary and sufficient that the equality

(17)

be valid.Slightly modifying the condition (17) one can obtain the

necessary and sufficient condition for one-sided uniform con-vergence (11).

According to the key assertion this implies the necessary andsufficient conditions for consistency of the ERM principle.

C. Three Milestones in Learning Theory

In this section, for simplicity, we consider a set of indicatorfunctions (i.e., we consider the problem ofpattern recognition). The results obtained for sets of indicatorfunctions can be generalized for sets of real-valued functions.

In the previous section we introduced the entropy for setsof indicator functions

Now, we consider two new functions that are constructedon the basis of the values the annealed VC-entropy

and thegrowth function

These functions are determined in such a way that for anythe inequalities

are valid. On the basis of these functions, the three mainmilestones in statistical learning theory are constructed.

In the previous section, we introduced the equation

describing thenecessary and sufficient conditionfor consis-tency of the ERM principle. This equation is the first milestonein learning theory: any machine minimizing empirical riskshould satisfy it.

However, this equation says nothing about the rate ofconvergence of obtained risks to the minimal one

It is possible that the ERM principle is consistent buthas arbitrary slow asymptotic rate of convergence.

References

Page 12: A quick introduction to Statistical Learning Theory and Support Vector Machines

992 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999

The question is:Under what conditions is the asymptotic rate of convergence

fast?We say that the asymptotic rate of convergence is fast if for

any the exponential bound

holds true, where is some constant.The equation

describes thesufficientcondition for fast convergence.6 It is thesecond milestone in statistical learning theory: it guarantees afast asymptotic rate of convergence.

Note that both the equation describing the necessary andsufficient condition for consistency and the one that describesthe sufficient condition for fast convergence of the ERMmethod are valid for agiven probability measure (bothVC-entropy and VC-annealed entropy are con-structed using this measure). However our goal is to constructa learning machine for solving many different problems (i.e.,for many different probability measures).

The question is:Under what conditions is the ERM principle consistent and

rapidly converging,independently of the probability measure?The following equation describes thenecessary and suffi-

cient conditionsfor consistency of ERM for any probabilitymeasure

This condition is also sufficient for fast convergence.This equation is the third milestone in statistical learning

theory. It describes the conditions under which the learningmachine implementing ERM principle has an asymptotic highrate of convergence independently of the problem to be solved.

These milestones form a foundation for constructing bothdistribution independent bounds and rigorous distribution de-pendent bounds for the rate of convergence of learning ma-chines.

III. B OUNDS ON THE RATE OF CONVERGENCE

OF THE LEARNING PROCESSES

In order to estimate the quality of the ERM method fora given sample size it is necessary to obtain nonasymptoticbounds on the rate of uniform convergence.

A nonasymptotic bound of the rate of convergence canbe obtained using a new capacity concept, called the VCdimension, which allows us to obtain a constructive boundfor the growth function.

The concept of VC-dimension is based on a remarkableproperty of the growth-function .

6The necessity of this condition for fast convergence is open question.

A. The Structure of the Growth Function

Theorem: Any growth function either satisfies the equality

or is bounded by the inequality

where is an integer for which

In other words the growth function will be either a linearfunction or will be bounded by a logarithmic function. (Forexample, it cannot be of the form

We say that the VC dimension of the set of indicatorfunctions is infinite if the Growth functionfor this set of functions is linear.

We say that the VC dimension of the set of indicatorfunctions is finite and equals if the growthfunction is bounded by a logarithmic function with coefficient

The finiteness of the VC-dimension of the set of indicatorfunctions implemented by the learning machine forms thenecessary and sufficient condition for consistency of the ERMmethod independent of probability measure. Finiteness of VC-dimension also implies fast convergence.

B. Equivalent Definition of the VC Dimension

In this section, we give an equivalent definition of the VCdimension of sets of indicator functions and then we generalizethis definition for sets of real-valued functions.

The VC Dimension of a Set of Indicator Functions:TheVC-dimension of a set of indicator functionsis the maximum number of vectors which canbe separated in all possible ways using functions of thisset7 (shatteredby this set of functions). If for any thereexists a set of vectors which can be shattered by the set

then the VC-dimension is equal to infinity.The VC Dimension of a Set of Real-Valued Functions:Let

be a set of real-valued functionsbounded by constants and can approach andcan approach

Let us consider along with the set of real-valued functionsthe set of indicator functions

(18)

where is some constant, is the step function

ifif

The VC dimension of the set of real valued functionsis defined to be the VC-dimension of the

set of indicator functions (18).7Any indicator function separates a set of vectors into two subsets: the

subset of vectors for which this function takes value zero and the subset ofvectors for which it takes value one.

References

Page 13: A quick introduction to Statistical Learning Theory and Support Vector Machines

VAPNIK: OVERVIEW OF STATISTICAL LEARNING THEORY 993

C. Two Important Examples

Example 1:

1) The VC-dimension of the set oflinear indicator func-tions

in -dimensional coordinate space isequal to , since using functions of this set onecan shatter at most vectors. Here is the stepfunction, which takes value one, if the expression in thebrackets is positive and takes value zero otherwise.

2) The VC-dimension of the set oflinear functions

in -dimensional coordinate space isalso equal to because the VC-dimension ofcorresponding linear indicator functions is equal to(using instead of does not changes the set ofindicator functions).

Example 2: We call a hyperplane

the -margin separating hyperplane if it classifies vectorsas follows:

ifif

(classifications of vectors that fall into the marginare undefined).

Theorem: Let vectors belong to a sphere of radius. Then the set of -margin separating hyperplanes has the

VC dimension bounded by the inequality

These examples show that in general the VC dimensionof the set of hyperplanes is equal to , where isdimensionality of input space. However, the VC dimensionof the set of -margin separating hyperplanes (with a largevalue of margin can be less than This fact will playan important role for constructing new function estimationmethods.

D. Distribution Independent Bounds for the Rate ofConvergence of Learning Processes

Consider sets of functions which possess a finite VC-dimension We distinguish between two cases:

1) the case where the set of loss functionsis a set oftotally bounded functions;

2) the case where the set of loss functionsis not necessarily a set of totally bounded functions.

Case 1—The Set of Totally Bounded Functions:Withoutrestriction in generality, we assume that

(19)

The main result in the theory of bounds for sets of totallybounded functions is the following [20]–[22].

Theorem: With probability at least , the inequality

(20)

holds true simultaneously for all functions of the set (19),where

(21)

For the set of indicator functions,This theorem provides bounds for the risks of all func-

tions of the set (18) [including the function whichminimizes empirical risk (8)]. The bounds follow from thebound on uniform convergence (13) for sets of totally boundedfunctions that have finite VC dimension.

Case 2—The Set of Unbounded Functions:Consider theset of (nonnegative) unbounded functions

It is easy to show (by constructing an example) that,without additional information about the set of unboundedfunctions and/or probability measures, it is impossible toobtain an inequality of type (20). Below we use the followinginformation:

(22)

where is some fixed constant.8

The main result for the case of unbounded sets of lossfunctions is the following [20]–[22].

Theorem: With probability at least the inequality

(23)

holds true simultaneously for all functions of the set, whereis determined by (22),

The theorem bounds the risks for all functions of the set(including the function

8This inequality describes some general properties of distribution functionsof the random variables�� = Q(z; �), generated by theP (z): It describes the“tails of distributions” (the probability of big values for the random variables��): If the inequality (22) withp > 2 holds, then the distributions have so-called “light tails” (large values do not occurs very often). In this case rapidconvergence is possible. If, however, (22) holds only forp < 2 (large valuesof the random variables�� occur rather often) then the rate of convergencewill be small (it will be arbitrarily small ifp is sufficiently close to one).

References

Page 14: A quick introduction to Statistical Learning Theory and Support Vector Machines

994 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999

E. Problem of Constructing Rigorous(Distribution Dependent) Bounds

To construct rigorous bounds for the rate of convergenceone has to take into account information about probabilitymeasure. Let be a set of all probability measures and let

be a subset of the set We say that one has priorinformation about an unknown probability measure ifone knows the set of measuresthat contains

Consider the following generalization of the growth func-tion:

For indicator functions and for the extremecase where the generalized growth functioncoincides with the growth function For another extremecase where contains only one function the generalizedgrowth function coincides with the annealed VC-entropy.

The following assertion is true [20], [26].Theorem: Suppose that a set of loss-functions is bounded

Then for sufficiently large the following inequality:

holds true.From this bound it follows that for sufficiently largewith

probability simultaneously for all (including theone that minimizes the empirical risk) the following inequalityis valid:

However, this bound is nonconstructive because theorydoes not specify a method to evaluate the generalized growthfunction. To make this bound constructive and rigorous onehas to estimate the generalized growth function for a givenset of loss-functions and a given set of probability measures.This is one of the main subjects of the current learning theoryresearch.

IV. THEORY FOR CONTROLLING THE

GENERALIZATION OF LEARNING MACHINES

The theory for controlling the generalization of a learningmachine is devoted to constructing an induction principle forminimizing the risk functional which takes into account thesize of the training set(an induction principle for a “small”

sample size).9 The goal is to specify methods which areappropriate for a given sample size.

A. Structural Risk Minimization Induction Principle

The ERM principle is intended for dealing with a largesample size. Indeed, the ERM principle can be justified byconsidering the inequalities (20). When is large, the secondsummand on the right hand side of inequality (20) becomessmall. The actual risk is then close to the value of the empiricalrisk. In this case, a small value of the empirical risk providesa small value of (expected) risk.

However, if is small, then even a smalldoes not guarantee a small value of risk. In this case theminimization for requires a new principle, based onthe simultaneous minimization of two terms in (20) one ofwhich depends on the value of the empirical risk while thesecond depends on the VC-dimension of the set of functions.To minimize risk in this case it is necessary to find a methodwhich, along with minimizing the value of empirical risk,controls the VC-dimension of the learning machine.

The following principle, which is called the principle ofstructural risk minimization (SRM), is intended to minimizethe risk functional with respect to both empirical risk andVC-dimension of the set of functions.

Let the set of functions be provided witha structure: so that is composed of the nested subsets offunctions such that

(24)

andAn admissible structureis one satisfying the following three

properties.

1) The set is everywhere dense in2) The VC-dimension of each set of functions is

finite.3) Any element of the structure contains totally bounded

functions

The SRM principle suggests that for a given set of obser-vations choose the element of structure , where

and choose the particular function from for whichthe guaranteed risk (20) is minimal.

The SRM principle actually suggests atradeoff betweenthe quality of the approximation and the complexity of theapproximating function. (As increases, the minima of em-pirical risk are decreased; however, the term responsible forthe confidence interval [summand in (20)] is increased. TheSRM principle takes both factors into account.)

The main results of the theory of SRM are the following[9], [22].

Theorem: For any distribution function the SRM methodprovides convergence to the best possible solution with prob-ability one.

In other words SRM method is universally strongly con-sistent.

9The sample size is considered to be small if=h is small, say =h < 20:


Theorem: For admissible structures the method of structural risk minimization provides approximations for which the sequence of risks converges to the best possible one with asymptotic rate of convergence^10

(25)

if the law by which the element of the structure is chosen as a function of the sample size is such that

(26)

In (25) the constants are, respectively, the bound on the functions of the chosen element of the structure and its rate of approximation.

V. THEORY OF CONSTRUCTING LEARNING ALGORITHMS

To implement the SRM induction principle in learning algorithms one has to control two factors that appear in the bound (20) which has to be minimized:

1) the value of the empirical risk;
2) the capacity factor (to choose the element of the structure with the appropriate value of VC-dimension).

Below we restrict ourselves to the pattern recognition case. We consider two types of learning machines:

1) neural networks (NN's), which were inspired by the biological analogy to the brain;
2) support vector machines, which were inspired by statistical learning theory.

We will discuss how each machine can control these factors.

A. Methods of Separating Hyperplanes and Their Generalization

Consider first the problem of minimizing the empirical risk on the set of linear indicator functions

f(x, w) = θ((w · x)),  w ∈ R^n,     (27)

where θ is the step function. Let

(x_1, y_1), ..., (x_ℓ, y_ℓ)

be a training set, where x_j is a vector and y_j ∈ {0, 1}, j = 1, ..., ℓ.

To minimize the empirical risk one has to find the parameters (weights) w = (w_1, ..., w_n) which minimize the empirical risk functional

R_emp(w) = (1/ℓ) Σ_{j=1}^{ℓ} | y_j − f(x_j, w) |.     (28)

There are several methods for minimizing this functional. In the case when the minimum of the empirical risk is zero, one can find the exact solution, while when the minimum of this functional is nonzero one can find an approximate solution. Therefore, by constructing a separating hyperplane one can control the value of the empirical risk.

10 We say that the random variables ξ_ℓ, ℓ = 1, 2, ..., converge to the value ξ_0 with asymptotic rate V(ℓ) if there exists a constant C such that V^(−1)(ℓ) |ξ_ℓ − ξ_0| →_P C as ℓ → ∞.

Unfortunately the set of separating hyperplanes is not flexible enough to provide low empirical risk for many real-life problems [13].

Two opportunities were considered to increase the flexibility of the sets of functions:

1) to use a richer set of indicator functions which are superpositions of linear indicator functions;
2) to map the input vectors into a high-dimensional feature space and construct in this space a Δ-margin separating hyperplane (see Example 2 in Section III-C).

The first idea corresponds to neural networks. The second idea leads to support vector machines.

B. Sigmoid Approximation of Indicator Functions and Neural Nets

To describe the idea behind the NN let us consider the method of minimizing the functional (28). It is impossible to use regular gradient-based methods of optimization to minimize this functional (the gradient of the indicator function θ(u) is either equal to zero or undefined). The solution is to approximate the set of indicator functions (27) by so-called sigmoid functions

f(x, w) = S((w · x)),     (29)

where S(u) is a smooth monotonic function such that S(−∞) = 0, S(+∞) = 1; for example, the logistic function and the (rescaled) hyperbolic tangent are sigmoid functions.

For the set of sigmoid functions, the empirical risk functional

R_emp(w) = (1/ℓ) Σ_{j=1}^{ℓ} ( y_j − S((w · x_j)) )²     (30)

is smooth in w. It has a gradient grad_w R_emp(w) and therefore can be minimized using gradient-based methods. For example, the gradient descent method uses the following update rule:

w_new = w_old − γ(n) grad_w R_emp(w_old),

where the step values γ(n) > 0 depend on the iteration number n. For convergence of the gradient descent method to a local minimum, it is enough that the γ(n) satisfy the conditions

Σ_n γ(n) = ∞,   Σ_n γ²(n) < ∞.

Thus, the idea is to use the sigmoid approximation at the stage of estimating the coefficients, and to use the indicator functions with these coefficients at the stage of recognition.
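As an illustration of this two-stage idea, here is a minimal numpy sketch (my own construction, not code from any of the papers): the logistic sigmoid and squared loss stand in for (29) and (30), the step sizes 1/t satisfy the two conditions above, and recognition switches back to the indicator function.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_single_neuron(X, y, n_iter=2000):
    """X: (l, n) inputs, y: (l,) labels in {0, 1}.  Gradient descent on (30)."""
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        gamma = 1.0 / t                       # sum gamma = inf, sum gamma^2 < inf
        s = sigmoid(X @ w)
        grad = -2.0 * X.T @ ((y - s) * s * (1.0 - s)) / len(y)   # grad of (30)
        w -= gamma * grad
    return w

def predict(X, w):
    # at recognition time, switch back to the indicator function theta((w . x))
    return (X @ w >= 0.0).astype(int)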

The generalization of this idea leads to feedforward NN's. In order to increase the flexibility of the set of decision rules

References

Page 16: A quick introduction to Statistical Learning Theory and Support Vector Machines

996 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999

of the learning machine, one considers a set of functions which are the superposition of several linear indicator functions (networks of neurons) [13] instead of the set of linear indicator functions (single neuron). All indicator functions in this superposition are replaced by sigmoid functions.

A method for calculating the gradient of the empirical risk for the sigmoid approximation of NN's, called the backpropagation method, was found [15], [12]. Using this gradient descent method, one can determine the corresponding coefficient values (weights) of all elements of the NN.

In the 1990s, it was proven that the VC dimension of NN's depends on the type of sigmoid functions and the number of weights in the NN. Under some general conditions the VC dimension of the NN is bounded (although it is sufficiently large). If the VC dimension does not change during the NN training procedure, the generalization ability of the NN depends on how well the NN minimizes the empirical risk using sufficiently large training data.

The three main problems encountered when minimizing the empirical risk using the backpropagation method are as follows.

1) The empirical risk functional has many local minima. Optimization procedures guarantee convergence to some local minimum. In general the function which is found using the gradient-based procedure can be far from the best one. The quality of the obtained approximation depends on many factors, in particular on the initial parameter values of the algorithm.

2) Convergence to a local minimum can be rather slow (due to the high dimensionality of the weight-space).

3) The sigmoid function has a scaling factor which affects the quality of the approximation. To choose the scaling factor one has to make a tradeoff between quality of approximation and the rate of convergence.

Therefore, a good minimization of the empirical risk depends in many respects on the art of the researcher.

C. The Optimal Separating Hyperplanes

To introduce the method which is an alternative to the NN, let us consider the optimal separating hyperplanes [25].

Suppose the training data

(x_1, y_1), ..., (x_ℓ, y_ℓ),  x ∈ R^n,  y ∈ {+1, −1},

can be separated by a hyperplane

(w · x) − b = 0.     (31)

We say that this set of vectors is separated by the optimal hyperplane (or the maximal margin hyperplane) if it is separated without error and the distance between the closest vector and the hyperplane is maximal.

To describe the separating hyperplane let us use the following form:

(w · x_i) − b ≥ 1    if y_i = 1,
(w · x_i) − b ≤ −1   if y_i = −1.

In the following we use a compact notation for these inequalities:

y_i [(w · x_i) − b] ≥ 1,  i = 1, ..., ℓ.     (32)

It is easy to check that the optimal hyperplane is the one that satisfies the conditions (32) and minimizes the functional

Φ(w) = ½ (w · w).     (33)

(The minimization is taken with respect to both the vector w and the scalar b.)

The solution to this optimization problem is given by the saddle point of the Lagrange functional (Lagrangian)

L(w, b, α) = ½ (w · w) − Σ_{i=1}^{ℓ} α_i { y_i [(w · x_i) − b] − 1 },     (34)

where the α_i are Lagrange multipliers. The Lagrangian has to be minimized with respect to w and b and maximized with respect to α_i ≥ 0.

At the saddle point, the solutions w_0, b_0, and α^0 should satisfy the conditions

∂L(w_0, b_0, α^0)/∂b = 0,   ∂L(w_0, b_0, α^0)/∂w = 0.

Rewriting these equations in explicit form one obtains the following properties of the optimal hyperplane.

1) The coefficients α_i^0 for the optimal hyperplane should satisfy the constraints

Σ_{i=1}^{ℓ} α_i^0 y_i = 0,   α_i^0 ≥ 0,  i = 1, ..., ℓ.     (35)

2) The parameters of the optimal hyperplane (the vector w_0) are a linear combination of the vectors of the training set:

w_0 = Σ_{i=1}^{ℓ} y_i α_i^0 x_i,   α_i^0 ≥ 0.     (36)

3) The solution must satisfy the following Kuhn–Tucker conditions:

α_i^0 { y_i [(w_0 · x_i) − b_0] − 1 } = 0,  i = 1, ..., ℓ.     (37)

From these conditions it follows that only some of the training vectors in expansion (36), the support vectors, can have nonzero coefficients α_i^0 in the expansion of w_0. The support vectors are the vectors for which the equality y_i [(w_0 · x_i) − b_0] = 1 is achieved. Therefore we obtain

w_0 = Σ_{support vectors} y_i α_i^0 x_i.     (38)

Substituting the expression for w_0 back into the Lagrangian and taking into account the Kuhn–Tucker conditions, one obtains the functional

W(α) = Σ_{i=1}^{ℓ} α_i − ½ Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j (x_i · x_j).     (39)

It remains to maximize this functional in the nonnegative quadrant α_i ≥ 0, i = 1, ..., ℓ,


under the constraint

Σ_{i=1}^{ℓ} α_i y_i = 0.     (40)

Putting the expression for w_0 into (31) we obtain the hyperplane as an expansion on support vectors:

f(x) = sign( Σ_{support vectors} y_i α_i^0 (x_i · x) − b_0 ).     (41)
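The following sketch solves this dual problem numerically for a small, linearly separable data set. It maximizes (39) subject to α_i ≥ 0 and (40) using scipy's general-purpose SLSQP routine (my choice of solver, not the method of the paper), then recovers w via (36) and b from the support vectors.

import numpy as np
from scipy.optimize import minimize

def fit_optimal_hyperplane(X, y):
    """X: (l, n) training vectors, y: (l,) labels in {-1, +1}; data assumed separable."""
    l = len(y)
    Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q_ij = y_i y_j (x_i . x_j)

    def neg_dual(a):                                   # minimize the negative of (39)
        return 0.5 * a @ Q @ a - a.sum()

    def neg_dual_grad(a):
        return Q @ a - np.ones(l)

    res = minimize(neg_dual, np.zeros(l), jac=neg_dual_grad, method="SLSQP",
                   bounds=[(0.0, None)] * l,                           # alpha_i >= 0
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # constraint (40)
    alpha = res.x
    sv = alpha > 1e-6                                  # support vectors: nonzero alpha
    w = ((alpha * y)[:, None] * X).sum(axis=0)         # expansion (36)
    b = np.mean(X[sv] @ w - y[sv])                     # from y_i [(w . x_i) - b] = 1
    return w, b, alpha

def decide(X, w, b):
    return np.sign(X @ w - b)                          # decision rule (41), primal form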

To construct the optimal hyperplane in the case when the data are linearly nonseparable, we introduce nonnegative slack variables ξ_i ≥ 0 and the functional

Φ(w, ξ) = ½ (w · w) + C Σ_{i=1}^{ℓ} ξ_i,

which we will minimize subject to the constraints

y_i [(w · x_i) − b] ≥ 1 − ξ_i,   ξ_i ≥ 0,  i = 1, ..., ℓ.

Using the same formalism with Lagrange multipliers one can show that the optimal hyperplane also has an expansion (41) on support vectors. The coefficients α_i can be found by maximizing the same quadratic form as in the separable case (39) under the slightly different constraints

0 ≤ α_i ≤ C,  i = 1, ..., ℓ,   Σ_{i=1}^{ℓ} α_i y_i = 0.     (42)
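A hedged illustration of the nonseparable case using an off-the-shelf solver rather than writing out the box-constrained dual: scikit-learn's SVC with a linear kernel maximizes (39) under constraints of the form (42). The constant C and the synthetic overlapping classes below are purely illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),   # two overlapping classes
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=10.0).fit(X, y)           # soft-margin optimal hyperplane
print("number of support vectors:", clf.support_vectors_.shape[0])
print("training error:", np.mean(clf.predict(X) != y))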

D. The Support Vector Network

The support-vector network implements the following idea [21]: Map the input vectors into a very high-dimensional feature space through some nonlinear mapping chosen a priori. In this space construct an optimal separating hyperplane. The goal is to create the situation described in Example 2 of Section III-C, where for Δ-margin separating hyperplanes the VC dimension is defined by the ratio R²/Δ² (R being the radius of the smallest sphere containing the data). To generalize well, we control (decrease) the VC dimension by constructing an optimal separating hyperplane (one that maximizes the margin). To increase the margin we use very high-dimensional spaces.

Example: Consider a mapping that allows us to construct decision polynomials in the input space. To construct a polynomial of degree two, one can create a feature space which has coordinates of the form: the n coordinates x_i, the n coordinates x_i², and the n(n − 1)/2 coordinates x_i x_j (i < j), where x = (x_1, ..., x_n). The separating hyperplane constructed in this space is a separating second-degree polynomial in the input space.

To construct a polynomial of degree d in an n-dimensional input space one has to construct an (approximately n^d)-dimensional feature space, where one then constructs the optimal hyperplane.

The problem then arises of how to computationally deal with such high-dimensional spaces: to construct a polynomial of degree 4 or 5 in a 200-dimensional space it is necessary to construct hyperplanes in a billion-dimensional feature space.

In 1992, it was noted [5] that both for describing the optimal separating hyperplane in the feature space (41) and for estimating the corresponding coefficients of the expansion of the separating hyperplane (39), one uses only the inner product of two vectors z(x_1) and z(x_2), which are the images in the feature space of the input vectors x_1 and x_2. Therefore, if one can evaluate the inner product of two vectors in the feature space as a function of the two variables in input space,

K(x_1, x_2) = (z(x_1) · z(x_2)),

then it will be possible to construct solutions which are equivalent to the optimal hyperplane in the feature space. To get this solution one only needs to replace the inner product in (39) and (41) with the function K.

In other words, one constructs nonlinear decision functions in the input space,

f(x) = sign( Σ_{support vectors} y_i α_i^0 K(x_i, x) − b_0 ),     (43)

that are equivalent to linear decision functions in the feature space. The coefficients α_i^0 in (43) are defined by maximizing the functional

W(α) = Σ_{i=1}^{ℓ} α_i − ½ Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j K(x_i, x_j)     (44)

under constraints (42).

In 1909 Mercer proved a theorem which defines the general form of inner products in Hilbert spaces.

Theorem: The general form of the inner product in Hilbert space is defined by a symmetric positive definite function K(x, x'), that is, a function satisfying the condition

∫∫ K(x, x') g(x) g(x') dx dx' ≥ 0

for all functions g satisfying the inequality

∫ g²(x) dx < ∞.

Therefore any function K(x, x') satisfying Mercer's condition can be used for constructing rule (43), which is equivalent to constructing an optimal separating hyperplane in some feature space.

The learning machines which construct decision functions of the type (43) are called support vector networks or support vector machines (SVM's).11

Using different expressions for the inner product K(x, x') one can construct different learning machines with arbitrary types of (nonlinear in input space) decision surfaces.

11 This name stresses that for constructing this type of machine, the idea of expanding the solution on support vectors is crucial. In the SVM the complexity of the construction depends on the number of support vectors rather than on the dimensionality of the feature space.


For example, to specify polynomials of any fixed order q one can use the following function for the inner product in the corresponding feature space:

K(x, x') = ((x · x') + 1)^q.

Radial basis function machines with decision functions of the form

f(x) = sign( Σ_i a_i K_γ(|x − x_i|) − b )

can be implemented by using a function of the type

K_γ(|x − x'|) = exp(−γ |x − x'|²).

In this case the SVM machine will find both the centers x_i and the corresponding weights a_i.
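Written out as Gram-matrix routines, the two inner products just mentioned look as follows; the function names and the exact parametrization of the radial basis kernel are my own choices, not notation from the text.

import numpy as np

def polynomial_kernel(X, Y, degree=3):
    """Inner product in the feature space of polynomials of fixed order: ((x . x') + 1)^q."""
    return (X @ Y.T + 1.0) ** degree

def rbf_kernel(X, Y, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||x - x'||^2)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq_dists)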

The SVM possesses some useful properties.

• The optimization problem for constructing an SVM has a unique solution.
• The learning process for constructing an SVM is rather fast.
• Simultaneously with constructing the decision rule, one obtains the set of support vectors.
• Implementation of a new set of decision functions can be done by changing only one function (the kernel K(x, x'), which defines the dot product in the feature space).
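A small usage illustration of the last property, assuming scikit-learn's SVC as the solver: changing the kernel argument is the only modification needed, and the support vectors come back together with the decision rule. Data and parameter values are placeholders.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3}),
                       ("rbf", {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    print(kernel, "support vectors:", len(clf.support_),
          "training accuracy:", clf.score(X, y))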

E. Why Can Neural Networks and Support Vector Networks Generalize?

The generalization ability of both NN's and support vector networks is based on the factors described in the theory for controlling the generalization of learning processes. According to this theory, to guarantee a high rate of generalization of the learning machine one has to construct a structure

S_1 ⊂ S_2 ⊂ · · · ⊂ S_n ⊂ · · ·

on the set of decision functions, and then choose both an appropriate element of the structure and a function within this element that minimizes bound (20). The bound (16) can be rewritten in the simple form

R(α) ≤ R_emp(α) + Ω(ℓ/h),     (45)

where the first term is an estimate of the risk and the second is the confidence interval for this estimate.

In designing an NN, one determines a set of admissible functions with some VC-dimension h*. For a given amount ℓ of training data the value h* determines the confidence interval Ω(ℓ/h*) for the network. Choosing the appropriate element of a structure is therefore a problem of designing the network for a given training set.

During the learning process this network minimizes the first term in the bound (45) (the number of errors on the training set).

If it happens that at the stage of designing the network one constructs a network that is too complex (for the given amount of training data), the confidence interval Ω(ℓ/h*) will be large. In this case, even if one could minimize the empirical risk down to zero, the number of errors on the test set could still be big. This case is called overfitting.

To avoid overfitting (to get a small confidence interval) one has to construct networks with small VC-dimension.

Therefore, to generalize well using an NN one must first suggest an appropriate architecture of the NN, and second, find in this network the function that minimizes the number of errors on the training data. For NN's both of these problems are solved using heuristics (see the remarks on the backpropagation method).

In support vector methods one can control both parameters: in the separable case one obtains the unique solution which minimizes the empirical risk (down to zero) using the Δ-margin separating hyperplane with the maximal margin (i.e., the subset with the smallest VC dimension).

In the general case one obtains the unique solution when one chooses the value of the trade-off parameter C.
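The theory does not prescribe the trade-off constant itself; a common practical recipe, not taken from this article, is to select it by cross-validation. A minimal sketch with scikit-learn, in which the grid, kernel, and synthetic data are placeholders:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
search = GridSearchCV(SVC(kernel="rbf", gamma="scale"),
                      param_grid={"C": [0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X, y)
print("selected trade-off parameter C:", search.best_params_["C"])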

VI. CONCLUSION

This article presents a very general overview of statistical learning theory. It demonstrates how an abstract analysis allows us to discover a general model of generalization.

According to this model, the generalization ability of learning machines depends on capacity concepts which are more sophisticated than merely the dimensionality of the space or the number of free parameters of the loss function (these concepts are the basis for the classical paradigm of generalization).

The new understanding of the mechanisms behind generalization not only changes the theoretical foundation of generalization (for example, from the new point of view the Occam razor principle is not always correct), but also changes the algorithmic approaches to function estimation problems. The approach described is rather general. It can be applied to various function estimation problems including regression, density estimation, solving inverse equations, and so on.

Statistical learning theory started more than 30 years ago. The development of this theory did not involve many researchers. After the success of the SVM in solving real-life problems, the interest in statistical learning theory significantly increased. For the first time, abstract mathematical results in statistical learning theory have a direct impact on algorithmic tools of data analysis. In the last three years a lot of articles have appeared that analyze the theory of inference and the SVM method from different perspectives. These include:

1) obtaining better constructive bounds than the classical ones described in this article (which are closer in spirit to the nonconstructive bound based on the growth function than to bounds based on the VC dimension concept); success in this direction could lead, in particular, to creating machines that generalize better than the SVM based on the concept of the optimal hyperplane;

2) extending the SVM ideology to many different problems of function and data analysis;

3) developing a theory that allows us to create kernels that possess desirable properties (for example, that can enforce desirable invariants);


4) developing a new type of inductive inference that is based on direct generalization from the training set to the test set, avoiding the intermediate problem of estimating a function (the transductive type of inference).

The hope is that this very fast growing area of research will significantly boost all branches of data analysis.

ACKNOWLEDGMENT

The author wishes to thank F. Mulier for discussions and for helping to make this article more clear and readable.

REFERENCES

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler, "Scale-sensitive dimensions, uniform convergence, and learnability," J. ACM, vol. 44, no. 4, pp. 617-631, 1997.
[2] P. L. Bartlett, P. Long, and R. C. Williamson, "Fat-shattering and the learnability of real-valued functions," J. Comput. Syst. Sci., vol. 52, no. 3, pp. 434-452, 1996.
[3] P. L. Bartlett and J. Shawe-Taylor, "Generalization performance on support vector machines and other pattern classifiers," in B. Schölkopf, C. Burges, and A. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.
[4] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," J. ACM, vol. 36, no. 4, pp. 929-965, 1989.
[5] B. Boser, I. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annu. Wkshp. Comput. Learning Theory, Pittsburgh, PA: ACM, 1992, pp. 144-152.
[6] C. J. C. Burges, "Simplified support vector decision rule," in Proc. 13th Int. Conf. Machine Learning, San Mateo, CA, 1996, pp. 71-77.
[7] C. J. C. Burges, "Geometry and invariance in kernel-based methods," in B. Schölkopf, C. Burges, and A. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.
[8] C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[9] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.
[10] F. Girosi, "An equivalence between sparse approximation and support vector machines," Neural Comput., vol. 10, no. 6, pp. 1455-1480, 1998.
[11] F. Girosi, M. Jones, and T. Poggio, "Regularization theory and neural networks architectures," Neural Comput., vol. 7, no. 2, pp. 219-269, 1995.
[12] Y. Le Cun, "Learning processes in an asymmetric threshold network," in E. Bienenstock, F. Fogelman-Soulié, and G. Weisbuch, Eds., Disordered Systems and Biological Organizations. Les Houches, France: Springer-Verlag, 1986, pp. 233-240.
[13] M. L. Minsky and S. A. Papert, Perceptrons. Cambridge, MA: MIT Press, 1969, 248 pp.
[14] M. Opper, "On the annealed VC entropy for margin classifiers: A statistical mechanics study," in B. Schölkopf, C. Burges, and A. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. Cambridge, MA: Bradford Books, 1986, pp. 318-362.
[16] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony, "Structural risk minimization," IEEE Trans. Inform. Theory, 1998.
[17] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, pp. 1299-1319, 1998.
[18] B. Schölkopf, A. Smola, and K.-R. Müller, "The connection between regularization operators and support vector kernels," Neural Networks, vol. 11, pp. 637-649, 1998.
[19] M. Talagrand, "The Glivenko-Cantelli problem, ten years later," J. Theoretical Probability, vol. 9, no. 2, pp. 371-384, 1996.
[20] V. N. Vapnik, Estimation of Dependencies Based on Empirical Data. Moscow, Russia: Nauka, 1979, 448 pp. (in Russian). English translation: New York: Springer-Verlag, 1982, 400 pp.
[21] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995, 188 pp.
[22] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998, 736 pp.
[23] V. N. Vapnik and A. Ja. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Rep. Academy Sci. USSR, vol. 181, no. 4, 1968.
[24] V. N. Vapnik and A. Ja. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory Probab. Appl., vol. 16, pp. 264-280, 1971.
[25] V. N. Vapnik and A. Ja. Chervonenkis, Theory of Pattern Recognition. Moscow, Russia: Nauka, 1974 (in Russian). German translation: W. N. Wapnik and A. Ja. Chervonenkis, Theorie der Zeichenerkennung. Berlin, Germany: Akademie-Verlag, 1979, 353 pp.
[26] V. N. Vapnik and A. Ja. Chervonenkis, "Necessary and sufficient conditions for the uniform convergence of the means to their expectations," Theory Probab. Appl., vol. 26, pp. 532-553, 1981.
[27] V. N. Vapnik and A. Ja. Chervonenkis, "The necessary and sufficient conditions for consistency of the method of empirical risk minimization," Yearbook of the Academy of Sciences of the USSR on Recognition, Classification, and Forecasting, vol. 2, pp. 217-249. Moscow: Nauka, 1989 (in Russian). English translation: Pattern Recognition and Image Analysis, vol. 1, no. 3, pp. 284-305, 1991.
[28] M. Vidyasagar, A Theory of Learning and Generalization. New York: Springer, 1997.
[29] G. Wahba, Spline Models for Observational Data, vol. 59. Philadelphia, PA: SIAM, 1990.
[30] R. C. Williamson, A. Smola, and B. Schölkopf, "Entropy numbers, operators, and support vector kernels," in B. Schölkopf, C. Burges, and A. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.

Vladimir N. Vapnik was born in Russia and received the Ph.D. degree in statistics from the Institute of Control Sciences, Academy of Sciences of the USSR, Moscow, Russia, in 1964.

Since 1991, he has been working for AT&T Bell Laboratories (since 1996, AT&T Labs Research), Red Bank, NJ. His research interests include statistical learning theory, theoretical and applied statistics, theory and methods for solving stochastic ill-posed problems, and methods of multidimensional function approximation. His main results in the last three years are related to the development of the support vector method. He is the author of many publications, including seven monographs on various problems of statistical learning theory.


A Training Algorithm for Optimal Margin Classifiers

Bernhard E. Boser*, EECS Department, University of California, Berkeley, CA 94720 ([email protected])
Isabelle M. Guyon, AT&T Bell Laboratories, 50 Fremont Street, 6th Floor, San Francisco, CA 94105 (isabelle@neural.att.com)
Vladimir N. Vapnik, AT&T Bell Laboratories, Crawford Corner Road, Holmdel, NJ 07733 (vlad@neural.att.com)

Abstract

A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.

1 INTRODUCTION

Good generalization performance of pattern classifiers is achieved when the capacity of the classification function is matched to the size of the training set. Classifiers with a large number of adjustable parameters, and therefore large capacity, are likely to learn the training set without error, but exhibit poor generalization. Conversely, a classifier with insufficient capacity might not be able to learn the task at all. In between, there is an optimal capacity of the classifier which minimizes the expected generalization error for a given amount of training data. Both experimental evidence and theoretical studies [GBD92,

* Part of this work was performed while B. Boser was with AT&T Bell Laboratories. He is now at the University of California, Berkeley.


Moo92, GVB+92, Vap82, BH89, TLS89, Mac92] link the generalization of a classifier to the error on the training examples and the complexity of the classifier. Methods such as structural risk minimization [Vap82] vary the complexity of the classification function in order to optimize the generalization.

In this paper we describe a training algorithm that automatically tunes the capacity of the classification function by maximizing the margin between training examples and class boundary [KM87], optionally after removing some atypical or meaningless examples from the training data. The resulting classification function depends only on so-called supporting patterns [Vap82]. These are those training examples that are closest to the decision boundary and are usually a small subset of the training data.

It will be demonstrated that maximizing the margin amounts to minimizing the maximum loss, as opposed to some average quantity such as the mean squared error. This has several desirable consequences. The resulting classification rule achieves an errorless separation of the training data if possible. Outliers or meaningless patterns are identified by the algorithm and can therefore be eliminated easily with or without supervision. This contrasts with classifiers based on minimizing the mean squared error, which quietly ignore atypical patterns. Another advantage of maximum margin classifiers is that the sensitivity of the classifier to limited computational accuracy is minimal compared to other separations with smaller margin. In analogy to [Vap82, HLW88] a bound on the generalization performance is obtained with the "leave-one-out" method. For the maximum margin classifier it is the ratio of the number of linearly independent supporting patterns to the number of training examples. This bound is tighter than a bound based on the capacity of the classifier family.

The proposed algorithm operates with a large class of decision functions that are linear in their parameters but not restricted to linear dependence in the input components. Perceptrons [Ros62], polynomial classifiers, neural networks with one hidden layer, and Radial Basis Function (RBF) or potential function classifiers [ABR64, BL88, MD89] fall into this class. As pointed out by several authors [ABR64, DH73, PG90], Perceptrons


have a dual kernel representation implementing the same decision function. The optimal margin algorithm exploits this duality both for improved efficiency and flexibility. In the dual space the decision function is expressed as a linear combination of basis functions parametrized by the supporting patterns. The supporting patterns correspond to the class centers of RBF classifiers and are chosen automatically by the maximum margin training procedure. In the case of polynomial classifiers, the Perceptron representation involves an intractable number of parameters. This problem is overcome in the dual space representation, where the classification rule is a weighted sum of a kernel function [Pog75] for each supporting pattern. High order polynomial classifiers with very large training sets can therefore be handled efficiently with the proposed algorithm.

The training algorithm is described in Section 2. Section 3 summarizes important properties of optimal margin classifiers. Experimental results are reported in Section 4.

2 MAXIMUM MARGIN TRAINING ALGORITHM

The maximum margin training algorithm finds a decision function for pattern vectors x of dimension n belonging to either of two classes A and B. The input to the training algorithm is a set of p examples x_i with labels y_i:

(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_p, y_p)     (1)

where
  y_k = 1    if x_k ∈ class A,
  y_k = −1   if x_k ∈ class B.

From these training examples the algorithm finds the parameters of the decision function D(x) during a learning phase. After training, the classification of unknown patterns is predicted according to the following rule:

x ∈ A if D(x) > 0,  x ∈ B otherwise.     (2)

The decision functions must be linear in their parameters but are not restricted to linear dependence on x. These functions can be expressed either in direct or in dual space. The direct space notation is identical to the Perceptron decision function [Ros62]:

D(x) = Σ_{i=1}^{N} w_i φ_i(x) + b.     (3)

In this equation the φ_i are predefined functions of x, and the w_i and b are the adjustable parameters of the decision function. Polynomial classifiers are a special case of Perceptrons for which the φ_i(x) are products of components of x.

In the dual space, the decision functions are of the form

D(x) = Σ_{k=1}^{p} α_k K(x_k, x) + b.     (4)

The coefficients α_k are the parameters to be adjusted and the x_k are the training patterns. The function K is a predefined kernel, for example a potential function [ABR64] or any Radial Basis Function [BL88, MD89]. Under certain conditions [CH53], symmetric kernels possess finite or infinite series expansions of the form

K(x, x') = Σ_i φ_i(x) φ_i(x').     (5)

In particular, the kernel K(x, x') = (x · x' + 1)^q corresponds to a polynomial expansion φ(x) of order q [Pog75].

Provided that the expansion stated in equation 5 exists, equations 3 and 4 are dual representations of the same decision function, and

w_i = Σ_{k=1}^{p} α_k φ_i(x_k).     (6)

The parameters w_i are called direct parameters, and the α_k are referred to as dual parameters.
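A quick numeric check of the expansion in equation 5 for the kernel (x · x' + 1)^q with q = 2 and two-dimensional inputs. The explicit feature map phi below is one of several equivalent choices and is not taken from the paper.

import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
xp = np.array([2.0, 0.5])
# (x . x' + 1)^2 equals the ordinary inner product of the explicit feature vectors
assert np.isclose((x @ xp + 1.0) ** 2, phi(x) @ phi(xp))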

The proposed training algorithm is based on the "generalized portrait" method described in [Vap82] that constructs separating hyperplanes with maximum margin. Here this algorithm is extended to train classifiers linear in their parameters. First, the margin between the class boundary and the training patterns is formulated in the direct space. This problem description is then transformed into the dual space by means of the Lagrangian. The resulting problem is that of maximizing a quadratic form with constraints and is amenable to efficient numeric optimization algorithms [Lue84].

2.1 MAXIMIZING THE MARGIN IN THE DIRECT SPACE

In the direct space the decision function is

D(x) = w · φ(x) + b,     (7)

where w and φ(x) are N-dimensional vectors and b is a bias. It defines a separating hyperplane in φ-space. The distance between this hyperplane and pattern x is D(x)/||w|| (Figure 1). Assuming that a separation of the training set with margin M between the class boundary and the training patterns exists, all training patterns fulfill the following inequality:

y_k D(x_k) ≥ M ||w||,  k = 1, 2, ..., p.     (8)

The objective of the training algorithm is to find the parameter vector w that maximizes M:

M* = max_{w, ||w||=1} M     (9)
subject to  y_k D(x_k) ≥ M,  k = 1, 2, ..., p.

The bound M* is attained for those patterns satisfying

min_k y_k D(x_k) = M*.     (10)


Figure 1: Maximum margin linear decision function D(x) = w · x + b (φ = x). The gray levels encode the absolute value of the decision function (solid black corresponds to D(x) = 0). The numbers indicate the supporting patterns.

These patterns are called the supporting patterns of the decision boundary.

A decision function with maximum margin is illustrated in Figure 1. The problem of finding a hyperplane in φ-space with maximum margin is therefore a minimax problem:

max_{w, ||w||=1} min_k y_k D(x_k).     (11)

The norm of the parameter vector in equations 9 and 11 is fixed to pick one of an infinite number of possible solutions that differ only in scaling. Instead of fixing the norm of w to take care of the scaling problem, the product of the margin M and the norm of the weight vector w can be fixed:

M ||w|| = 1.     (12)

Thus, maximizing the margin M is equivalent to minimizing the norm ||w||.1 Then the problem of finding a maximum margin separating hyperplane w*, stated in 9, reduces to solving the following quadratic problem:

min_w ||w||²     (13)
under the conditions  y_k D(x_k) ≥ 1,  k = 1, 2, ..., p.

The maximum margin is M* = 1/||w*||.

In principle the problem stated in 13 can be solved directly with numerical techniques. However, this approach is impractical when the dimensionality of the φ-space is large or infinite. Moreover, no information is gained about the supporting patterns.

1 If the training data is not linearly separable the maximum margin may be negative. In this case, M||w|| = −1 is imposed. Maximizing the margin is then equivalent to maximizing ||w||.

2.2 MAXIMIZING THE MARGIN IN THE DUAL SPACE

Problem 13 can be transformed into the dual space by means of the Lagrangian [Lue84]

L(w, b, α) = ½||w||² − Σ_{k=1}^{p} α_k [y_k D(x_k) − 1],     (14)
subject to α_k ≥ 0,  k = 1, 2, ..., p.

The factors α_k are called Lagrange multipliers or Kühn-Tucker coefficients and satisfy the conditions

α_k (y_k D(x_k) − 1) = 0,  k = 1, 2, ..., p.     (15)

The factor one half has been included for cosmetic reasons; it does not change the solution.

The optimization problem 13 is equivalent to searching for a saddle point of the function L(w, b, α). This saddle point is a minimum of L(w, b, α) with respect to w, and a maximum with respect to α (α_k ≥ 0). At the solution, the following necessary condition is met:

∂L/∂w = w − Σ_{k=1}^{p} α_k y_k φ(x_k) = 0,

hence

w* = Σ_{k=1}^{p} α_k* y_k φ(x_k).     (16)

The patterns which satisfy y_k D(x_k) = 1 are the supporting patterns. According to equation 16, the vector w* that specifies the hyperplane with maximum margin is a linear combination of only the supporting patterns, which are those patterns for which α_k* ≠ 0. Usually the number of supporting patterns is much smaller than the number p of patterns in the training set.


The dependence of the Lagrangian L(w, b, α) on the weight vector w is removed by substituting the expansion of w* given by equation 16 for w. Further transformations using 3 and 5 result in a Lagrangian which is a function of the parameters α and the bias b only:

J(α, b) = Σ_{k=1}^{p} α_k (1 − b y_k) − ½ αᵀHα,     (17)
subject to α_k ≥ 0,  k = 1, 2, ..., p.

Here H is a square matrix of size p × p with elements

H_kl = y_k y_l K(x_k, x_l).

In order for a unique solution to exist, H must be positive definite. For fixed bias b, the solution α* is obtained by maximizing J(α, b) under the conditions α_k ≥ 0. Based on equations 7 and 16, the resulting decision function is of the form

D(x) = w* · φ(x) + b     (18)
     = Σ_k y_k α_k* K(x_k, x) + b,   α_k* > 0,

where only the supporting patterns appear in the sum with nonzero weight.

The choice of the bias b gives rise to several variants of the algorithm. The two considered here are:

1. The bias can be fixed a priori and not subjected to training. This corresponds to the "Generalized Portrait Technique" described in [Vap82].

2. The cost function 17 can be optimized with respect to w and b. This approach gives the largest possible margin M* in φ-space [VC74].

In both cases the solution is found with standard nonlinear optimization algorithms for quadratic forms with linear constraints [Lue84, Loo72]. The second approach gives the largest possible margin. There is no guarantee, however, that this solution also exhibits the best generalization performance.

A strategy to optimize the margin with respect to both w and b is described in [Vap82]. It solves problem 17 for differences of pattern vectors to obtain α* independent of the bias, which is computed subsequently. The margin in φ-space is maximized when the decision boundary is halfway between the two classes. Hence the bias b* is obtained by applying 18 to two arbitrary supporting patterns x_A ∈ class A and x_B ∈ class B, taking into account that D(x_A) = 1 and D(x_B) = −1:

b* = −½ (w* · φ(x_A) + w* · φ(x_B))     (19)
   = −½ Σ_{k=1}^{p} y_k α_k* [K(x_A, x_k) + K(x_B, x_k)].

The dimension of problem 17 equals the size of the training set, p. To avoid the need to solve a dual problem of exceedingly large dimensionality, the training data is divided into chunks that are processed iteratively [Vap82]. The maximum margin hypersurface is constructed for the first chunk, and a new training set is formed consisting of the supporting patterns from this solution and those patterns x_k in the second chunk of the training set for which y_k D(x_k) < 1 − ε. A new classifier is trained and used to construct a training set consisting of supporting patterns and examples from the first three chunks which satisfy y_k D(x_k) < 1 − ε. This process is repeated until the entire training set is separated.
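A rough sketch of this chunking procedure (a single pass for brevity, whereas the text repeats the process until the whole set is separated), with scikit-learn's SVC standing in for the paper's quadratic programming routine and a large C approximating the hard margin; all parameter values are illustrative.

import numpy as np
from sklearn.svm import SVC

def chunked_training(X, y, chunk_size=500, eps=1e-3):
    """X, y: numpy arrays; labels in {-1, +1}."""
    working_X, working_y = X[:chunk_size], y[:chunk_size]
    clf = SVC(kernel="linear", C=1e6).fit(working_X, working_y)   # ~hard margin on first chunk
    start = chunk_size
    while start < len(y):
        chunk_X, chunk_y = X[start:start + chunk_size], y[start:start + chunk_size]
        start += chunk_size
        margins = chunk_y * clf.decision_function(chunk_X)
        violators = margins < 1.0 - eps                 # patterns with y_k D(x_k) < 1 - eps
        sv_X = clf.support_vectors_                     # keep the current supporting patterns
        sv_y = working_y[clf.support_]
        working_X = np.vstack([sv_X, chunk_X[violators]])
        working_y = np.concatenate([sv_y, chunk_y[violators]])
        clf = SVC(kernel="linear", C=1e6).fit(working_X, working_y)
    return clf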

3 PROPERTIES OF THE ALGORITHM

In this Section, we highlight some important aspects of the optimal margin training algorithm. The description is split into a discussion of the qualities of the resulting classifier, and computational considerations. Classification performance advantages over other techniques will be illustrated in the Section on experimental results.

3.1 PROPERTIES OF THE SOLUTION

Since maximizing the margin between the decision boundary and the training patterns is equivalent to maximizing a quadratic form in the positive quadrant, there are no local minima and the solution is always unique if H has full rank. At the optimum

J(α*) = ½ ||w*||² = 1 / (2 (M*)²) = ½ Σ_{k=1}^{p} α_k*.     (20)

The uniqueness of the solution is a consequence of the maximum margin cost function and represents an important advantage over other algorithms for which the solution depends on the initial conditions or other parameters that are difficult to control.

Another benefit of the maximum margin objective is its insensitivity to small changes of the parameters w or α. Since the decision function D(x) is a linear function of w in the direct space, and of α in the dual space, the probability of misclassifications due to parameter variations of the components of these vectors is minimized for maximum margin. The robustness of the solution (and potentially its generalization performance) can be increased further by omitting some supporting patterns from the solution. Equation 20 indicates that the largest increase in the maximum margin M* occurs when the supporting patterns with the largest α_k are eliminated. The elimination can be performed automatically or with assistance from a supervisor. This feature gives rise to other important uses of the optimum margin algorithm in database cleaning applications [MGB+92].

Figure 2 compares the decision boundary for maximum margin and mean squared error (MSE) cost functions. Unlike the MSE based decision function which simply ignores the outlier, optimal margin classifiers are very sensitive to atypical patterns that are close to the decision boundary.


Figure 2: Linear decision boundary for MSE (left) and maximum margin cost functions (middle, right) in the presence of an outlier. In the rightmost picture the outlier has been removed. The numbers reflect the ranking of supporting patterns according to the magnitude of their Lagrange coefficient α_k for each class individually.

These examples are readily identified as those with the largest α_k and can be eliminated either automatically or with supervision. Hence, optimal margin classifiers give complete control over the handling of outliers, as opposed to quietly ignoring them.

The optimum margin algorithm performs automatic capacity tuning of the decision function to achieve good generalization. An estimate for an upper bound of the generalization error is obtained with the "leave-one-out" method: A pattern x_k is removed from the training set. A classifier is then trained on the remaining patterns and tested on x_k. This process is repeated for all p training patterns. The generalization error is estimated by the ratio of misclassified patterns over p. For a maximum margin classifier, two cases arise: if x_k is not a supporting pattern, the decision boundary is unchanged and x_k will be classified correctly. If x_k is a supporting pattern, two cases are possible:

1. The pattern x_k is linearly dependent on the other supporting patterns. In this case it will be classified correctly.

2. x_k is linearly independent from the other supporting patterns. In this case the outcome is uncertain. In the worst case m' linearly independent supporting patterns are misclassified when they are omitted from the training data.

Hence the frequency of errors obtained by this method is at most m'/p, and has no direct relationship with the number of adjustable parameters. The number of linearly independent supporting patterns m' itself is bounded by min(N, p). This suggests that the number of supporting patterns is related to an effective capacity of the classifier that is usually much smaller than the VC-dimension, N + 1 [Vap82, HLW88].
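The leave-one-out estimate described here, written out by brute force as a sketch; for the maximum margin classifier the bound m'/p makes this computation unnecessary, but the procedure itself is short. The solver and its parameters are assumptions.

import numpy as np
from sklearn.svm import SVC

def leave_one_out_error(X, y, **svc_params):
    errors = 0
    for k in range(len(y)):
        mask = np.arange(len(y)) != k                      # remove pattern x_k
        clf = SVC(**svc_params).fit(X[mask], y[mask])      # retrain on the remaining patterns
        errors += int(clf.predict(X[k:k + 1])[0] != y[k])  # test on x_k
    return errors / len(y)                                 # estimated generalization error

# usage: leave_one_out_error(X, y, kernel="linear", C=10.0)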

In polynomial classifiers, for example, N ≈ n^q, where n is the dimension of x-space and q is the order of the polynomial. In practice, m ≤ p << N, i.e. the number of supporting patterns is much smaller than the dimension of the φ-space. The capacity tuning realized by the maximum margin algorithm is essential to get generalization with high-order polynomial classifiers.

3.2 COMPUTATIONAL CONSIDERATIONS

Speed and convergence are important practical considerations of classification algorithms. The benefit of the dual space representation in reducing the number of computations required, for example for polynomial classifiers, has been pointed out already. In the dual space, each evaluation of the decision function D(x) requires m evaluations of the kernel function K(x_k, x) and forming the weighted sum of the results. This number can be further reduced through the use of appropriate search techniques which omit evaluations of K that yield negligible contributions to D(x) [Omo91].

Typically, the training time for a separating surface from a database with several thousand examples is a few minutes on a workstation, when an efficient optimization algorithm is used. All experiments reported in the next section on a database with 7300 training examples took less than five minutes of CPU time per separating surface. The optimization was performed with an algorithm due to Powell that is described in [Lue84] and available from public numerical libraries.

Quadratic optimization problems of the form stated in 17 can be solved in polynomial time with the Ellipsoid method [NY83]. This technique first finds a region that is guaranteed to contain the optimum; then the volume of this region is reduced iteratively by a constant fraction. The algorithm is polynomial in the number of free parameters p and the encoding size (i.e., the accuracy of the problem and solution). In practice, however, algorithms without guaranteed polynomial convergence are more efficient.


Figure 3: Supporting patterns from database DB2 for class 2 before cleaning. The patterns are ranked according to α_k.

4 EXPERIMENTAL RESULTS

The maximum margin training algorithm has been tested on two databases with images of handwritten digits. The first database (DB1) consists of 1200 clean images recorded from ten subjects. Half of this data is used for training, and the other half is used to evaluate the generalization performance. A comparative analysis of the performance of various classification methods on DB1 can be found in [GVB+92, GPP+89, GBD92]. The other database (DB2) used in the experiment consists of 7300 images for training and 2000 for testing and has been recorded from actual mail pieces. Results for this data have been reported in several publications, see e.g. [CBD+90]. The resolution of the images in both databases is 16 by 16 pixels.

In all experiments, the margin is maximized with respect to w and b. Ten hypersurfaces, one per class, are used to separate the digits. Regardless of the difficulty of the problem, measured for example by the number of supporting patterns found by the algorithm, the same similarity function K(x, x') and preprocessing is used for all hypersurfaces of one experiment. The results obtained with different choices of K corresponding to linear hyperplanes, polynomial classifiers, and basis functions are summarized below. The effect of smoothing is investigated as a simple form of preprocessing.

For linear hyperplane classifiers, corresponding to the similarity function K(x, x') = x · x', the algorithm finds an errorless separation for database DB1. The percentage of errors on the test set is 3.2%. This result compares favorably to hyperplane classifiers which minimize the mean squared error (backpropagation or pseudo-inverse), for which the error on the test set is 12.7%.

Database DB2 is also linearly separable but contains several meaningless patterns. Figure 3 shows the supporting patterns with large Lagrange multipliers α_k for the hyperplane for class 2. The percentage of misclassifications on the test set of DB2 drops from 15.2% without cleaning to 10.5% after removing meaningless and ambiguous patterns.

Better performance has been achieved with both databases using multilayer neural networks or other classification functions with higher capacity than linear subdividing planes. Tests with polynomial classifiers of order q, for which K(x, x') = (x · x' + 1)^q, give the following error rates and average number of supporting patterns per hypersurface, <m>. This average is computed as the total number of supporting patterns divided by the number of decision functions. Patterns that support more than one hypersurface are counted only once in the total. For comparison, the dimension N of φ-space is also listed.

[Table: test-set error and average number of supporting patterns <m> per hypersurface on DB1 and DB2 for polynomial classifiers of order q = 1 (linear) through 5, together with the dimension N of φ-space, which grows from N = 256 for the linear classifier to roughly 10^12 for q = 5.]

The results obtained for DB2 show a strong decrease of the number of supporting patterns from a linear to a third order polynomial classification function and an equivalently significant decrease of the error rate. Further increase of the order of the polynomial has little effect on either the number of supporting patterns or the performance, unlike the dimension of φ-space, N, which increases exponentially. The lowest error rate, 4.9%, is obtained with a fourth order polynomial and is slightly better than the 5.1% reported for a five layer neural network with a sophisticated architecture [CBD+90], which has been trained and tested on the same data.

In the above experiment, the performance changes drastically between first and second order polynomials. This may be a consequence of the fact that the maximum VC-dimension of a q-th order polynomial classifier is equal to the dimension n of the patterns to the q-th power and thus much larger than n. A more gradual change of the VC-dimension is possible when the function K is chosen to be a power series, for example

K(x, x') = exp(γ x · x') − 1.     (21)

In this equation the parameter γ is used to vary the VC-dimension gradually. For small values of γ, equation 21 approaches a linear classifier with VC-dimension at most equal to the dimension n of the patterns plus one.


Figure 4: Decision boundaries for maximum margin classifiers with second order polynomial decision rule K(x, x') = (x · x' + 1)² (left) and an exponential RBF K(x, x') = exp(−||x − x'||/2) (middle). The rightmost picture shows the decision boundary of a two layer neural network with two hidden units trained with backpropagation.

Experiments with database DB1 lead to a slightly better performance than the 1.5% obtained with a second order polynomial classifier:

γ        DB1 error
0.25     2.3%
0.50     2.2%
0.75     1.3%
1.00     1.5%
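The kernel of equation 21 can be plugged into a standard SVM implementation as a user-defined kernel; the sketch below assumes scikit-learn's SVC, which accepts a callable returning the Gram matrix, and an illustrative value of γ.

import numpy as np
from sklearn.svm import SVC

def power_series_kernel(X, Y, gamma=0.5):
    return np.exp(gamma * (X @ Y.T)) - 1.0        # K(x, x') = exp(gamma x . x') - 1

clf = SVC(kernel=lambda X, Y: power_series_kernel(X, Y, gamma=0.5), C=10.0)
# usage: clf.fit(X_train, y_train); clf.score(X_test, y_test)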

When I<(x, x’) is chosen to be the hyperbolic tangent,the resulting classifier can be interpreted as a neuralnetwork with one hidden layer with m hidden units. Thesupporting patterns are the weights in the first layer,and the coefficients ak the weights of the second, linearlayer. The number of hidden units is chosen by thetraining algorithm to maximize the margin between theclasses A and B. Substituting the hyperbolic tangent forthe exponential function did not lead to better resultsin our experiments.

The importance of a suitable preprocessing to incorpo-rate knowledge about the task at hand has been pointedout by many researchers. In optical character recogni-tion, preprocessing that introduce some invariance toscaling, rotation, and other distortions are particularlyimportant [SLD92]. As in [C,VB+ 92], smoothing is usedto achieve insensitivity to small distortions. The tablebelow lists the error on the test set for different amountsof smoothing. A second order polynomial classifier wasused for database DB1, and a forth order polynomial forDB2. The smoothing kernel is Gaussian with standarddeviation u.

σ (smoothing)    DB1 error   DB1 <m>    DB2 error   DB2 <m>
no smoothing     1.5%        44         4.9%        72
0.5              1.3%        41         4.6%        73
0.8              0.8%        36         5.0%        79
1.0              0.3%        31         6.0%        83
1.2              0.8%        31         -           -

The performance improved considerably for DB1. For DB2 the improvement is less significant and the optimum was obtained for less smoothing than for DB1. This is expected since the number of training patterns in DB2 is much larger than in DB1 (7000 versus 600). A higher performance gain can be expected for more selective hints than smoothing, such as invariance to small rotations or scaling of the digits [SLD92].
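The smoothing preprocessing used here, sketched for flattened 16 x 16 images with scipy's Gaussian filter; this implementation choice is an assumption (the paper does not specify its own routine), and sigma corresponds to the standard deviation in the table above.

import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_images(X_flat, sigma=0.8, shape=(16, 16)):
    """X_flat: (p, 256) array of flattened 16x16 digit images."""
    images = X_flat.reshape(-1, *shape)
    smoothed = np.stack([gaussian_filter(img, sigma=sigma) for img in images])
    return smoothed.reshape(len(X_flat), -1)   # smoothed images, flattened again for training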

Better performance might be achieved with other similarity functions K(x, x'). Figure 4 shows the decision boundary obtained with a second order polynomial and a radial basis function (RBF) maximum margin classifier with K(x, x') = exp(−||x − x'||/2). The decision boundary of the polynomial classifier is much closer to one of the two classes. This is a consequence of the nonlinear transform from φ-space to x-space of polynomials, which realizes a position dependent scaling of distance. Radial Basis Functions do not exhibit this problem. The decision boundary of a two layer neural network trained with backpropagation is shown for comparison.

5 CONCLUSIONS

Maximizing the margin between the class boundary and training patterns is an alternative to other training methods optimizing cost functions such as the mean squared error. This principle is equivalent to minimizing the maximum loss and has a number of important features. These include automatic capacity tuning of the classification function, extraction of a small number of supporting patterns from the training data that are relevant for the classification, and uniqueness of the solution. They are exploited in an efficient learning algorithm for classifiers linear in their parameters with very large capacity, such as high order polynomial or RBF classifiers. Key is the representation of the decision function in a dual space which is of much lower dimensionality than the feature space.

The efficiency and performance of the algorithm have been demonstrated on handwritten digit recognition


problems. The achieved performance matches that of sophisticated classifiers, even though no task specific knowledge has been used. The training algorithm is polynomial in the number of training patterns, even in cases when the dimension of the solution space (φ-space) is exponential or infinite. The training time in all experiments was less than an hour on a workstation.

Acknowledgements

We wish to thank our colleagues at UC Berkeley and AT&T Bell Laboratories for many suggestions and stimulating discussions. Comments by L. Bottou, C. Cortes, S. Sanders, S. Solla, A. Zakhor, and the reviewers are gratefully acknowledged. We are especially indebted to R. Baldick and D. Hochbaum for investigating the polynomial convergence property, S. Hein for providing the code for constrained nonlinear optimization, and D. Haussler and M. Warmuth for help and advice regarding performance bounds.

References

[ABR64] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
[BH89] E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.
[BL88] D. S. Broomhead and D. Lowe. Multivariate functional interpolation and adaptive networks. Complex Systems, 2:321-355, 1988.
[CBD+90] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 2, pages 396-404. Morgan Kaufmann, San Mateo, CA, 1990.
[CH53] R. Courant and D. Hilbert. Methods of Mathematical Physics. Interscience, New York, 1953.
[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley and Son, 1973.
[GBD92] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.
[GPP+89] I. Guyon, I. Poujaud, L. Personnaz, G. Dreyfus, J. Denker, and Y. LeCun. Comparing different neural network architectures for classifying handwritten digits. In Proc. Int. Joint Conf. Neural Networks, 1989.
[GVB+92] I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. Solla. Structural risk minimization for character recognition. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann, San Mateo, CA, 1992. To appear.
[HLW88] D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1}-functions on randomly drawn points. In Proceedings of the 29th Annual Symposium on the Foundations of Computer Science, pages 100-109. IEEE, 1988.
[KM87] W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. J. Phys. A: Math. Gen., 20:L745, 1987.
[Loo72] F. A. Lootsma, editor. Numerical Methods for Non-linear Optimization. Academic Press, London, 1972.
[Lue84] D. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, 1984.
[Mac92] D. MacKay. A practical Bayesian framework for backprop networks. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann, San Mateo, CA, 1992. To appear.
[MD89] J. Moody and C. Darken. Fast learning in networks of locally tuned processing units. Neural Computation, 1(2):281-294, 1989.
[MGB+92] N. Matic, I. Guyon, L. Bottou, J. Denker, and V. Vapnik. Computer-aided cleaning of large databases for character recognition. In Digest ICPR. ICPR, Amsterdam, August 1992.
[Moo92] J. Moody. Generalization, weight decay, and architecture selection for nonlinear learning systems. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann, San Mateo, CA, 1992. To appear.
[NY83] A. S. Nemirovsky and D. D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.
[Omo91] S. M. Omohundro. Bumptrees for efficient function, constraint and classification learning. In R. P. Lippmann et al., editors, NIPS-90. Morgan Kaufmann, San Mateo, CA, 1991.
[PG90] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, February 1990.


[Pog75] T. Poggio. On optimal nonlinear associative recall. Biol. Cybernetics, 19:201–209, 1975.
[Ros62] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, New York, 1962.
[SLD92] P. Simard, Y. LeCun, and J. Denker. Tangent prop—a formalism for specifying selected invariance in an adaptive network. In David S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.
[TLS89] N. Tishby, E. Levin, and S. A. Solla. Consistent inference of probabilities in layered networks: Predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington DC, 1989.
[Vap82] Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.
[VC74] V. N. Vapnik and A. Ya. Chervonenkis. The theory of pattern recognition. Nauka, Moscow, 1974.


�������������� ������� ������ ��������������! " "#$���%��&('*),+-#. %/10,2!#�/"�!34�!�506#87" %�:9<;=6>@?�>�AB#.C%)�DFEG'*)('H#.��I*JLKNM,O%=PRQTS*UHVXWZY�U.[ >@J%']\*^*_._"`.acbedFfHgchHbi`8a1jZgXbikl`.anmo�p)q#r %'*st2p'H#.�( %�! %uv3$#�I*J%�! %'1wi���x&�s���Dyu�����z"7I*2!#�)()(�|{ZIH#�&n�}�% o7"���%C%2}'H3~)H�6>5J%'-3�#�I�J%�! %'5I*�% %I*'H7%&�z"#�2p2p�o�!3$7%2p'H34'H 8&()6&cJ%'�w��82}2p��s5�! %u4�p/"'H#"�T�: "7"z%&� '*I*&(�%��)�#.��'� %�� 8Dy2p�! %'H#.��2p�13$#87"7"'*/R&(�4# � 'H���$J%�}u%J8Dy/"�:3~'H %)(�p�� �w�'H#�&cz"��'-)c7"#�I*'.�l�y �&�J%�p)�w�'H#�&�z"��')�7"#�I*'�#~2}�! %'H#.�@/"'*I*�p)(�p�� R)cz"��w�#�I*'��}),I*�� %)(&c�(z%I*&('*/B�lO"7"'*I*�!#�2l7"����7�'H��&(�p'*)-��w�&�J%'�/"'*I*�p)(�p�� <)�z"��wy#�I*''H %)�z"��'*)lJ%�pu�J�u8'H %'H�(#�2p�p�H#�&(�p�� 4#.C%�p2p�}&��x��w�&cJ%'l2}'H#8�( %�! %u�3$#�I*J%�: %'.�l>5J%'��p/"'H#-C�'HJ%�: %/4&�J%'�)�z"7"7"�%��&�D� '*I*&(�%�� %'*&Fs����n9Rs5#�)�7"��' � �p��z%)(2p�]�!3$7%2p'H34'H 8&('*/]wi����&�J%'$��'*)(&����pI*&('*/�IH#�)('4s-J%'H��'$&�J%'$&��n#��! %�: %u/Z#�&�#rIH#. �C"'�)('H7"#.�(#�&('*/�s5�p&�J%�%z%&R'H�(������)X���]']J%'H��'�'*��&('H %/�&�J%�p)R��'*)�z%2p&R&(�v %�� 8Dy)('H7"#.�n#.C%2p'&��(#��: %�! %u�/Z#�&�#"�� �pu�Jvu8'H %'H�(#�2p�p�H#�&(�p�� �#.C%�p2p�p&�����w�)�z"7"7�����&�D � '*I*&(���� %'*&Fs����n9.)$z%&(�p2p�p�*�: %u�7��.2p�� %��3~�:#�2��: "7"z%&&��(#8 %)�w��%�(3$#�&(�p�� %)4�p)$/"'H34�% %)(&��(#�&n'*/B�v�]'1#�2p)(�]I*�%3$7"#.��'R&�J%'17�'H��wi���(3$#. %I*'<��w-&�J%'<)�z"7"7"�%��&�D� '*I*&(�%�� %'*&�s����(9,&(� � #8���p��z%)6I*2!#�)()n�}IH#�2"2}'H#8�( %�! %u�#�2pu.�����}&cJ"34)l&�J"#�&l#�2p2"&(�%��9�7"#.��&T�! $#-C�'H %I�J"3�#.�(9)(&�z%/"����w6��7%&(�pIH#�2l�5J"#.�(#�I*&n'H��EG'*I*�.u� %�p&(�p�� L�

 �¡�¢¤£�¥LV�¦�S8§6¨ #�&(&('H�n o��'*I*�8u� %�p&(�p�� LKZ'c©�I*�}'H 8&,2}'H#8�( %�! %uq#�2pu.�%���p&�J"34)HKZ %'Hz"�n#�2� %'*&�s����(98)HK"�(#XD/"�!#�2�C"#�)n�})@wyz" %I*&(�p�� RI*2!#�)()n�ª{Z'H��)HKN7��.2p�� %�%34�!#�2�I*2!#�)n)(�|{Z'H��)H�« ¬®­5¯®°B±�²�³�´T¯Nµ(±�­¶ ����'5&cJ"#. o·8¸��8'H#.��)�#�u.��E��%=4�%¹l�})cJ%'H�-º|»*¼B)�z%u8u.'*)(&('*/�&�J%'@{N��)(&�#�2}u8�����p&�J"3½w��%��7"#�&(&('H�( $��'*I*�.u�D %�p&(�p�� L� � '4I*�� %)(�p/"'H��'*/�#R34��/"'*2T��w�&Fs��o %�%�(3$#�2T/"�p)(&����!C"z%&('*/�7���7"z%2!#�&(�p�� %)HKT¾v¿�À +*Á*Â�+Fà #. %/¾v¿�À ; Á* ; à ��w�Ä]/"�!34'H %)n�}�% "#�2 � '*I*&(����)-Å�s5�p&�J�34'H#. � '*I*&(����)xÀ + #. %/�À ; #. %/�I*��D � #.���:#8 %I*'3$#�&c���pI*'*)  + #8 %/  ; KÆ#8 %/])�J%��s�'*/]&�J"#�&�&�J%'���7%&(�!3$#�2�¿iÇ�#H�8'*)(�!#. à )(�.2!z%&(�p�� r�p)~#�È�z"#�/Z�(#�&(�pI/"'*I*�p)(�p�� �w�z" %I*&n�}�% L�É )(È ¿�Å ÃlÊ )(�pu� 1ËXÌÍ ¿�ÅqÎ�À +�Ã�ÏTÂ$Ð ++ ¿eÅRÎvÀ +(à Î�ÌÍ ¿eÅRÎvÀ ;cÃ�ÏTÂ$Ð +; ¿eÅRÎvÀ ;�ÃBÑ 2! �Ò Â ; ÒÒ Â + ÒÔÓ�Õ ¿ Ì Ã�y r&�J%'�IH#�)('$s,J%'H��'  + ÊÖ ; Ê× &�J%'oÈ"z"#�/Z�(#�&(�pI$/"'*I*�p)(�p�� rw�z" %I*&n�}�% �¿ Ì Ã /"'*u.'H %'H�(#�&('*)4&(�1#2p�! %'H#.��wyz" %I*&(�p�� L�É 2}�! ¿�Å ÃlÊ )(�pu� Ë ¿�À + Î�À ;*Ã Ï Â Ð + Å1Î ÌÍ ¿�À Ï +  Р+ À + Î�À Ï ;  Р+ À ;cÃ Ó Õ ¿ Í Ã>¤�-'*)(&n�:3�#�&('6&cJ%'6È"z"#�/Z�(#�&(�pI�/"'*I*�p)(�p�� �w�z" %I*&n�}�% ��% %'GJ"#�)�&(�@/"'*&('H�(34�! %'$Ø8Ù}Ø�ÚZÛcÜ; w���'*'�7"#.�(#.3~'*&('H��)H�>¤�R'*)(&(�!3$#�&('4&�J%'42p�! %'H#.�,wyz" %I*&(�p�� ��% %2}�]Ärwy��'*'$7"#.�(#.3~'*&('H��)�J"# � '�&n�qC�'$/"'*&('H�(34�! %'*/B�$�� �&�J%'ÝiÞ�ß(à*áeâªã�äæåHçéènêéäcëTì!í�înï*ðeï�ñcò(óeô�ñ*î(òXõBö"óeã�ßnâø÷!ëLùFènúiâøêHê�ß�û¤úyä�üiä�ßnúyùiç�õ ß�áiá�õ ù�è�ãý Þ�ß(à*áeâªã�äæåHçéènêéäcëTìþî(ï*ÿcðeï�����óeï�ï���ÿ õBö"óeã�ßnâø÷!ë��c÷ªß���û¤úyäFüiä�ßnúyùiç%õ ßcáiá�õ ù�è�ã

Ì

References

UPCLab2013
Typewritten text
REF [3]
Page 30: A quick introduction to Statistical Learning Theory and Support Vector Machines

output from the 4 hidden unitsweights of the 4 hidden units

dot−products

weights of the 5 hidden units

dot−products

dot−product

perceptron output

α 1, ... ,α

1

input vector, x

5

5

weights of the output unit,

z , ... , zoutput from the 5 hidden units:

¹l�pu�z"��' Ì �B= )(�!3$7%2p',w�'*'*/%Diwi����s5#.��/�7"'H��I*'H7%&����� $s@�}&cJ����: "7"z%&,z" %�}&n)HK Í 2!#X�.'H��)6��w¤J%�p/"/"'H <z" %�}&n)HK#. %/ Ì �%z%&�7"z%&-z" %�p&H��>5J%'�u��(#X�8Dy)�J"#�/"�! %u$��wl&�J%' � '*I*&(�%��'H .&c���p'*)���'��Z'*I*&n)5&�J%'*�!�- %z"34'H���pI � #�2!z%'.�IH#�)('�s-J%'H��'$&�J%'$ %z"34C"'H����w���C%)('H� � #�&(�p�� %)x�p)�)c3$#�2p2-¿e)�#X�12p'*)()x&�J"#. Ì ¸�Ä ; à '*)n&(�!3$#�&(�! %u��"¿iÄ ; Ã7"#.�(#834'*&('H��)��})5 %�.&5��'*2p�!#.C%2p'.��¹l�})cJ%'H��&�J%'H��'cwi����'-��'*I*��3$34'H %/"'*/BK"' � 'H $�! �&�J%',IH#�)('5��w  +��Ê ; K&(�oz%)n'�&�J%'�2p�: %'H#8�,/"�})nIH���!34�! "#�&(���Gw�z" %I*&n�}�% �¿ Í Ã s5�p&�J  ��wl&�J%',wi���(3R� Ê�®Â + Ñ ¿ Ì Î �"Ãc ; Á ¿� Ãs-J%'H��' � �})R)n��34'1I*�% %)(&�#. 8& Û ��¹l�p)�J%'H�R#�2p)(�v��'*I*��3$3~'H %/"'*/�#]2p�! %'H#.��/"'*I*�p)(�p�� �wyz" %I*&(�p�� �wi���&�J%'5IH#�)('5s-J%'H��'@&�J%'5&�s��-/"�p)(&����:C"z%&n�}�% %)-#.��'5 %�.&6 %���n3$#�2i�l=,2}u8�����p&�J"34)æw����l7"#�&(&('H�( 4��'*I*�.u% %�}&n�}�% s�'H��'l&�J%'H��'cwi����'Twy����3Ö&�J%' � 'H����C"'*u.�! " %�! %u�#�)()(�%I*�:#�&('*/�s5�p&�J�&cJ%'lI*�� %)(&��(z%I*&n�}�% ���w"2p�: %'H#8�B/"'*I*�p)(�p�� )�z"��wy#�I*'*)H��� Ì � · Í E��.)('H %C%2!#�&(&�º Ì8Ì ¼T'*��7%2p����'*/]#4/"���B'H��'H 8&-98�: %/���wl2p'H#.�( %�! %uR3$#�I*J%�: %'*)X��7�'H��I*'H7%&����� %)���$ %'Hz"�n#�2- %'*&�s����(98)H��>5J%'�7"'H��I*'H7%&c���� vI*�% %)(�p)(&()$��w�I*�� " %'*I*&('*/� %'Hz"���� %)HK�s-J%'H��'<'H#�I*Jv %'Hz8D���� r�!3$7%2p'H34'H 8&()o#<)('H7"#.�(#�&(�! %u J8�"7"'H�(7%2!#. %'.K�)(�<&�J%'17"'H��I*'H7%&����� �#�)$#<s-J%�.2p'R�!3$7%2p'H34'H .&n)o#7%�p'*I*'*s5�p)('42p�: %'H#8�,)('H7"#.�(#�&n�: %u�)�z"��wy#�I*'.�6O%'*'�¹l�pu�z"��' Ì �>@J%'o7"����C%2p'H3 ��w�{N %/"�! %u�#. r#�2pu.�����p&�J"3 &�J"#�&�3~�: %�!34�p�*'*)�&�J%'�'H�(�������� ]#R)n'*&���w � '*I*&(����)C8��#�/��(z%)(&(�! %u�#�2p2�&�J%'Rs�'*�pu�J.&n)���wG&cJ%'1 %'*&Fs����n91s5#�)$ %�.&xwi��z" %/��: vEG�.)('H %C%2:#�&(& �ø)4&(�!34'.K�#. %/E��.)n'H �C%2!#�&(&@)�z%u.u.'*)(&n'*/q#4)nI�J%'H34'�s,J%'H��'��� %2p�o&cJ%'�s�'*�pu�J8&()���w�&cJ%'���z%&�7"z%&5z" %�p&�#.��'�#�/Z#87%&(� � '.�=,I*I*����/"�! %u�&(�1&�J%'~{Z��'*/�)('*&(&(�! %u<��w5&�J%'$�.&�J%'H��s�'*�}u%J.&()x&�J%'$�! "7"z%& � '*I*&(����)�#.��'� %�� 8Dy2p�: %'H#8��2p�&��(#8 %)�w��%�(34'*/1�! 8&(��&�J%'xw�'H#�&cz"��'�)�7"#�I*'.K���KL��w6&cJ%'�2!#�)n&-2!#X�.'H�,��w�z" %�p&()H���� �&�J%�p)�)�7"#�I*'4#�2}�! %'H#.�/"'*I*�p)(�p�� �w�z" %I*&n�}�% R�p)-I*�% %)(&��(z%I*&('*/B�� ¿eÅ Ã�Ê )n�}u% �������� ����� ¿�Å Ã! ¿#" ÃC8�q#�/��(z%)(&(�! %u$&cJ%'�s�'*�pu�J8&()$� � w����%3 &cJ%'&%eDy&�JRJ%�p/"/"'H �z" %�p&5&(�$&�J%'��%z%&�7"z%&5z" %�}&,)(�o#�)�&n�~3~�: %�|D34�p�*'4)(��34'�'H�n�����,34'H#�)�z"��'x� � 'H��&�J%'x&��(#��! %�! %uo/Z#�&�#"��=,)-#o��'*)�z%2p&-��w�EG�.)('H %C%2:#�&(& �ø)�#.7"7"����#�I�JLK'!( çéälènåéáeâªã�ßn÷ZùFè*ä*)5ù�âªä�êHá�+!ènú-,/.Lßcü0+!è21Xê �@âøê�áeçéälüeâ43*áeâªä�ü65 ÿ87þõ

Í

References

Page 31: A quick introduction to Statistical Learning Theory and Support Vector Machines

I*�� %)(&c�(z%I*&(�p�� <��w�/"'*I*�p)(�p�� ��(z%2p'*)�s5#�)�#�u�#��: �#�)n)(��I*�!#�&n'*/]s5�p&�J<&�J%'4I*�� %)(&��nz%I*&(�p�� ���w�2}�! %'H#.�xJ.�8D7"'H�n7%2:#8 %'*)��: �)(��34'�)c7"#�I*'.�=� �#�2pu.�����p&�J"3 &cJ"#�&$#�2p2}��s5)~w����x#�2p25s�'*�pu�J.&n)���w-&�J%'1 %'Hz"�(#�25 %'*&�s����(9�&n� #�/Z#.7%&4�: ��%��/"'H�2p��IH#�2}2p��&(��3~�: %�!34�p�*'�&�J%'1'H�n�����4�� �#])n'*&o��w � '*I*&(�%��)qC�'*2p�� %u.�! %uv&(�v#�7"#�&n&('H�( v��'*I*�.u% %�}&n�}�% 7"����C%2p'H3 s5#�)xwi��z" %/v�! Ì � �.·rº Ì Í K Ì %K Ì ¸"K �X¼5s-J%'H �&cJ%'qC"#�I*9�DF7"����7"#�u�#�&(�p�� �#�2}u8�����p&�J"3 s5#�)/"�p)(I*� � 'H��'*/B�1>@J%'o)(�82:z%&n�}�% ��: � �.2 � '*)$#�)(2p�}u%J.&$34�%/"�|{ZIH#�&(�p�� ���w5&�J%'o3�#�&�J%'H3$#�&n�}IH#�2�34�%/"'*26��w %'Hz"���� %)X�5>5J%'H��'cwi����'.KL %'Hz"�(#�2l %'*&Fs����n9.)��!3$7%2p'H34'H 8&���7%�p'*I*'cDys5�p)('42p�! %'H#.��Dy&���7�'��1/"'*I*�p)(�p�� �wyz" %IcD&(�p�� %)H��� R&�J%�p)�#.��&(�pI*2p'�s�'�I*�� %)n&��(z%I*&-#$ %'*s&���7�'���wl2p'H#.�( %�! %uR3$#�I*J%�: %'*)XKZ&�J%'x)(��DyIH#�2p2p'*/�)�z"7"7"�%��&�D� '*I*&(�%�� %'*&�s����(9®� >5J%'R)cz"7"7"����&�D � '*I*&(���x %'*&Fs����n9��!3$7%2p'H34'H .&n)$&�J%'�w��.2p2p��s5�! %u]�p/"'H#"���p&~3�#.7%)&�J%'x�: "7"z%& � '*I*&(�%��)5�! .&n�o)(�%34'�J%�pu�JR/"�!34'H %)(�p�� "#�2Tw�'H#�&cz"��'�)�7"#�I*' �&�J"����z%u�J�)n��34'� %�� 8Dy2p�! %'H#.�3$#.7"7%�! %u�I�J%�8)('H �#]7"���}�%���i���� v&�J%�p)�)�7"#�I*'1#�2p�! %'H#.��/"'*I*�p)(�p�� �)�z"��wy#�I*'R�p)oI*�% %)(&��(z%I*&('*/�s5�p&�J)�7�'*I*�:#�2Æ7"���%7"'H��&(�p'*),&�J"#�&�'H %)�z"��'4J%�pu�JRu.'H %'H�n#�2p�}�H#�&(�p�� �#.C%�p2}�p&��R��wl&�J%'� %'*&�s����(9®�� Å W À��� ¡L§ >B����C%&c#��! v#1/"'*I*�p)(�p�� �)�z"�Fw�#�I*'�I*���(��'*)�7"�� %/"�! %u]&n� #�7��.2p�� %��3~�:#�2-��w,/"'*u���'*'&�s��ZK.�� %'xIH#. �IH��'H#�&n'�#xw�'H#�&�z"��'�)�7"#�I*'.K ��K"s,J%�}I*J1J"#�)-¾ Ê Ø.ÙpØéÚ®Û�Ü; I*�%����/"�! "#�&('*)5��wÆ&�J%',wi���(3��� + Ê� + Á ÕXÕHÕ Á � Ø Ê� Ø Á Ä1I*�%����/"�! "#�&n'*) Á� Ø�Ú +�Ê�

; + Á ÕXÕHÕ Á � ; Ø Ê�;Ø Á Ä<I*������/"�: "#�&('*) Á� ; Ø�Ú + Ê� + ; Á ÕXÕHÕ Á � � Ê� Ø Ø Ð + Á Ø8Ù}Ø Ð + Ü; I*�%����/"�! "#�&('*) Ás-J%'H��'�Å Ê ¿ + Á ÕHÕHÕ Á Ø Ã ��>5J%'�J.�"7"'H�n7%2:#8 %'��p),&�J%'H �I*�� %)(&c�(z%I*&('*/1�! �&�J%�p)5)�7"#�I*'.�>Gs��$7"���%C%2}'H3~)�#.���p)('��! �&�J%'�#.C�� � '�#.7"7"����#�I*JL�l�� %'�I*�% %I*'H7%&�z"#�2l#. %/1�� %'x&('*I*J" %�}IH#�2��,>5J%'I*�� %I*'H7%&�z"#�2N7"����C%2p'H3 �p)5J%�és�&(�,{N %/R#�)('H7"#.�n#�&(�! %u�J8��7�'H�(7%2!#. %'�&�J"#�&ls@�}2p2Lu.'H %'H�(#�2}�p�*'�s�'*2p2i�l&�J%'/"�!34'H %)(�p�� "#�2p�p&��1��wl&�J%'�w�'H#�&cz"��'�)�7"#�I*'xs5�p2}2lC"'4J�z%u8'.KZ#. %/ %�8&-#�2p2�J8��7�'H�(7%2!#. %'*)�&�J"#�&�)('H7"#.�n#�&('&�J%'4&��(#��: %�! %uR/Z#�&c#$s5�p2p2� %'*I*'*)n)�#.���p2p��u.'H %'H�(#�2p�p�*'$s�'*2p2��%��>5J%'4&('*I*J" %�pIH#�267"����C%2p'H3×�p)�J%��sI*��3xD7"z%&�#�&n�}�% "#�2p2}��&(�R&c��'H#�&,)�z%I�JrJ%�}u%J8Dy/"�:3~'H %)(�p�� "#�2�)�7"#�I*'*)H�,&(�RI*�% %)(&��(z%I*&�7��.2p�� %��3~�:#�26��w�/"'*u���'*'"1�%�����! v# Í ¸.¸�/"�!34'H %)(�p�� "#�25)c7"#�I*'o�p&$3$#X�]C"'1 %'*I*'*)()c#.���]&(�<I*�� %)(&��nz%I*&�J8��7�'H�(7%2!#. %'*)o�! �#C%�p2p2}�p�� /"�:3~'H %)(�p�� "#�2¤w�'H#�&cz"��'�)�7"#�I*'.�>@J%'�I*�� %I*'H7%&�z"#�2l7"#.��&5��wÆ&�J%�p)�7"����C%2p'H3�sG#�)5)(�.2 � '*/<�: Ì � ·���º Ì "X¼Bwi����&�J%'xIH#�)('���w5`c_Nb��������

���n_�gXai_����8jZg*\@w��%�5)('H7"#8�(#.C%2p'$I*2!#�)()('*)X��=� ���7%&(�!3$#�2ÆJ8��7�'H�(7%2!#. %'��})4J%'H��'4/"'c{N %'*/ #�)�&�J%'42}�! %'H#.�/"'*I*�p)(�p�� Öw�z" %I*&(�p�� s5�p&�J�3�#����!3$#�2�3$#.��u.�! �C�'*&�s�'*'H Ö&�J%' � '*I*&(�%��)R��w�&cJ%']&�s���I*2!#�)()('*)XK�)('*'¹l�pu�z"��' Í ���i&,s5#�),��C%)('H� � '*/�&�J"#�&-&(��I*�� %)(&c�(z%I*&-)cz%I�J<��7%&(�!3$#�2ÆJ8��7�'H�(7%2!#. %'*)��% %'��% %2}��J"#�)�&(�&�#.9�'$�! .&(�]#�I*I*�%z" .&$#R)c3$#�2p25#.34��z" 8&���w,&�J%'o&c�(#��! %�! %u]/Z#�&�#"K¤&�J%'�)(��IH#�2p2}'*/ \�^ _._"`.a�b�fXg�hXbi`.an\�Ks-J%�pI*J]/"'*&('H�n34�! %'$&�J%�p)�3$#8��u.�! L�o��&�sG#�)�)�J%��s- �&�J"#�&��|w5&�J%'$&c�(#��! %�! %u � '*I*&(�%��)�#.��'4)('H7"#.�n#�&('*/s5�p&�J%��z%&�'H�(������)�C.�v#. ���7%&(�!3$#�2-J.�"7"'H�n7%2:#8 %'1&�J%'<'*��7�'*I*&�#�&(�p�� � #�2!z%' ��w�&�J%'17"����C"#.C%�p2p�}&��v��wI*��3$3~�}&n&(�! %u~#8 4'H�(�����æ�� $#�&('*)(&6'*�"#.3$7%2p',�p)GC���z" %/"'*/1C8��&cJ%'��(#�&(�p��C"'*&�s�'*'H �&�J%'5'*�"7"'*I*&�#�&n�}�% � �Bä�ùFßn÷ø÷"!%âªüeçéä�ú$# ü�ù�ènêéù�ä�úiêéüTß$%.è 1éáTüiã�ßn÷ø÷"ß�ã�è 1HêHáiüTè2+ �Xß�áiß�ßnê ��áeçéä'&�1�ß�� úyß�áeâªù � âªüiùyúiâªã�âøê�ß(êHá + 1Xêéù�áeâªènê�õ

References

Page 32: A quick introduction to Statistical Learning Theory and Support Vector Machines

optimal margin

optimal hyperplane

¹l�pu�z"��' Í �-=� �'*�"#.3$7%2p'���wG#�)('H7"#.�(#.C%2p'o7"����C%2p'H3��! �# Í /"�!34'H %)(�p�� "#�2�)�7"#�I*'.�o>@J%'$)�z"7"7"�%��&� '*I*&(�%��)HK"3$#.�(9�'*/os5�p&�J�u���'*��)(È"z"#.��'*)HK%/"'c{N %'�&cJ%'�3$#.��u.�! ���wl2!#.��u.'*)n&5)('H7"#.�(#�&(�p�� RC"'*&�s�'*'H �&�J%'&�s���I*2!#�)n)('*)H�� #�2:z%'x��wl&�J%'� �z"34C�'H�5��wl)�z"7"7�����& � '*I*&n����)-#. %/R&cJ%'� �z"34C"'H�@��w�&c�(#��! %�! %u � '*I*&n����)H�

� º ¨ �*¿e'H�(���%� à ¼�� � ºª �z"34C�'H�6��wÆ)�z"7"7�����& � '*I*&(����)�¼ %z"34C"'H����wÆ&��n#��! %�: %u � '*I*&(�%��) Õ ¿�� Ã� �.&('.K�&�J"#�&4&�J%�p)qC���z" %/�/"�%'*)o %�.&�'*��7%2p�pI*�p&(2p��I*�% .&�#��: �&cJ%'R/"�:3~'H %)(�p�� "#�2p�p&F����w�&�J%'R)�7"#�I*'R��w)('H7"#.�n#�&(�p�� L����&Bwi�.2p2p��s5)Bwy����3Ö&�J%�p)ÆC���z" %/BK�&�J"#�&B�|wZ&�J%'���7%&(�!3$#�2%J.�"7"'H�(7%2!#. %'@IH#. �C"'�I*�� %)(&c�(z%I*&('*/wy����3�#@)�3$#�2p2" �z"34C�'H�B��wZ)cz"7"7"����& � '*I*&(�%��)Æ��'*2:#�&(� � '6&(�,&�J%'�&��(#��! %�! %u�)('*&B)(�p�*'�&�J%'�u.'H %'H�(#�2}�p�H#�&n�}�% #.C%�p2p�p&F��s5�p2p2�C�'�J%�pu�J�� ' � 'H ��! �#. ��! 8{N %�}&n'�/"�!34'H %)(�p�� "#�2�)�7"#�I*'.� �y O%'*I*&(�p�� ��s�'�s5�p2}2/"'H34�� %)n&��(#�&('�&�J"#�&�&�J%'R�(#�&(�p�v¿�� à wi����#<��'H#�262p�|wi'q7"���%C%2}'H3~)�IH#8 �C�'q#�)x2p�és #�)�¸��ª¸ R#. %/�&�J%'��7%&(�!3$#�2�J.�"7"'H�n7%2:#8 %'�u.'H %'H�(#�2}�p�*'*)�s�'*2p2B�! 1#$C%�p2p2p�p�� �/"�!34'H %)(�p�� "#�2¤wi'H#�&�z"��'�)�7"#�I*'.�A '*& £���� Ñ � � Ê ¸C"'x&�J%'��%7%&(�!3$#�2�J8��7�'H�(7%2!#. %'4�! �wi'H#�&�z"��'x)�7"#�I*'.���]'�s5�p2p2l)�J%��s�K%&�J"#�&5&�J%'xs�'*�}u%J.&() £�� w��%��&�J%'��7%&(�!3$#�2ÆJ8��7�'H�(7%2!#. %'��: &�J%'xw�'H#�&cz"��'�)�7"#�I*'$IH#. �C"'4s-���p&(&n'H #�)�)(��34'42p�! %'H#.��I*��34C%�! "#�&(�p�� <��w)�z"7"7�����& � '*I*&n����) £�� Ê �)�z"7"7�����& � '*I*&n����) � � � Õ ¿i· Ã>5J%'�2p�! %'H#.�,/"'*I*�p)(�p�� �w�z" %I*&n�}�% � ¿ à �! R&�J%',wi'H#�&�z"��'�)c7"#�I*'�s5�p2p2l#�I*I*����/"�! %u.2p�1C"'x��wl&�J%',w��%�(3R�� ¿ ÃlÊ )n�}u% �� �)�z"7"7"����& � '*I*&(����) � � � �� Ñ � ���� Á ¿i» Ãs-J%'H��' � �� �p)5&�J%'�/"�.&�DF7"����/Zz%I*&5C�'*&Fs�'*'H �)�z"7"7�����& � '*I*&(�%��) � #8 %/ � '*I*&n��� �! �w�'H#�&cz"��'-)c7"#�I*'.�>5J%'�/"'*I*�p)(�p�� �wyz" %I*&(�p�� RIH#. �&�J%'H��'cwi����'�C�'�/"'*)(IH���!C�'*/ #�)G#~&Fs��42!#X�.'H�5 %'*&�s����(9®��¹l�pu�z"��'&"�"

References

Page 33: A quick introduction to Statistical Learning Theory and Support Vector Machines

non−linear transformation

1w iw jw Nw

z isupport vectorsin feature space

input vector in feature space

xinput vector,

classification

¹l�pu�z"��' "�6��2!#�)()n�ª{ZIH#�&(�p�� 1C8�1#4)�z"7"7�����&�D � '*I*&n���5 %'*&�s����(94��w6#. Rz" "9% %��s- 17"#�&n&('H�( ��p)-I*�% %I*'H78D&�z"#�2p2p��/"�� %'�C.�~{N��)(&�&��(#. %)�w����n34�! %u�&cJ%'�7"#�&n&('H�( ��: 8&(�$)n��34'�J%�}u%J8Dy/"�:3~'H %)(�p�� "#�2¤w�'H#�&cz"��'-)c7"#�I*'.�=� ���7%&(�!3$#�2�J.�"7"'H�n7%2:#8 %'oI*�� %)n&��(z%I*&('*/]�! r&�J%�p)�wi'H#�&�z"��'�)�7"#�I*'$/"'*&('H�n34�! %'*)�&cJ%'$��z%&�7"z%&H��>5J%')(�!34�p2!#.���p&��R&(�o#4&�s���Dy2!#X�.'H��7"'H��I*'H7%&c���� �IH#. <C"'�)n'*'H RC.�RI*��3�7"#.���p)(�� �&(�4¹l�pu�z"��' Ì �

References

Page 34: A quick introduction to Statistical Learning Theory and Support Vector Machines

α1 αi α j

k

αs

u ju iu1 usku = K( , )

x

x

x x

input vector,

vectors,support

Lagrange multipliers

classification

k

comparison

¹l�pu�z"��' "®� �G2:#�)()(�|{ZIH#�&(�p�� ���w~#. �z" "9% %��s- �7"#�&(&('H�n �C.��#�)�z"7"7"�%��&�D � '*I*&(�%�R %'*&Fs����n9Z� >5J%'7"#�&(&n'H�( <�})4�! ��! "7"z%&4)�7"#�I*'$I*�%3$7"#.��'*/]&n�R)�z"7"7"����& � '*I*&(�%��)H�4>5J%'o��'*)�z%2p&(�! %u � #�2:z%'*)$#.��'o %�� 8D2p�! %'H#.��2p�4&��(#. %)�wi���(3~'*/B�L=v2}�! %'H#.�¤wyz" %I*&(�p�� ���wZ&�J%'*)('�&��(#. %)�wi���(3~'*/ � #�2!z%'*)l/"'*&('H�(3~�: %'�&�J%'6�%z%&�7"z%&��wl&�J%'�I*2!#�)()(�|{Z'H�*�� ��s�' � 'H�*KL' � 'H ��|w-&cJ%'$��7%&(�!3$#�2�J.�"7"'H�n7%2:#8 %'ou.'H %'H�n#�2p�}�*'*)�s�'*2p2�&�J%'$&('*I*J" %�pIH#�257"����C%2p'H3���wJ%��s�&(�x&���'H#�&�&�J%'�J%�}u%J�/"�:3~'H %)(�p�� "#�2®w�'H#�&�z"��',)�7"#�I*'���'H3$#��! %)H�6�� Ì � � Í �p&�s5#�)6)cJ%�és, Rº4 ¼FK%&�J"#�&&�J%'�����/"'H�T��wN��7�'H�(#�&(�p�� %)Twi���¤I*�� %)(&��(z%I*&n�: %u4#,/"'*I*�p)(�p�� xwyz" %I*&(�p�� 4IH#. $C�'��: 8&('H��I*J"#. %u.'*/B�T�! %)(&('H#�/��wZ3$#.98�! %u�#5 %�� 8Dy2p�! %'H#.�l&��n#. %)�wi���(3$#�&n�}�% 5��wZ&cJ%'6�! "7"z%& � '*I*&(����)¤w��82}2p��s�'*/�C8��/"�.&�DF7"���%/Zz%I*&()Ts5�p&�J)�z"7"7�����& � '*I*&(�%��)5�! <wi'H#�&�z"��'x)�7"#�I*'.K®�� %'4IH#. �{N��)(&,I*��3$7"#.��'x&�s�� � '*I*&(����),�! <�: "7"z%&x)�7"#�I*'R¿iC.�'.�øuZ�4&c#.9.�! %uR&�J%'*�!��/"�.&�DF7"���%/Zz%I*&����,)n��34'4/"�p)(&�#. %I*'R34'H#�)�z"��' à KB#. %/]&�J%'H r3$#.9�'$#R %�� 8Dy2p�! %'H#.�&��(#8 %)�w��%�(3$#�&(�p�� ,��wN&�J%' � #�2:z%'@��wZ&�J%'5��'*)�z%2p&H�lO%'*'5¹l�pu�z"��'6"Z�T>5J%�p)6'H "#.C%2p'*)æw����¤&�J%'�I*�� %)(&c�(z%I*&(�p�� ��wN���pI*J�I*2!#�)()n'*)l��w�/"'*I*�p)(�p�� ~)�z"��wy#�I*'*)HK w����¤'*�"#.3$7%2p'�7��.2p�� %��3~�:#�2"/"'*I*�})n�}�% 4)�z"��wy#�I*'*)T��w�#8�(C%�p&��(#.���p2p�/"'*u���'*'.�æ�]'�s5�p2p2BIH#�2p2T&�J%�p)5&���7�'���wT2}'H#8�( %�! %u13$#�I*J%�: %'�w��%�6)cz"7"7"����&�D � '*I*&(����), %'*&Fs����n9��X�>@J%'-&('*I*J" %�pÈ�z%'���w )cz"7"7"����&�D � '*I*&(���� %'*&�s����(98)lsG#�)6{N��)(&6/"' � '*2p��7�'*/Rwi���l&cJ%'���'*)n&����pI*&('*/�IH#�)('��w5)('H7"#.�n#�&(�! %u�&��(#��! %�! %u�/Z#�&�#Rs@�}&cJ%��z%&4'H�(������)X�q�� ]&cJ%�})$#.��&(�pI*2}'Rs�'$'*�%&('H %/�&�J%'R#.7"7"����#�I*J���w)�z"7"7�����&�D � '*I*&n���6 %'*&�s����(98)6&(�xI*� � 'H�ls-J%'H �)('H7"#.�n#�&(�p�� �s5�p&�J%��z%&�'H�(�����æ�� 4&�J%',&��(#��: %�! %u � '*I*&(����)���$âªáeçoáeçXâªülê�ß�ã�ä/.Lä�ä�ã6åXçéßcüeâ���ä5ç�è2.]ù�ú�1éù�âªßn÷ áeç�ä�â �Hä�ß�è +Bä 3cå�ß(ê �*âøê���áeç�ä5üièn÷ 1éáeâªènêoè(ê$ü#1XåHå.è(úyá ��ä�ù�áiènúyüâªü6+!ènú�áeçéä�üiä,÷ªä�ßnúiêHâøê��ã�ßcùiçXâøêéä�ü�õ�:êoáeç�ä-ü#1XåHå.è(úyáeó ��ä�ù�áiè(úyü@÷ªä�ß(úiêXâøê�$ßn÷��cè(úiâªáeç�ã äcõ �XõZáeçéä,ù�è�ã�åX÷ªä 3câªáeà�è2+láeçéäù�è(ê�üiáeú�1éù�áeâªènê �Hè*ä�üLêéècá��Hä�å.ä�ê ��ènê-áeçéä �*âªã�ä�êéüeâªènêéßn÷ªâpáþà�è +�áeç�ä�+!ä�ßcá#1HúyäTüeåéßcùFä � % 1�áNènê�áeçéä¤ê�1éã'%.ä�ú�è +�ü#1XåHå.ènúyá

�cäFù�áiènúyü�õ·

References

Page 35: A quick introduction to Statistical Learning Theory and Support Vector Machines

�p)4�:3�7"�.)()n�:C%2p'.�<���}&cJ]&�J%�p)x'*��&('H %)n�}�% ]s�'$I*�� %)n�}/"'H�~&�J%'$)�z"7"7�����&�D � '*I*&n���� %'*&�s����(98)�#�)�#< %'*sI*2!#�)()x��w�2p'H#.�( %�! %u]3$#�I*J%�! %'.Kl#�)�7"��s�'H��wyz%26#. %/�z" %� � 'H��)�#�2�#�)� %'Hz"�(#�2� %'*&�s����(98)H���� �O%'*I*&n�}�% ��s�'5s5�p2p2L/"'H34�� %)(&��n#�&('5J%��s�s�'*2p2Z�p&�u.'H %'H�(#�2p�p�*'*)�wi���lJ%�pu�J$/"'*u%��'*'�7��.2p�� %�%34�!#�2L/"'*I*�})n�}�% �)�z"��wy#�I*'*)¿iz"7<&(������/"'H��» à �! ]#RJ%�pu�J�/"�!34'H %)(�p�� "#�26)�7"#�I*'1¿e/"�!34'H %)n�}�% Í �.· à ��>@J%'~7�'H��wi���(3$#. %I*'4��w�&�J%'#�2pu.�����}&cJ"3 �p)$I*�%3$7"#.��'*/v&(��&�J"#�&���w-I*2!#�)n)(�pIH#�252p'H#.�( %�! %uv3$#�I*J%�: %'*)�'.�øuZ�T2p�: %'H#8�$I*2!#�)()n�ª{Z'H��)HK�9�D %'H#.��'*)n&G %'*�pu�J%C"�%��)�I*2:#�)()(�|{Z'H�*KZ#. %/R %'Hz"�(#�2L %'*&�s��%�(9.)X�ÆO%'*I*&n�}�% Í K�"K%#. %/ "~#8��'5/"' � �.&n'*/o&(�x&�J%'3$# �F���¤7"�.�! 8&()l��wZ&�J%'�/"'H��� � #�&(�p�� ~��wZ&�J%'5#�2pu.�����p&�J"3#. %/$#@/"�})nIHz%)()(�p�� 4��wZ)n��34'6��wZ�}&n)67"����7"'H��&(�p'*)H�� '*&�#��p2p),��w�&�J%'x/"'H��� � #�&(�p�� ��p)���'*2p'*u�#�&('*/1&(�$#. R#87"7"'H %/"�p�N�

»

References

Page 36: A quick introduction to Statistical Learning Theory and Support Vector Machines

� ��� ¯Nµ�������� �� ° � ���,­ ���y Ö&�J%�p)1)('*I*&(�p�� �s�'���' � �p'*s�&�J%'�34'*&cJ%��/���w4��7%&(�!3$#�2�J8��7�'H�(7%2!#. %'*)vº Ì "X¼�wi���$)n'H7"#.�(#�&(�p�� ���w&��(#��: %�! %u$/Z#�&�#�s5�p&�J%�%z%&5'H�(������)H���� �&�J%'� %'*��&�)('*I*&n�}�% �s�',�! .&c����/Zz%I*'�#4 %�.&n�}�% $��wT)(��wi&�3$#.��u.�! %)HK&�J"#�&�s5�p2p2l#�2p2}��s w��%��#8 R#. "#�2p��&n�}Ix&���'H#�&c34'H .&���wl2p'H#.�( %�! %uRs5�p&�J�'H�(������)5�� �&cJ%'�&��(#��! %�! %uo)n'*&H������ ����������! #"%$'&�(*)����,+!��&�$.-��0/1&32�45+! 3�!��">5J%'�)('*&@��wÆ2!#.C�'*2p'*/ &c�(#��! %�! %uq7"#�&n&('H�( %)¿36 + Á Å + Ã*Á ÕHÕHÕ Á ¿3687 Á Å97 Ã,Á 6 ��:1; Î Ì Á Ì=< ¿�� Ã�p)�)�#��}/r&(�1C"'$2p�! %'H#.��2p�])n'H7"#.�(#.C%2p'$�|w5&�J%'H��'4'*�%�})n&()�# � '*I*&n��� £ #. %/]#�)(IH#�2:#8� � )cz%I�J�&cJ"#�&�&�J%'�! %'*È�z"#�2}�p&(�p'*) £ � Å � Ñ � > Ì �|w 6 � Ê Ì Á£ � Å � Ñ � � Î Ì �|w 6 � Ê Î Ì Á ¿ � Ã#.��' � #�2p�p/�w��%�5#�2p2�'*2p'H34'H 8&(),��w6&�J%'x&��(#��! %�! %uo)n'*&~¿�� à �6Ç�'*2p�éss�'�s-���}&n'�&�J%'x�: %'*È"z"#�2p�p&(�p'*)q¿ � à �! &�J%',wi���(3@?�� 6 � ¿ £ � Å � Ñ �*ÃA> Ì Á % Ê Ì Á ÕXÕHÕ ÁCB Õ ¿ Ì ¸ Ã>@J%'���7%&(�!3$#�2�J8�"7"'H�(7%2!#. %' £�� � Å Ñ � � Ê ¸ ¿ Ì.Ì Ã�p)�&�J%'$z" %�pÈ"z%'$�� %'$s-J%�pI*J<)('H7"#.�(#�&n'*)�&�J%'4&��(#��! %�! %u1/Z#�&c#$s5�p&�Jr#o3$#����!3$#�263�#.��u.�! L�5�p&x/"'*&('H��D34�! %'*)�&cJ%'$/"�!��'*I*&(�p�� £ED Ò £ Ò s,J%'H��'$&�J%'4/"�p)(&�#. %I*'�C"'*&�s�'*'H �&�J%'o7"�����F'*I*&n�}�% %)���w�&�J%'4&��n#��! %�: %u� '*I*&(�%��)l��wZ&�s��-/"���B'H��'H 8&6I*2!#�)()('*)��p)�3$#��%�!3$#�2iK���'*IH#�2p2"¹l�}u%z"��' Í �T>5J%�p)6/"�p)(&�#. %I*'GF ¿ £ Á �*à �})lu8� � 'H C8� FB¿ £ Á �*ÃBÊ 34�! HJI!K!LNM +PO Å �c£Ò £ Ò Î 3�#��HCI!KQL�M Ð +PO Å �c£Ò £ ÒÕ ¿ Ì Í Ã>5J%'x��7%&(�!3$#�2lJ.�"7"'H�n7%2:#8 %' ¿ £ � Á � � à �p)�&�J%'�#8��u�z"34'H 8&()5&�J"#�&�3$#����!34�p�*'4&�J%'�/"�p)(&c#. %I*'1¿ Ì Í Ã �5��&wi�.2p2}��s5)Gw����%3 ¿ Ì Í Ã #. %/�¿ Ì ¸ à &�J"#�&

FB¿ £�� Á � � ÃlÊ ÍÒ £�� Ò Ê ÍR £�� � £�� Õ ¿ Ì Ã>5J%�p)$34'H#. %)HKT&�J"#�&�&�J%'���7%&(�!3$#�2�J.�"7"'H�n7%2:#8 %'1�p)4&�J%'qz" %�pÈ"z%'R�� %'�&�J"#�&�3~�: %�!34�p�*'*) £ �i£ z" 8D/"'H�,&�J%'4I*�� %)(&c�(#��! .&n)o¿ Ì ¸ à �����% %)(&��(z%I*&(�! %u1#. <��7%&n�:3�#�2ÆJ8�"7"'H�(7%2!#. %'��p)�&�J%'H��'cwi����'$#�È�z"#�/Z�(#�&(�pI7"���.u%�(#.3$34�! %uo7"����C%2p'H3R�SJT ècáiäláeçéßcáLâøê�áeçéäTâøê�ä &�1éßn÷øâªáeâ}ä�ü5ì!í*ðLßnê �xì3UFò*ðLáeçéäæúiâ��(çHáeóÔçéßnê ��üeâ �Hä � % 1éá¤êéècá ��ä�ù�áiènú�V �*âªü¤êéènúyã�ß(÷øâ���ä �.õ

References

Page 37: A quick introduction to Statistical Learning Theory and Support Vector Machines

0l'*I*&n����)5Å � wi����s,J%�}I*J6 � ¿ £ � Å � Ñ �*Ã%Ê Ì s5�p2p26C"'x&('H�(34'*/r\�^ _._"`.a�b�fXg�hXbi`.a(\*���y <=�7"7�'H %/"�p�s�'�)�J%��s�&cJ"#�&5&�J%' � '*I*&n��� £�� &�J"#�&�/"'*&('H�(34�! %'*)�&�J%'x��7%&(�!3$#�2TJ.�"7"'H�(7%2!#. %'4IH#. <C"'4s-���p&(&n'H R#�)#42p�! %'H#.�@I*��34C%�: "#�&(�p�� R��wl&��(#��: %�! %u � '*I*&(����)H�£�� Ê 7� � M + 6 � � �� Å � Á ¿ Ì " Ã

s-J%'H��' � �� > ¸"��O%�! %I*'���� ¸<�� %2p��wi���x)�z"7"7�����& � '*I*&(����)q¿e)('*'�=�7"7"'H %/"�p� à K�&�J%'o'*�"7"��'*)()n�}�% ¿ Ì " à ��'H7"��'*)('H 8&()-#�I*��3$7"#�I*&�w��%�(3 ��w6s-���}&n�: %u £�� ���]'�#�2})n�o)�J%��s�&cJ"#�&5&(�4{N %/<&�J%' � '*I*&(������w7"#.�(#834'*&('H��)/� � � � Ï� Ê ¿�� � + Á ÕXÕHÕ Á � �7 Ã�Á�� %'�J"#�)@&(�$)(�.2 � '�&cJ%'�wi�.2p2}��s5�! %u�È�z"#�/Z�(#�&(�pI�7"���.u%�(#.3$34�! %u$7"����C%2p'H3R�� ¿ � ÃlÊ � Ï�� Î ÌÍ � Ï�� � ¿ Ì � Ãs5�p&�J<��'*)�7"'*I*&@&(� � Ï Ê ¿�� + Á ÕHÕXÕ Á �97 à K")�z"C �F'*I*&5&n�$&�J%'�I*�� %)(&c�(#��! .&n)H�� >�]Á ¿ Ì · Ã� Ï� Ê ¸ Á ¿ Ì » Ãs-J%'H��' � Ï Ê ¿ Ì Á ÕHÕXÕ Á Ì Ã �p)�#8 B Dy/"�!34'H %)(�p�� "#�2oz" %�p& � '*I*&(��� K Ï Ê ¿36 + Á ÕXÕHÕ Á 6 7 à �})�&�J%' B D/"�!34'H %)(�p�� "#�2 � '*I*&(�%����wl2:#8C"'*2p)HKL#. %/ � �p)-#4)n��3$3~'*&����pI B ��B DF3�#�&����p��s5�p&�J�'*2p'H34'H .&n)� � � Ê 6 � 6 � Å � � Å � Á % Á���Ê Ì Á ÕXÕHÕ Á�� Õ ¿ Ì � Ã>5J%'5�! %'*È"z"#�2p�p&F��¿ Ì · à /"'*)(IH���!C"'*)�&�J%'5 %�� " %'*u%#�&(� � '5È"z"#�/Z�(#. 8&H�¤�]'�&�J%'H��'cwi����'5J"# � '�&(��3$#��%�!34�p�*'&�J%'�È"z"#�/Z�(#�&n�}I�w��%�(3�¿ Ì � à �! R&�J%'� %�% " %'*u�#�&(� � 'xÈ�z"#�/Z�n#. .&XK�)cz"C �F'*I*&@&(�$&�J%'�I*�� %)n&��(#��! 8&()�¿ Ì » à ���J%'H ,&�J%'l&c�(#��! %�! %u-/Z#�&c#�¿�� à IH#. �C�'6)('H7"#.�n#�&('*/�s5�p&�J%��z%&¤'H�(������)Ls�'6#�2p)(�5)cJ%�ésr�: �=�7"7�'H %/"�p�= &cJ%'1wi�.2p2p��s5�! %u���'*2!#�&(�p�� %)cJ%�:7 C"'*&�s�'*'H �&�J%']3$#��%�!3�z"3 ��w�&�J%'<w�z" %I*&(�p�� "#�2o¿ Ì � à K@&�J%']7"#��!�¿ � � Á � � à K®#. %/R&�J%'�3�#����!3$#�2 3�#.��u.�! �F � w����%3 ¿ Ì Ã �� ¿ � � ÃlÊ ÍF ;� Õ ¿ Ì � Ã�ewTwi����)(��3~' ��� #. %/<2:#8��u.'�I*�� %)(&c#. .& � � &cJ%'��: %'*È"z"#�2p�p&��

� ¿ ��� à � � � ¿ Í ¸ Ã�p) � #�2}�p/BK®�� %',IH#. R#�I*I*����/"�! %u.2p�R#�)()('H��&�&cJ"#�&�#�2p2�J8�"7"'H�(7%2!#. %'*),&�J"#�&�)('H7"#.�(#�&n'-&�J%'�&c�(#��! %�! %u$/Z#�&�#¿�� à J"# � '�#$3�#.��u.�! F���� Í� � Õ�

References

Page 38: A quick introduction to Statistical Learning Theory and Support Vector Machines

�ewZ&cJ%'6&��(#��: %�! %u,)('*&5¿�� à IH#. x %�.&�C�'6)('H7"#.�n#�&('*/�C8��#5J8��7�'H�(7%2!#. %'.K.&cJ%'�3$#.��u.�! �C"'*&�s�'*'H �7"#�&(&n'H�( %)��wL&�J%'�&Fs��,I*2!#�)()('*)6C�'*I*��34'*)�#8�(C%�p&��(#.���x)�3$#�2p2iK���'*)�z%2p&(�! %u��! 4&�J%' � #�2:z%',��wZ&�J%'�wyz" %I*&(�p�� "#�2 � ¿ � Ã&�z"�( %�! %u$#.�(C%�p&��n#.����2!#.��u.'.� ¶ #��%�:3~�}�*�! %u4&�J%'@w�z" %I*&n�}�% "#�2Æ¿ Ì � à z" %/"'H��I*�� %)(&��(#��: 8&()�¿ Ì · à #. %/�¿ Ì » Ã�� %'�&�J%'H��'cwi����'�'*�p&�J%'H�l��'H#�I*J%'*)Æ#�3�#����!34z"3 ¿e�! �&cJ%�})lIH#�)('6�� %'5J"#�)TI*�� %)(&c�(z%I*&('*/4&�J%'�J8�"7"'H�(7%2!#. %'s5�p&�Jx&�J%'53$#��%�!3$#�2"3$#.��u8�: F � à KH���¤�� %'�{N %/")l&�J"#�&T&�J%'53$#��%�!34z"3'*��I*'*'*/")l)(�%34'6u.� � 'H <¿e2!#.��u.' ÃI*�� %)(&c#. .& � � ¿e�! �s-J%�pI*J�IH#�)('�#x)('H7"#.�(#�&(�p�� ���w�&�J%',&��(#��: %�! %u�/Z#�&�#xs5�p&�J�#$3$#.��u.�! $2!#.��u8'H��&�J%'H � Í D � � �p)5�!3$7"�8)()(�!C%2p' à �>@J%' 7"����C%2p'H3 ��w�3$#��%�!34�p�*�: %u w�z" %I*&n�}�% "#�2o¿ Ì � à z" %/"'H��I*�� %)(&��(#��: 8&()]¿ Ì · à #. %/�¿ Ì » à IH#. C"'�)(�.2 � '*/ � 'H����'c©oI*�p'H 8&(2p��z%)(�! %u�&�J%'�w��.2p2p��s5�! %u�)(I�J%'H3~'.� � � � �p/"'R&�J%'$&��(#��: %�! %u�/Z#�&�#��! .&(��# %z"3�C�'H����wÆ7�����&(�p�� %)�s5�p&�JR#4��'H#�)(�� "#.C%2p'�)�3$#�2}2B %z"34C"'H����wT&��(#��! %�! %u � '*I*&(�%��)6�! R'H#�I�J�7"����&(�p�� L�O%&�#.��&B��z%&6C8��)(�.2 � �! %u�&�J%'�È"z"#�/Z�(#�&n�}I�7"���.u��n#.3$34�! %u�7"����C%2p'H3/"'*&('H�(34�! %'*/oC8��&�J%'�{N��)(&l7"�%��&(�p�� ��w6&��n#��! %�: %uR/Z#�&�#"�,¹Z���@&�J%�p)�7"����C%2p'H3×&�J%'H��'�#.��'4&�s��o7"�8)()(�!C%2p'$��z%&(I*��3~'.��'*�}&cJ%'H��&�J%�p)�7"�%��&(�p�� ��wl&�J%'�/Z#�&�#~IH#. 1 %�.&,C"'�)n'H7"#.�(#�&('*/1C8�R#oJ8��7�'H�(7%2!#. %'q¿��: �s-J%�pI�J<IH#�)n'�&�J%'�wyz%2p2B)('*&,��wÆ/Z#�&�#4#�)s�'*2p2NIH#8 $ %�.&5C"',)('H7"#.�(#�&('*/ à K.�%�l&�J%',��7%&(�!3$#�2LJ.�"7"'H�n7%2:#8 %',w����æ)('H7"#.�(#�&(�! %u�&cJ%'�{N��)(&�7"����&n�}�% $��w&�J%'�&��n#��! %�: %u�/Z#�&�#~�})Gw���z" %/B�A '*&�&�J%' � '*I*&n���5&cJ"#�&�3�#����!34�p�*'*)�w�z" %I*&(�p�� "#�2-¿ Ì � à �! �&cJ%'�IH#�)('���w�)('H7"#.�(#�&(�p�� R��w�&�J%'�{N��)(&7"�%��&(�p�� �C�' � + ��=�34�� %u]&cJ%'RI*������/"�: "#�&('*)o��w � '*I*&n��� � + )(��34'1#8��'R'*È�z"#�25&(�]�*'H���®��>5J%'*�I*���(��'*)�7"�� %/<&(�o %�� 8Dy)cz"7"7"����&@&��(#��! %�! %u � '*I*&(����)5��wÆ&�J%�p)�7"����&(�p�� L� ¶ #.9�'�#� %'*s�)('*&5��wÆ&��n#��! %�: %u/Z#�&�#rI*�� 8&�#��! %�! %uv&�J%'1)cz"7"7"����& � '*I*&(����)�w�����3 &�J%'<{N��)(&R7"�%��&(�p�� v��w�&��(#��! %�! %u�/Z#�&c#]#. %/�&�J%'� '*I*&(�%��)B��w"&�J%'l)('*I*�� %/47"����&(�p�� �&�J"#�&B/"��'*)T %�.&B)�#�&n�})�w��,I*�� %)(&��n#��! .&�¿ Ì ¸ à K*s,J%'H��' £ �p)�/"'*&('H�n34�! %'*/C8� � + �¹®����&�J%�p)R)('*&R#v %'*s×w�z" %I*&(�p�� "#�2 � ; ¿ � à �p)RI*�� %)(&��nz%I*&('*/�#. %/�3$#��%�:3~�}�*'*/#�& � ; ����� 8&(�! %z%�: %u�&�J%�p)�7"����I*'*)n)-��wl�! %IH��'H34'H 8&�#�2p2p�1I*�� %)n&��(z%I*&(�! %uR#$)(�82:z%&n�}�% � '*I*&(�%� ��� I*� � 'H���! %uo#�2p2&�J%'$7"�%��&(�p�� %)���w�&�J%'4&��(#��: %�! %u</Z#�&�#��� %'$'*�p&�J%'H��{N %/")�&�J"#�&-�p&��p)x�:3�7"�.)()n�:C%2p'�&(�R)('H7"#.�n#�&('4&�J%'&��(#��: %�! %u�)('*&5s5�p&�J%�%z%&5'H�(����� K��%���� %'�I*�� %)(&c�(z%I*&()5&�J%'x��7%&(�!3$#�2L)('H7"#8�(#�&(�! %uoJ8��7�'H�(7%2!#. %'xw��%��&�J%'wyz%2}2L/Z#�&c#�)('*&HK � � Ê � � � � �.&('.K8&�J"#�&6/Zz"���: %u�&�J%�p)G7"����I*'*)()�&�J%' � #�2!z%'���wB&�J%'@w�z" %I*&n�}�% "#�2 � ¿ � Ã�p)-34�� %�8&(�� %�pIH#�2p2p�R�: %IH��'H#�)(�! %uZK")(�! %I*'434����'�#. %/134����'-&��n#��! %�: %u � '*I*&(����)�#.��'�I*�� %)n�}/"'H��'*/1�! �&�J%'��7%&(�!34�p�H#�&n�}�% LKZ2p'H#�/"�! 
%u�&(�o#4)�3�#�2p2}'H�,#. %/R)�3�#�2p2}'H�@)('H7"#.�n#�&(�p�� RC"'*&�s�'*'H R&cJ%'�&Fs��4I*2!#�)()('*)X�� �� �� ±��*¯�� ��°�-µ�­ �� � ° � � �,­ ���� %)n�}/"'H�,&cJ%'�IH#�)('xs-J%'H��'4&�J%'x&��(#��! %�! %uR/Z#�&�#4IH#8 %�8&�C�'�)('H7"#.�n#�&('*/1s@�}&cJ%��z%&�'H�(����� �5�y <&�J%�p)IH#�)('��� %'o3$#X�1s5#. 8&�&(�R)('H7"#8�(#�&('4&�J%'$&c�(#��! %�! %u )n'*&�s5�p&�J�#R3~�: %�!3$#�2G %z"34C"'H����w�'H�(������)H�$>¤�'*�"7"��'*)()5&�J%�p)@w��%�(3$#�2p2p�R2p'*&-z%)-�! 8&����%/Zz%I*'�)(�%34'� %�� 8DF %'*u�#�&n� � ' � #.���!#.C%2p'*)� � > ¸ Á % Ê Ì Á ÕXÕHÕ ÁJB ��r'�IH#. R %��s�34�! %�!34�p�*'�&cJ%'�wyz" %I*&(�p�� "#�2

� ¿� Ã�Ê 7� � M + � � ¿ Í Ì Ãwi����)�3$#�2}2�� ��¸"K%)�z"C ��'*I*&5&(�$&cJ%'�I*�� %)(&��(#��: 8&()6 � ¿ £ � Å � Ñ �*ÃA> Ì Î� � Á % Ê Ì Á ÕXÕHÕ ÁCB�Á ¿ Í.Í Ã

Ì ¸

References

Page 39: A quick introduction to Statistical Learning Theory and Support Vector Machines

�> ¸ Á % Ê Ì Á ÕXÕHÕ ÁCB Õ ¿ Í Ã¹Z�%�l)�z8©oI*�p'H 8&(2p�o)c3$#�2p2�R&�J%'@w�z" %I*&n�}�% "#�2Æ¿ Í Ì Ã /"'*)(IH���:C�'*)5&�J%'� �z"34C�'H�6��w &cJ%'-&��n#��! %�: %u4'H�(������)��X�¶ �! %�!34�p�*�! %u�¿ Í Ì Ã �� %',{N %/"),)(��34'�3~�: %�!3$#�2�)�z"C%)n'*&5��wl&��(#��! %�! %uo'H�n������)H�¿36 � Ý Á Å � Ý Ã*Á ÕXÕHÕ Á ¿36 � � Á Å � � à Õ�ewl&�J%'*)('�/Z#�&�#4#.��'�'*�%I*2:z%/"'*/<wy����3 &�J%'�&��n#��! %�: %u�)('*&5�� %'�IH#. �)n'H7"#.�(#�&(',&�J%'���'H3�#��! %�: %uR7"#.��&���w&�J%',&��(#��! %�! %u4)('*&6s@�}&cJ%��z%&�'H�(������)X��>¤��)n'H7"#.�(#�&('@&�J%'-��'H3$#��: %�! %uo7"#8��&6��wB&�J%',&��(#��: %�! %u4/Z#�&�#��� %'IH#. �I*�� %)n&��(z%I*&-#. ��%7%&(�!3$#�2B)('H7"#8�(#�&(�! %uoJ8��7�'H�(7%2!#. %'.�>@J%�})@�}/"'H#�IH#. <C"'�'*�"7"��'*)()n'*/�w����n3$#�2p2p�1#�)H�l34�! %�!34�p�*'$&�J%',wyz" %I*&(�p�� "#�2ÌÍ £ ; Ñ�� É � 7�� M + � ¿ Í " Ã

)�z"C ��'*I*&�&(�4I*�� %)(&��n#��! .&()�¿ Í.Í Ã #. %/�¿ Í Ã K�s-J%'H��' É ¿�� à �}),#�3~�� %�.&(�� %�pI,I*�� � '*�4wyz" %I*&(�p�� R#. %/ ��p)�#~I*�� %)(&�#. 8&H�¹®���l)cz8©oI*�p'H .&(2p��2!#.��u.' � #. %/�)�z8©oI*�p'H .&n2}��)�3$#�2}2�lK%&�J%' � '*I*&(�%� £�� #8 %/RI*�� %)(&�#. 8& � � K%&�J"#�&34�! %�!34�p�*'�&�J%'Gw�z" %I*&(�p�� "#�2æ¿ Í " à z" %/"'H��I*�� %)(&��n#��! .&()�¿ Í.Í Ã #8 %/ ¿ Í Ã K./"'*&('H�(3~�: %',&�J%'-J8�"7"'H�(7%2!#. %'&�J"#�&434�! %�!34�p�*'*)o&�J%'R %z"3�C�'H�x��w-'H�(���%��)��� r&�J%'�&��(#��! %�! %u�)('*&$#. %/�)('H7"#.�(#�&n'$&�J%'R��'*)(&���w-&�J%''*2p'H34'H 8&()-s@�}&cJR3$#��%�:3�#�2�3$#.��u.�! L�� �.&('.K�J%��s�' � 'H�*K@&�J"#�&R&�J%'�7"����C%2p'H3 ��w�I*�� %)(&c�(z%I*&(�! %u�#�J.�"7"'H�(7%2!#. %'�s-J%�pI*J�34�! %�!34�p�*'*)&�J%'] %z"3�C�'H����w�'H�(���%��)o�% v&�J%'�&��(#��: %�! %u�)('*&o�p)R�! �u.'H %'H�n#�2 �,¨ DyI*�%3$7%2p'*&('.� >¤� # � �.�p/ �,¨ DI*��3$7%2p'*&('H %'*)n)���wG�%z"��7"����C%2p'H3 s�'�s5�p2}2�I*�� %)n�}/"'H�x&�J%'�IH#�)('$��w � Ê Ì ¿e&�J%'4)�3$#�2}2p'*)(& � #�2!z%'o��w��wi���4s-J%�pI*J�&�J%'1��7%&n�:3~�}�H#�&(�p�� �7"���%C%2}'H3 ¿ Ì � à J"#�)o#�z" %�pÈ�z%'�)(�.2!z%&(�p�� à ���y v&cJ%�})�IH#�)('<&�J%'wyz" %I*&(�p�� "#�2�¿ Í " à /"'*)(IH���!C"'*)�¿Ôw��%��)�z8©oI*�p'H 8&(2p��2!#.��u.' �4à &�J%'R7"����C%2p'H3 ��w-I*�% %)(&��(z%I*&(�! %u]#R)n'H78D#.�(#�&(�! %u�J.�"7"'H�n7%2:#8 %' s,J%�}I*J�3~�: %�!34�p�*'*)�b���g<\�^ � `��.gHf � �.b��e`.j�\�K ZK6��w�&��(#��: %�! %u�'H�(������)o#. %/3$#��%�!34�p�*'*)4&�J%'o3$#.��u.�! <w��%��&�J%'$I*�%�(��'*I*&(2p��I*2!#�)()(�|{Z'*/ � '*I*&(�%��)H�o��w-&�J%'$&c�(#��! %�! %u /Z#�&�#�IH#. ]C�')('H7"#.�n#�&('*/$s5�p&�J%�%z%&�'H�n������)�&�J%'-I*�% %)(&��(z%I*&('*/�J.�"7"'H�(7%2!#. %'xI*�.�! %I*�p/"'*)5s5�p&�J�&�J%'-�%7%&(�!3$#�2N3�#.��u.�! J8��7�'H�(7%2!#. %'.��� <I*�� .&c�(#�)(&5&n�o&�J%'xIH#�)('4s5�p&�J�� � Ì &cJ%'H��'�'*�%�p)(&()�#. <'c©oI*�p'H 8&�34'*&�J%�%/")5wi���G{N %/"�: %u<&�J%')(�.2!z%&(�p�� ���w�¿ Í " à �: �&cJ%'5IH#�)(',��w � Ê Ì �lAL'*&5z%)6IH#�2}2L&�J%�p)�)(�.2!z%&(�p�� <b���g�\�`��*b ���8a�� �þj ���n_�gXai_ � �.j®gH��� �=�7"7�'H %/"�p� =s�'�I*�� %)(�p/"'H�@&�J%'�7"����C%2p'H3���wæ3~�: %�!34�p�*�! %u1&cJ%',w�z" %I*&(�p�� "#�2ÌÍ £ ; Ñ � É � 7� � M + � ¿ Í � Ã

)�z"C ��'*I*&4&(�1&cJ%'oI*�� %)n&��(#��! 8&()q¿ Í8Í Ã #. %/�¿ Í Ã K¤s-J%'H��' É ¿�� à �p)$#�34�� %�.&(�% %�}I�I*�� � '*�<w�z" %I*&n�}�% s5�p&�J É ¿i¸ Ã%Ê ¸��6>B��)(�!3$7%2p�|w��<&�J%'�wi���(34z%2!#�)5s�'��� %2p�</"'*)(IH���!C"'4&�J%'xIH#�)('���w É ¿�� Ã�Ê � ; �! <&�J%�p))('*I*&(�p�� L�@¹Z���@&�J%�p)@w�z" %I*&(�p�� <&�J%'x��7%&(�!34�p�H#�&(�p�� �7"���%C%2}'H3���'H3$#��! %)�#4È"z"#�/Z�(#�&(�pI$7"���.u%�(#.3$34�! %u7"����C%2p'H3R���� áeúyßnâøêHâøê��-ä�úiúyènúBâªü¤çéä�úyä �Hä��Hê�ä ��ß�üBß�å�ß�áiáiä�úiê&.Lç�äyúyäláeç�äTâøêéä &�1éßn÷øâªáeà$ìþÿcÿ�ð®ç�è(÷ �Xü .Lâªáeç�����òHõ

Ì.Ì

References

Page 40: A quick introduction to Statistical Learning Theory and Support Vector Machines

�� 4=�7"7"'H %/"�p�R=�s�'@)�J%��sv&�J"#�&6&cJ%' � '*I*&(��� £ K.#�)lwi���l&�J%'@��7%&(�!3$#�2LJ.�"7"'H�(7%2!#. %'�#�2pu.�����}&cJ"3RKIH#. RC�'�s-���p&(&n'H R#�)-#42p�! %'H#.�@I*��34C%�: "#�&(�p�� %)-��wÆ)�z"7"7�����& � '*I*&(����)5Å � �£ � Ê 7� � M + � �� 6 � Å � Õ>¤�${N %/�&�J%' � '*I*&(�%� � Ï Ê ¿�� + Á ÕXÕHÕ Á �'7 à �� %'4J"#�),&(�o)n�.2 � '�&�J%'4/Zz"#�2TÈ"z"#�/Z�(#�&(�pI$7"���.u%�(#.3$34�! %u7"����C%2p'H3���w63$#��%�:3~�}�*�! %u

� ¿ � Á��éÃTÊ � Ï � Î�ÌÍ � � Ï � � Ñ � ;��� ¿ Í · Ã)�z"C ��'*I*&5&(��I*�� %)(&��(#��: 8&() � Ï� Ê ¸ Á ¿ Í » Ã�@> ¸ Á ¿ Í � Ã� � � � � � Á ¿ Í � Ãs-J%'H��' � Á � Á KL#. %/ � #.��'�&cJ%'$)�#.34'�'*2}'H3~'H .&()4#�)�z%)('*/��! �&�J%'$��7%&n�:3~�}�H#�&(�p�� �7"����C%2p'H3�wi���I*�� %)(&c�(z%I*&(�! %u]#. ]�%7%&(�!3$#�2�J8�"7"'H�(7%2!#. %'.K � �p)~#<)(IH#�2:#8�*Kl#. %/Ö¿ Í � à /"'*)(IH���!C�'*)oI*�%����/"�! "#�&('cDys@�})n'�! %'*È�z"#�2}�p&(�p'*)H�� �.&(',&�J"#�&4¿ Í � à �!3$7%2p�p'*)�&�J"#�&5&�J%'�)c3$#�2p2p'*)(&�#�/Z34�p)()(�!C%2p' � #�2!z%' � �! �wyz" %I*&(�p�� "#�25¿ Í · à �p)�-Ê ����� Ê 3$#�� ¿�� + Á ÕHÕXÕ Á �97 à Õ>5J%'H��'cwi����'�&n�4{N %/ #x)(��wi&-3$#8��u.�! �I*2!#�)()(�|{Z'H�@�� %'�J"#�)@&(�x{N %/ # � '*I*&n��� � &cJ"#�&-3$#��%�!34�p�*'*)

� ¿ � ÃlÊ � Ï � Î�ÌÍ � � Ï � � Ñ � ;����� � ¿�.¸ Ã

z" %/"'H�x&�J%'$I*�� %)n&��(#��! 8&() � > ¸1#. %/�¿ Í » à ��>5J%�p)�7"����C%2p'H3 /"���B'H��wy����3×&cJ%'o7"����C%2p'H3 ��w5I*�� 8D)(&��nz%I*&(�! %u�#. v�%7%&(�!3$#�253$#.��u8�: I*2!#�)()(�|{Z'H���� %2p��C.��&�J%'1#�/"/"�p&(�p�� "#�2,&('H�(3 s5�p&�J �������: &�J%'wyz" %I*&(�p�� "#�2-¿�.¸ à � � z%'�&(�R&�J%�p)�&('H�n3 &�J%'4)(�82:z%&n�}�% �&(��&�J%'$7"����C%2p'H3×��w�I*�% %)(&��(z%I*&(�! %uR&�J%'4)(��w�&3$#.��u.�! �I*2:#�)()(�|{Z'H�,�p)-z" %�pÈ�z%'$#. %/R'*�%�p)(&()@w����@#. .��/Z#�&�#4)('*&X�>@J%'4w�z" %I*&n�}�% "#�2�¿�.¸ à �p)$ %�.&�È"z"#�/Z�(#�&(�pIqC�'*IH#.z%)('���w5&�J%'o&n'H�(3×s5�p&�J � ����"� ¶ #��%�!34�p�*�: %u¿�.¸ à )�z"C ��'*I*&�&(�4&�J%',I*�� %)(&��(#��: 8&() � > � #. %/�¿ Í » à C"'*2p�� %u8)5&(��&cJ%'�u�����z"7���w�)(��DyIH#�2p2}'*/RI*�% � '*�7"���.u%�(#.3$34�! %u]7"����C%2p'H34)X��>5J%'H��'cwi����'.K�&(��I*�� %)(&c�(z%I*&4)(��wi&~3�#.��u.�! rI*2:#�)()(�|{Z'H�$�% %'RIH#. ]'*�p&�J%'H�)(�.2 � '�&�J%'�I*�� � '*��7"���.u%�(#.3$34�! %u�7"����C%2p'H3 �! �&�J%' B Dy/"�!34'H %)(�p�� "#�25)�7"#�I*'���w-&�J%'R7"#.�(#834'*&('H��)� K"�����% %'�IH#. �)(�82 � '�&�J%'xÈ�z"#�/Z�n#�&(�pI�7"���8u��(#.3$3~�: %u�7"����C%2p'H3 �! �&�J%'�/Zz"#�2 B6Ñ Ì )�7"#�I*'���wl&�J%'7"#.�(#834'*&('H��) � #. %/ � �,�y <��z"�@'*��7�'H���!34'H 8&()�s�'�I*�� %)(&c�(z%I*&-&cJ%'�)n��w�&,3$#.��u.�! <J.�"7"'H�(7%2!#. %'*)4C.�)(�.2 � �! %uo&�J%'�/Zz"#�2BÈ�z"#�/Z�(#�&(�pI�7"���.u��(#.3�34�! %u~7"����C%2p'H3R�Ì Í

References

Page 41: A quick introduction to Statistical Learning Theory and Support Vector Machines

� �� � ¯ � ±�² ±�� � ±�­��,± �c³�¯Lµ(±�­ ±��R¯ � �� ±@¯ �.� °B±x²$³�´T¯�µ�­�� � �¯¤³�° � � �5´ >5J%'R#�2pu.�����p&�J"3~)�/"'*)(IH���!C�'*/v�: r&�J%'R7"��' � �}�%z%)�)n'*I*&(�p�� %)4I*�� %)(&��(z%I*&4J.�"7"'H�n7%2:#8 %'*)��! �&�J%'��: "7"z%&)�7"#�I*'.�$>¤�RI*�� %)(&c�(z%I*&�#1J8��7�'H�(7%2!#. %'��: r#�w�'H#�&cz"��'4)�7"#�I*'$�� %'~{N��)(&�J"#�)�&(�R&c�(#. %)�wi���(3�&�J%'oÄ�D/"�!34'H %)(�p�� "#�2¤�: "7"z%& � '*I*&n���lÅ<�: 8&(�4#. o¾�Dy/"�!34'H %)(�p�� "#�2Zwi'H#�&�z"��' � '*I*&(���æ&�J"����z%u%J$#xI�J%�.�pI*',��w�#. ¾�Dy/"�!34'H %)(�p�� "#�2 � '*I*&(���æwyz" %I*&(�p�� ��T���� Ø�� � Õ=� �¾�/"�!34'H %)(�p�� "#�2T2p�! %'H#.��)('H7"#.�n#�&(��� £ #. %/R#4C%�!#�) � �})@&�J%'H �I*�� %)(&��nz%I*&('*/�w��%�6&cJ%'-)('*&���w&��(#8 %)�w��%�(34'*/ � '*I*&(����)

�l¿eÅ � Ã6Ê � + ¿eÅ � Ã*Á � ; ¿eÅ � Ã*Á ÕHÕXÕ Á � � ¿eÅ � Ã,Á % Ê Ì Á ÕXÕHÕ ÁJB Õ��2!#�)()n�ª{ZIH#�&(�p�� ]��w@#8 �z" "9% %��s- � '*I*&(�%��Å��p)4/"�� %'oC8��{N��)n&�&��(#8 %)�w��%�(34�! %uR&�J%' � '*I*&(���,&(�<&�J%')('H7"#.�n#�&(�! %u$)�7"#�I*'R¿eÅ� � �l¿eÅ Ãnà #. %/<&�J%'H �&�#.98�! %u$&�J%'�)n�}u% ���wÆ&cJ%',w�z" %I*&(�p�� �l¿eÅ ÃlÊ £ � �l¿eÅ ÃLÑ � Õ ¿� Ì Ã=,I*I*����/"�! %u4&(��&�J%'�7"����7�'H��&(�p'*)���wN&cJ%'-)(��w�&63$#8��u.�! 4I*2!#�)()(�|{Z'H�53~'*&�J%��/�&�J%' � '*I*&(��� £ IH#. $C�'s-���p&(&n'H R#�)-#42p�! %'H#.�@I*��34C%�: "#�&(�p�� R��wl)�z"7"7�����& � '*I*&n����)�¿e�! �&�J%'�w�'H#�&�z"��'�)�7"#�I*' à �Æ>@J"#�&-34'H#. %)£ Ê 7� � M + 6 � � � �l¿eÅ � Ã Õ ¿� Í Ã>@J%'$2p�: %'H#8���p&F�r��w5&�J%'$/"�.&�DF7"����/Zz%I*&4�!3$7%2p�p'*)HKl&�J"#�&x&�J%'$I*2!#�)()n�ª{ZIH#�&(�p�� rw�z" %I*&n�}�% ����! �¿� Ì Ãwi���5#. Rz" "9% %��s- � '*I*&(���GÅ �% %2}�R/"'H7�'H %/")��� �&�J%'�/"�.&�DF7"����/Zz%I*&n)H�

�l¿eÅ ÃlÊ �l¿eÅ Ã � £ Ñ ��Ê 7� � M + 6 � � � �l¿eÅ Ã � �l¿eÅ � ÃBÑ � Õ ¿� Ã>5J%'��p/"'H#,��wZI*�� %)(&c�(z%I*&(�! %u�)�z"7"7"�%��&�D � '*I*&(�%��)Æ %'*&�s����(98)BI*��3~'*)Tw����%3�I*�� %)n�}/"'H���: %uxu.'H %'H�(#�28wi���(34)��wl&�J%'�/"�.&�DF7"����/Zz%I*&5�! <# � �p2!C"'H��&�)�7"#�I*'oº Í ¼F�

�l¿�� à � �l¿�� Ã���� ¿�� Á � Ã Õ ¿��" Ã=,I*I*����/"�! %u$&n��&cJ%' � �p2:C�'H��&�DFO%I*J"34�p/"&�>5J%'*�����<ºª·X¼�#. 8�$)(�"3$34'*&����pI@w�z" %I*&n�}�% � ¿�� Á � à K�s5�p&�J� ¿�� Á � à :�� ; K�IH#. RC"'x'*��7"#. %/"'*/<�! R&�J%',wi���(3

� ¿�� Á � Ã�Ê �� � M + � � � � ¿�� à � � � ¿�� Ã,Á ¿��� ÃÌ

References

Page 42: A quick introduction to Statistical Learning Theory and Support Vector Machines

s-J%'H��' � � : �#8 %/�� � #.��'�'*�pu.'H � #�2:z%'*)�#8 %/1'*�pu.'H 8wyz" %I*&(�p�� %)� � ¿�� Á � à � � ¿�� Ã�� � Ê � � � � ¿�� à Õ��wl&�J%'��! 8&('*u��(#�2B��7"'H�n#�&(����/"'c{N %'*/]C.��&�J%'49�'H�( %'*2 � ¿�� Á � à �T=�)cz8©oI*�p'H .&�I*�� %/"�p&(�p�� �&(�$'H %)�z"��'&�J"#�&�¿��" à /"'c{N %'*)�#]/"�8&�DF7"���%/Zz%I*&o�! �#�wi'H#�&�z"��'1)�7"#�I*'<�p)R&�J"#�&o#�2p2,&�J%'�'*�pu.'H � #�2!z%'*)1�! �&�J%''*�"7"#. %)(�p�� ¿� � à #.��']7"�8)(�p&(� � '.� >B�vu%z"#.�(#. 8&('*'�&�J"#�&R&�J%'*)n' I*�%'c©oI*�p'H 8&()]#.��']7"�.)n�}&n� � '.K��p&R�}) %'*I*'*)()�#8���q#8 %/1)�z8©oI*�p'H 8&~¿ ¶ 'H��)('H���ø)5>5J%'*����'H3 à &�J"#�&5&�J%'�I*�% %/"�}&n�}�% ��� � ¿�� Á � Ã�� ¿�� Ã�� ¿ � Ã�� � � � ��¸�p)-)c#�&(�p)�{Z'*/�wi���5#�2p2 � )�z%I*J�&�J"#�& � � ; ¿�� Ã�� � � Õ¹Lz" %I*&(�p�� %)�&�J"#�&�)�#�&(�p)�wi� ¶ 'H��)('H���ø)�&�J%'*����'H3½IH#. $&cJ%'H��'cw��%��'-C"'�z%)('*/R#�)6/"�8&�DF7"���%/Zz%I*&()H�l=,�p�cD'H�(3$#8 LK�Ç��(# � 'H�(3$#8 #8 %/ EG�.�*�� %�%'H�$º Ì ¼6I*�% %)(�p/"'H��#�I*�% � �82:z%&n�}�% ���w5&�J%'4/"�.&�DF7"���%/Zz%I*&��! r&�J%'wi'H#�&�z"��'�)�7"#�I*'�u.� � 'H 1C8�$wyz" %I*&(�p�� ���wÆ&cJ%',w����n3� ¿�� Á � ÃTÊ '*��7��"Î Ò �qÎ � Ò� Á ¿�.· Ãs-J%�pI*JR&�J%'*��IH#�2p2 ¨ �8&('H .&n�:#�2N¹Lz" %I*&(�p�� %)H�� ��s�' � 'H�*K®&�J%'$I*�� � �.2!z%&(�p�� ���w�&�J%'4/"�.&�DF7"����/Zz%I*&x�: w�'H#�&cz"��'4)�7"#�I*'$IH#. rC"'$u.� � 'H rC.��#. .�wyz" %I*&(�p�� �)�#�&(�p)�wi�%�: %u�&�J%' ¶ 'H��)('H��'H��� )6I*�� %/"�p&(�p�� LK®�! R7"#.��&(�pIHz%2!#.��&(�4I*�� %)n&��(z%I*&57"�.2p�" %��34�!#�2BI*2!#�)�D)(�|{Z'H�@��wÆ/"'*u%��'*' � �! <Ä"Dy/"�!34'H %)n�}�% "#�2l�! "7"z%&-)�7"#�I*'��� %'�IH#. Rz%)n'�&�J%',wi�.2p2p�és@�: %u�wyz" %I*&(�p�� � ¿�� Á � ÃBÊ ¿�� � � Ñ Ì Ã�� Õ ¿�.» ÃM@)(�! %u�/"�4�B'H��'H 8&,/"�.&�DF7"���%/Zz%I*&() � ¿�� Á � à �� %'xIH#. RI*�� %)n&��(z%I*&,/"�4�B'H��'H 8&,2}'H#8�( %�! %u13$#�I*J%�: %'*)s5�p&�J]#8�(C%�p&��(#.���p2p�]&��"7"'*)���wG/"'*I*�p)(�p�� r)�z"��wy#�I*'*)�º4X¼F�$>5J%'�/"'*I*�p)(�p�� r)�z"��wy#�I*'$��w5&�J%'*)('�3$#�I*J%�: %'*)J"#�)-#�w����n3�l¿eÅ ÃlÊ 7� � M + 6 � � � � ¿�Å Á Å � Ã�Ás-J%'H��'�Å � �p)�&�J%'6�!3$#�u8'���wN#,)�z"7"7"����& � '*I*&(���¤�! ��! "7"z%&�)�7"#�I*'5#. %/ � � �p)l&�J%'�s�'*�pu�J8&T��wN#,)�z"7"7"�%��&� '*I*&(�%�6�! <&�J%',wi'H#�&�z"��'�)c7"#�I*'.�>¤��{N %/�&�J%' � '*I*&(����)6Å � #. %/�s�'*�}u%J.&()/� � �� %'5wi�.2p2p��s�&cJ%'-)�#834'5)(�.2!z%&(�p�� �)(I*J%'H34'�#�)æw����æ&�J%'�����pu.�! "#�2���7%&(�!3$#�263�#.��u.�! <I*2!#�)()(�|{Z'H�x����)n��w�&�3$#.��u.�! �I*2!#�)()(�|{Z'H�*�R>5J%'4�� %2p�]/"���B'H��'H %I*'$�p)4&�J"#�&�! %)(&('H#�/<��w63$#�&����p� � ¿e/"'*&('H�n34�! %'*/ C8�]¿ Ì � Ã(à �� %'�z%)n'*)5&�J%'�3�#�&����p�� � � Ê 6 � 6 � � ¿eÅ � Á Å � Ã�Á % Á��xÊ Ì Á ÕXÕHÕ Á � Õ

Ì "

References

Page 43: A quick introduction to Statistical Learning Theory and Support Vector Machines

� � ­ ° ��� � �G¯B³�° �� ± � � ³ �@� ±@°L¯ �� ´T¯L±5° � ¯���±5°�� ������ � 4 -� � +� � �! #- 2 �!��������� ��= #4 - +��&������5)��������4 +=����� ��� �!4 +�� � ��� 4 +� ��% ���"!#� �� -5�>¤�$I*�� %)(&��nz%I*&G#~)�z"7"7"����&�D � '*I*&(���� %'*&�s����(98)�/"'*I*�})n�}�% 1�(z%2p'��% %'�J"#�)5&(�4)(�.2 � '�#~È�z"#�/Z�n#�&(�pI���7%&(�|D34�p�H#�&(�p�� <7"����C%2p'H3R� � ¿ � Ã�Ê � Ï � Î ÌÍ � � Ï � � Ñ � ;� � Áz" %/"'H�@&�J%'�)(�!3$7%2p'4I*�� %)(&��(#��: 8&()H� � � � � � � Á� Ï Ê ¸ Ás-J%'H��'�3�#�&����p� � � � Ê 6 � 6 � � ¿eÅ � Á Å � Ã�Á % Á��xÊ Ì Á ÕXÕHÕ Á � Õ�p)�/"'*&n'H�(34�! %'*/�C8�-&cJ%'6'*2p'H34'H 8&()���w�&�J%'6&c�(#��! %�! %u-)('*&XKH#. %/ � ¿�� Á � à �p)�&cJ%'Tw�z" %I*&(�p�� �/"'*&('H�(34�! %�! %u&�J%'�I*�� � �.2!z%&(�p�� ���wl&�J%'�/"�.&�DF7"���%/Zz%I*&()H�>@J%'6)(�.2!z%&(�p�� x&(�,&�J%'���7%&(�!34�p�H#�&n�}�% �7"����C%2p'H3IH#. �C�'�w��%z" %/�'c©oI*�p'H 8&(2p�~C8��)(�.2 � �! %u��: 8&('H�(3~'cD/"�!#�&('���7%&n�:3~�}�H#�&(�p�� 17"���%C%2}'H3~)5/"'*&('H�(34�! %'*/�C.�$&cJ%'�&��(#��! %�! %u$/Z#�&�#�K�&cJ"#�&�IHz"�(��'H 8&(2p�oI*�� %)n&(�p&�z%&('&�J%'�)�z"7"7"�%��& � '*I*&n����)H�R>5J%�p)4&('*I*J" %�}È"z%'R�p)4/"'*)(IH���!C"'*/v�! �O%'*I*&(�p�� ���1>5J%'$��C%&c#��! %'*/���7%&n�:3�#�2/"'*I*�p)(�p�� �w�z" %I*&n�}�% R�p)�z" %�pÈ"z%'�$��% #�I*J���7%&(�!34�p�H#�&(�p�� <7"����C%2p'H3 IH#8 RC"'�)n�.2 � '*/1z%)(�! %uq#8 .��)(&�#. %/Z#8��/R&('*I*J" %�}È"z%'*)H���� � �����&�'�����4 + �(��� ��� �!4 +)� � ��� 4 +� � �� $+*0-� -,�� +��=$'&/. $��!�� �-��Ç���I�J"#. %u8�: %u<&�J%'4wyz" %I*&(�p�� � ¿�� Á � à wi���,&�J%'$I*�� � �.2!z%&(�p�� ��w�&�J%'�/"�.&�DF7"���%/Zz%I*&��� %'�IH#. <�:3�D7%2p'H34'H 8&-/"���B'H��'H 8&- %'*&�s��%�(9.)X��� �&�J%'� %'*�%&R)('*I*&(�p�� Ös�'rs5�p2}2xI*�� %)(�p/"'H�<)�z"7"7"�%��&�D � '*I*&(�%�R %'*&Fs����n9.)R3$#�I*J%�! %'*) &cJ"#�&1z%)('7"�82}�" %��34�!#�2�/"'*I*�p)(�p�� �)�z"�Fw�#�I*'*)X�o>¤�1)�7�'*I*�ªwi��7��.2p�� %��3~�:#�2})���w�/"���B'H��'H 8&��%��/"'H� � �� %'$IH#. �z%)('&�J%',wi�.2p2p�és@�: %u�wyz" %I*&(�p�� %)5wi����I*�% � �82:z%&n�}�% ���wÆ&cJ%'�/"�.&�DF7"���%/Zz%I*&

� ¿�� Á � ÃBÊ ¿�� � � Ñ Ì Ã�� ÕE5#�/"�!#�2lÇ�#�)(�p)5¹Lz" %I*&(�p�� 13$#�I�J%�! %'*),s5�p&�JR/"'*I*�p)(�p�� �wyz" %I*&(�p�� %)-��w�&�J%'�wi���(3�l¿eÅ ÃlÊ )n�}u% � Ø�� M + � � '*�"710 Ò ÅRÎ�Å

� Ò ;� ; 2 IH#. RC�'��!3$7%2p'H34'H 8&('*/]C.�Rz%)(�! %uoI*�% � �82:z%&n�}�% %)-��wl&�J%'�&��"7"'� ¿�� Á � ÃBÊ '*��7 0 Î Ò �1Î � Ò ;� ; 2 Õ3!( çéä �XäFù�âªüeâªènê&+ 1Hê�ù�áeâªè(êxâªü01HêXâ &�1éä % 1�áBê�è�áLâªáiü¤ä 3cå�ß(ê�üeâªè(ê$ènê�ü#1XåHå.ènúyá �cä�ùFáiènúyü�õ

Ì �

References

Page 44: A quick introduction to Statistical Learning Theory and Support Vector Machines

�y &�J%�p)�IH#�)('�&�J%'$)�z"7"7�����&�D � '*I*&n���� %'*&�s��%�(9R3$#�I*J%�! %'os5�p2p26I*�% %)(&��(z%I*&�C��.&�J�&�J%'�I*'H .&('H��)�Å � ��w&�J%'�#.7"7"���é�%�!3$#�&(�! %u4wyz" %I*&(�p�� <#. %/1&�J%'�s�'*�pu�J8&()$� � ��� %'@IH#. 4#�2p)(���! %I*���(7����(#�&n'-#�7"���p�����$9% %�és@2}'*/"u8'5��wN&cJ%'G7"����C%2p'H3 #�&6J"#. %/�C.�4I*�� %)(&c�(z%I*&(�! %u)�7�'*I*�:#�2�I*�% � �82:z%&n�}�% �wyz" %I*&(�p�� %)H�1O"z"7"7�����&�D � '*I*&(���� %'*&�s����(98)�#.��'�&�J%'H��'cwi����'o#<�(#�&�J%'H�,u.'H %'H�n#�2I*2!#�)()���wB2p'H#.�( %�! %uo3$#�I�J%�! %'*)�s-J%�pI�J�I*J"#. %u.'*)��p&()6)('*&���wB/"'*I*�p)(�p�� �wyz" %I*&(�p�� �)(�!3$7%2p�qC8�4I�J"#. %u8�: %u&�J%',wi���(3 ��wÆ&�J%'�/"�8&�DF7"���%/Zz%I*&H������ ������G4 +=��� � ��� � 45+ � � ��� 4 +� �� $.-� � 4 -5�!+!4 & 4����0� -�� +!$'&� �� $,�! #4 -%/ �� #&# � )>¤�vI*�� .&c���.2,&�J%'�u.'H %'H�(#�2p�p�H#�&(�p�� #.C%�p2p�}&�����w$#�2}'H#8�( %�! %u�3$#�I*J%�! %'*)1�� %'�J"#�)R&(��I*�� .&c���.2,&Fs��/"���B'H��'H .&xwy#�I*&(����)H��&�J%'$'H�n������DF�(#�&('4�� ]&cJ%'$&��(#��! %�! %u�/Z#�&�#<#. %/]&�J%'�IH#.7"#�I*�p&�����w5&�J%'o2p'H#.�n %�: %u3$#�I*J%�! %'�#�)G3~'H#�)�z"��'*/RC.�4�p&()�0-��Dy/"�!34'H %)n�}�% �º Ì "X¼F�l>5J%'H��','*���p)(&5#xC"��z" %/�wi���l&cJ%'�7"���%C"#.C%�p2}�p&����wl'H�(������)��% �&�J%'�&('*)n&�)n'*&5��wl&�J%',w��82}2p��s5�! %u�w����n3R�Ts5�p&�J17"���%C"#.C%�p2}�p&�� Ì Î�$&�J%'��! %'*È"z"#�2p�p&F�¨ �H¿e&n'*)(&�'H�(����� à � ¹N��'*È"z%'H %I*��¿e&��(#��: %�! %u$'H�n����� ÃBÑ �G�� 8{Z/"'H %I*'��� .&('H� � #�2 ¿� � Ã�p) � #�2}�p/B���y �&�J%'�C���z" %/�¿� � à &cJ%'-I*�� 8{Z/"'H %I*'x�! .&('H� � #�2B/"'H7�'H %/")-�% $&�J%'�0,��Dy/"�!34'H %)(�p�� R��w�&�J%'2p'H#.�( %�! %uR3$#�I*J%�: %'.K®&�J%'� %z"34C"'H�@��wÆ'*2p'H34'H 8&()5�! �&�J%'�&��n#��! %�: %u�)('*&HK"#. %/<&�J%' � #�2:z%'x��w�L�>@J%'�&�s���wy#�I*&(����),�! �¿ � à w����n3�#�&��(#�/"'cDy���6�,&�J%'$)c3$#�2p2p'H��&�J%'$0,��Dy/"�!34'H %)(�p�� r��w�&�J%'�)('*&��wBwyz" %I*&(�p�� %),��wl&�J%'�2p'H#.�( %�! %uq3�#�I�J%�! %'.K"&cJ%'�)�3$#�2p2p'H�5&cJ%'�I*�� 8{Z/"'H %I*'��! 8&('H� � #�2�KLC"z%&5&�J%'�2!#.��u.'H�&�J%' � #�2!z%'���w�&�J%'�'H�n�����æw���'*È�z%'H %I*�.�=�u.'H %'H�n#�2Zs5#X��wi���6��'*)(�.2 � �! %u4&�J%�p)6&c�(#�/"'cDy���1s5#�)67"���%7"�.)('*/R#�)�&�J%'-7"���! %I*�!7%2p'���wB)(&��nz%I*&�z"�(#�2���p)�9$3~�: %�!34�p�H#�&n�}�% L�lwi���l&cJ%'-u.� � 'H �/Z#�&�#�)('*&��� %'�J"#�)6&(��{N %/1#�)(�82:z%&n�}�% �&�J"#�&�3~�: %�!34�p�*'*)5&cJ%'*�:�)�z"3R��=×7"#.��&(�pIHz%2!#.�4IH#�)n'o��w,)(&��(z%I*&�z"�n#�2����p)�9r34�! %�:3~�}�H#�&(�p�� �7"���! %I*�:7%2p'��p)4&�J%'1�,I*IH#.3xDFE5#��*���7"���! %I*�!7%2p'.�-9�'*'H7�&�J%',{Z)n&-&('H�n3 '*È"z"#�2B&n�$�*'H���o#. %/13~�: %�!34�p�*'$&cJ%'�)('*I*�� %/1�% %'.���&5�p)�9% %��s- �&�J"#�&5&�J%'�0,��Dy/"�!34'H %)(�p�� <��wl&�J%'�)('*&5��wl2p�! %'H#.�5�! %/"�pIH#�&(�%��wyz" %I*&(�p�� %)� ¿eÅ ÃTÊ )(�pu� �¿ £ � Å Ñ �*Ã,Á Ò Å Ò � � Is5�p&�J�{Z�%'*/ &cJ"��'*)�J%�.2p/ � �})�'*È"z"#�2l&(��&�J%'4/"�!34'H %)(�p�� "#�2p�p&��]��w6&�J%'4�! "7"z%&�)�7"#�I*'.� � ��s�' � 'H�*K®&�J%'0-��Dy/"�!34'H %)n�}�% 1��w�&�J%'�)cz"C%)('*&� ¿eÅ ÃTÊ )(�pu� B¿ £ � Å Ñ �*Ã,Á Ò Å Ò � �tÁ Ò £ Ò � ���¿e&�J%'�)('*&���w�w�z" %I*&n�}�% %)�s@�}&cJ�C���z" %/"'*/ %�%�(3���w�&�J%'$s�'*�pu�J8&() à IH#8 ]C"'�2p'*)()4&�J"#. <&�J%'�/"�:3~'H 8D)(�p�� "#�2p�p&��R��wl&�J%'��! "7"z%&-)�7"#�I*'�#. %/Rs5�p2p2T/"'H7"'H %/��� ��� �¹L����3 &�J%�p)~7��.�! .&���w � �}'*s &�J%'���7%&(�!3$#�2G3�#.��u.�! 
rI*2:#�)()(�|{Z'H�$34'*&�J%�%/v'*�%'*IHz%&('q#8 ��,I*IH#.3�DE5#��*�%��7"���: %I*�!7%2p'.�5�i&�9�'*'H7�&�J%'5{N��)(&6&('H�n3 ��w�¿ � à '*È"z"#�2L&(���*'H���R¿iC8�$)�#�&(�p)�wi�%�: %u4&�J%'@�: %'*È"z"#�2p�p&��¿ � Ãnà #. %/��p&G3~�: %�!34�p�*'*),&�J%',)('*I*�� %/�&('H�(3t¿�C.�o3~�: %�!34�p�*�! %u�w�z" %I*&(�p�� "#�2 £ �ø£ à ��>@J%�})534�! %�!34�p�H#XD&(�p�� R7"��' � 'H .&(),#. R� � 'H��Di{Z&n&(�! %u~7"����C%2p'H3R�� ��s�' � 'H�*KT' � 'H ��! �&�J%'RIH#�)('Rs,J%'H��'R&�J%'�&��(#��! %�! %u�/Z#�&�#1#.��'R)('H7"#.�(#.C%2p'R�� %'13�#H�r��C%&�#��! #RC"'*&n&('H�,u.'H %'H�(#�2p�p�H#�&(�p�� �#.C%�p2p�p&F�]C8�]34�! %�:3~�}�*�! %u�&�J%'$I*�% 8{Z/"'H %I*'o&('H�n3 �! �¿��� à ' � 'H �w�z"��&�J%'H��� <&�J%'4'*�"7"'H %)('���w6'H�(���%��)��� <&�J%'4&��(#��! %�! %u1)n'*&H���y &�J%'4)(��wi&�3$#.��u.�! <I*2!#�)()(�|{Z'H�x34'*&�J%�%/]&�J%�p)

Ì ·

References

Page 45: A quick introduction to Statistical Learning Theory and Support Vector Machines

IH#. RC�'�/"�� %'4C.��I�J%�%�.)(�! %uR#.7"7"����7"���:#�&(' � #�2!z%'*)-��wl&�J%'�7"#.�n#.34'*&('H� � �6�y �&�J%'�)�z"7"7�����&�D � '*I*&n��� %'*&�s����(98)�#�2pu.�����p&�J"3 �� %'�IH#. xI*�� .&c���.2�&�J%'�&��(#�/"'cDy� ��C�'*&�s�'*'H �I*��3$7%2p'*�%�}&��x��wZ/"'*I*�p)(�p�� $�(z%2p'5#. %/wy��'*È�z%'H %I*����w6'H�n�����,C.�<I�J"#. %u8�: %u<&�J%'$7"#.�(#.3~'*&('H� � KL' � 'H <�! �&�J%'$34����'xu.'H %'H�(#�2lIH#�)('4s-J%'H��'&�J%'H��'�'*���p)(&n)Æ %�5)n�.2!z%&(�p�� �s5�p&�J��*'H���,'H�(�����L�% ,&�J%'6&��n#��! %�: %u,)('*&H�T>5J%'H��'cwi����'�&�J%'l)�z"7"7"�%��&�D � '*I*&(�%��) %'*&�s����(94IH#. �I*�% .&����.2BC"�.&�Jqw�#�I*&(����)�wi����u.'H %'H�(#�2p�p�H#�&(�p�� �#.C%�p2p�p&F�R��wÆ&�J%'�2p'H#.�n %�: %u13�#�I�J%�! %'.�

Ì »

References

Page 46: A quick introduction to Statistical Learning Theory and Support Vector Machines

¹l�pu�z"��' �"� % �"#.3$7%2p'*)���w�&�J%',/"�.&�DF7"����/Zz%I*&�¿� � à s5�p&�J ��Ê Í �ÆO"z"7"7�����&57"#�&(&n'H�( %)5#.��'5�! %/"�pIH#�&('*/s5�p&�J�/"��z"C%2p'�I*�!��I*2p'*)HK®'H�(������)@s5�p&�JR#4IH���.)n)H�� ��� �� °Bµ�� ­5¯.����� ­ ���J� � µ �>¤� /"'H3~�� %)(&��(#�&('$&�J%'�)�z"7"7"�%��&�D � '*I*&(�%�� %'*&�s����(9�34'*&cJ%��/�s�'oI*�� %/Zz%I*&�&Fs��<&F�"7"'*)4��w,'*��7�'H���|D34'H 8&()H���]'$I*�� %)(&c�(z%I*&�#.��&(�|{ZI*�:#�26)('*&()x��wG7"#�&(&('H�( %)x�: r&�J%'o7%2!#. %'R#. %/]'*�"7"'H���:3~'H .&4s5�p&�J Í %//"'*u���'*'57��.2p�� %��3~�:#�2"/"'*I*�})n�}�% 4)�z"��wy#�I*'*)HK.#. %/4s�'6I*�% %/Zz%I*&l'*��7�'H���!34'H .&n)6s5�p&�Jx&�J%'���'H#�2|Dy2p�ªwi'�7"����C8D2p'H3 ��wÆ/"�pu.�p&���'*I*�.u� %�p&(�p�� L������ �������,+! #" � -5��� #-��!���� &�$'-��M@)(�! %uo/"�.&�DF7"����/Zz%I*&n)5��wl&�J%',w��%�(3

� ¿�� Á � ÃBÊ ¿�� � � Ñ Ì Ã�� ¿� � Ãs5�p&�J �xÊ Í s�'�I*�� %)(&c�(z%I*&6/"'*I*�p)(�p�� ��(z%2p'*)�w����T/"���B'H��'H 8&6)('*&n)l��w 7"#�&n&('H�( %)l�! $&cJ%'-7%2!#. %'.�6E�'*)�z%2p&()��wB&�J%'*)(','*�"7"'H���!34'H 8&()�IH#. �C"' � �p)�z"#�2p�p�*'*/ #8 %/q7"��� � �p/"'� %�pI*'��p2p2:z%)n&��(#�&(�p�� %)@��wB&�J%'�7���s�'H�æ��wB&�J%'#�2pu.�����}&cJ"3R� % �"#.3$7%2p'*)5#.��',)�J%��s- 4�! $¹l�pu�z"��' �"�l>5J%' Í I*2!#�)n)('*)5#.��'-��'H7"��'*)('H .&n'*/qC8�oC%2!#�I*9$#. %/s-J%�p&('-C"z%2p2p'*&()H���y ~&�J%'�{Zu�z"��'@s�'��! %/"�pIH#�&(',)�z"7"7"�%��&�7"#�&n&('H�( %)ls5�p&�J$#�/"��z"C%2p'5I*�!��I*2p'.KZ#. %/$'H�n������)s5�p&�J�#RIH���8)()H�R>5J%'$)(�82:z%&n�}�% %)o#.��'$�%7%&(�!3$#�2l�! �&�J%'$)('H %)n'o&�J"#�&� %� Í %/�/"'*u���'*'o7"�.2p�" %��34�!#�2p)'*�%�})n&5&�J"#�&�3$#.9�',2p'*)()�'H�(������)X� � �.&(�pI*'�&�J"#�&6&�J%'� %z"34C"'H��)���w�)cz"7"7"����&57"#�&n&('H�( %)5��'*2!#�&(� � ',&(�4&�J%' %z"3�C�'H�5��w�&��(#��! %�! %uR7"#�&(&('H�( %),#.��'�)�3$#�2}2i���� � �������,+! #" � -5��� � 3�!��0 2� 3� �1��� 4'2 -� 3�! �4 -��z"��'*��7�'H���!34'H .&n)$wi���4I*�� %)n&��(z%I*&(�! %u�)�z"7"7"����&�D � '*I*&(���� %'*&Fs����n9.)$3$#.9�'1z%)('1��w�&�s�� /"���B'H��'H 8&/Z#�&�#8C"#�)('*)�wi���xC%�}&�DF3$#.7"7"'*/�/"�pu.�p&~��'*I*�.u� %�p&(�p�� LK�#R)�3$#�2}25#. %/v#R2!#.��u.'o/Z#�&�#.C"#�)('.�<>5J%'o)c3$#�2p2

Ì �

References

Page 47: A quick introduction to Statistical Learning Theory and Support Vector Machines

7 7 4 8 0 1 4

8 7 4 8 7 3 7¹l�pu�z"��'�·"� % �"#.3$7%2p'*)5��w67"#�&(&('H�n %)5s5�p&�J�2!#.C"'*2p),wy����3 &�J%'4M-O ¨ �.)n&�#�2BO%'H� � �}I*'x/"�pu.�p&-/Z#�&c#.C"#�)('.��� %'5�p)5#�M,O ¨ �.)(&�#�2"O%'H� � �pI*'�/Z#�&�#.C"#�)n'�&cJ"#�&6I*�� 8&�#��! %)�»"K4.¸.¸,&c�(#��! %�! %u~7"#�&(&('H�( %)6#. %/ Í Kª¸8¸.¸,&('*)(&7"#�&(&n'H�( %)H��>@J%'-��'*)(�.2!z%&(�p�� x��wL&�J%'�/Z#�&�#.C"#�)('��}) Ì · � Ì ·,7%�p��'*2p)HK"#. %/4)(��3~'�&��"7%�pIH#�2�'*�"#.3$7%2p'*)�#.��')�J%��s- ��! �¹l�}u%z"��'�·��Æ�� �&cJ%�})�/Z#�&c#.C"#�)(',s�'���'H7�����&�'*�"7"'H���!34'H 8&�#�2B��'*)('H#8��I�J�s5�p&�J�7"�.2p�" %��34�!#�2p)��w � #.���}�%z%)5/"'*u���'*'.�>@J%'62!#.��u.'l/Z#�&�#.C"#�)('�I*�� %)(�p)(&()T��wZ·.¸"Kª¸8¸.¸�&��(#��! %�! %u�#8 %/ Ì ¸"K|¸.¸.¸�&('*)(&�7"#�&(&('H�( %)HK�#. %/x�})l# �.¸XD �.¸34�p�%&�z"��'���wB&�J%' � �yO">��,&��(#��: %�! %uo#8 %/o&('*)n&�)n'*&()H�l>5J%'���'*)(�.2!z%&(�p�� ���wT&�J%'*)('�7"#�&(&('H�( %)��p) Í � � Í ��%�}'*2p/"�! %u]#. R�! "7"z%&x/"�:3~'H %)(�p�� "#�2p�p&F�r��w�» ��"Z�,�� <&�J%�p)�/Z#�&�#.C"#�)n'�s�'�J"# � 'x�� %2p� I*�% %)(&��(z%I*&('*/�#".&�J~/"'*u���'*'�7"�.2p�" %��34�!#�2LI*2!#�)()(�|{Z'H�*�6>@J%'�7�'H��w��%�(3$#. %I*'5��w &cJ%�})�I*2!#�)()n�ª{Z'H���p)�I*��3$7"#.��'*/o&(���8&�J%'H�&���7�'*)5��wl2p'H#.�( %�! %uq3�#�I�J%�! %'*),&�J"#�&5&n����9�7"#.��&5�! 1#�C"'H %I*J"3$#.�(9�)(&cz%/"� º "X¼F��� �#�2p25�%z"�$'*�"7"'H���!34'H 8&()R&('H �)n'H7"#.�(#�&(�%��)HK��� %'Rwi���4'H#�I*J�I*2!#�)()HK5#8��'1I*�� %)n&��(z%I*&('*/B� % #�I*JJ8��7�'H��Dy)�z"��wy#�I*'�3$#.9�'�z%)(',��wB&�J%'-)c#.34',/"�.&�7"���%/Zz%I*&5#. %/R7"��'cDF7"���%I*'*)()(�! %u$��w�&cJ%'5/Z#�&�#"�l��2!#�)n)(�|D{ZIH#�&(�p�� ���wÆ#. �z" "9� %��s- R7"#�&n&('H�( %)��p)5/"�� %'�#�I*I*����/"�! %u$&(�4&�J%'�3�#����!34z"3 ��z%&�7"z%&���w�&cJ%'*)('�&('H I*2!#�)()(�|{Z'H��)X�� [��¤[ � � Å� ¡�V�� À ¡��ZUHS4£�!U���� ��,¥LS*U*W � L¡�V � �eY�¡ � W"U*WZQ¤WZS*¡>5J%'�M,O ¨ �.)(&�#�2 O%'H� � �pI*' � #�&�#.C"#�)('�J"#�)-C�'*'H 1��'*I*�%��/"'*/�w�����3 #�I*&�z"#�2B3$#��p2�7%�p'*I*'*)�#. %/ ��'*)�z%2p&()wy����3�&cJ%�})4/Z#�&c#.C"#�)('oJ"# � 'RC"'*'H ���'H7�����&('*/vC.�r)(' � 'H�(#�2���'*)('H#.��I�J%'H��)X�1�y �>�#8C%2}' Ì s�'$2p�})n&$&�J%'7"'H�Fw����n3$#. %I*'5��w � #.���}�%z%)lI*2:#�)()(�|{Z'H��)�I*�.2p2p'*I*&('*/$wy����3�7"z"C%2p�pIH#�&(�p�� %)5#. %/��és, �'*�"7"'H���:3~'H .&()X�6>5J%'��'*)�z%2p&���w�J�z"3$#8 $7"'H��wi���(3�#. %I*'5s5#�)6��'H7�����&('*/oC8���®��Ç�����342p'*�o? % �.O"#�I*9.�! %u.'H�@º �X¼F�T>5J%'���'*)�z%2p&s5�p&�J���=�ET>�sG#�)$IH#.�(���p'*/���z%&oC8� � #.���%2 ¨ ��'*u.�!C"�� �#. %/ ¶ �}I*J"#�'*2 � ��E��p2p'*��#�&RÇ�'*2p2�AB#8C%)HK¶ z"�(�(#X� � �p2p2iK � �Z�">5J%'���'*)�z%2p&()5��w¤� "Z� ��#8 %/o&�J%'�C�'*)(& Í Dy2!#X�.'H�� %'Hz"�(#�2N %'*&�s����(9R¿es@�}&cJ$��7%&n�:3�#�2 %z"3�C�'H�5��w�J%�}/"/"'H rz" %�}&n) à s�'H��'x��C%&�#��! %'*/<)�7"'*I*�!#�2p2p��wi����&�J%�p)�7"#.7"'H��C8�1�������! " "#�������&n'*)-#. %/Ç�'H�( "#.��/�O%I*J%��'*2!9��%78w���'*)�7"'*I*&(� � '*2p�.�×>5J%']��'*)�z%2p&1s5�p&�J�#�)�7�'*I*�!#�2�7"z"�(7��.)('� %'Hz"�n#�2� %'*&�s��%�(9#.��I*J%�p&('*I*&�z"��'xs5�p&�J��42!#X�.'H��)HK"AL' � '*& Ì K®s5#�)���C%&�#��! %'*/�C.����®AL'H�5z" ]gXb �����@º � ¼F��� r&�J%'R'*�"7"'H���!34'H 8&()$s@�}&cJ]&�J%'1M,O ¨ �.)(&�#�26O%'H� � �pI*' � #�&�#.C"#�)('�s�'Rz%)('*/�7"��'cDF7"����I*'*)()n�: %u¿eI*'H 8&('H���! %uZK"/"'cDy)n2:#8 .&(�! %uR#. %/R)�34�%�.&�J%�! %u à &(�4�: %I*�%�(7"���n#�&('�9% %�és@2}'*/"u8'�#.C���z%&�&�J%'��! � #.���!#. %I*'*)� T ßcáeâªè(ê�ß(÷ :êéüiáeâªá#1�áiä +!ènú��*áißnê �Xß(ú �Xü�ßnê � ( ä�ùiçXêéèn÷ªè �cà ���cå.ä�ù�âªß(÷ZÞTß�áiß %éßcüiä6ïXõ

Ì �

References

Page 48: A quick introduction to Statistical Learning Theory and Support Vector Machines

��2!#�)n)(�|{Z'H� �(#Xs�'H�(���%�*K��� z"3$#. R7�'H��w��%�(3$#. %I*' Í � �� '*I*�p)(�p�� �&���'*'.KZ��=�ET> Ì »� '*I*�p)(�p�� �&���'*'.KZ�6"®� � Ì ·Ç�'*)(& Í 2:#X�.'H�@ %'Hz"�(#�2� %'*&�s����(9 ·"�ª·O"7�'*I*�:#�2Æ#.��I�J%�p&('*I*&�z"��' �42!#H�8'H�5 %'*&�s��%�(9 �"� Ì>T#.C%2p' Ì � ¨ 'H��wi���(3$#. %I*',��w � #.���}�%z%)-I*2!#�)()n�ª{Z'H��)�I*�.2p2}'*I*&n'*/1wy����3�7"z"C%2}�pIH#�&n�}�% %)�#. %/R��s- �'*��7�'H���|D34'H 8&()H�l¹®������'cwi'H��'H %I*'*)-)n'*'�&('*��&X�/"'*u���'*'x��w �(#Xs )cz"7"7"����& /"�:3~'H %)(�p�� "#�2p�p&F�<��w7��.2p�� %�%34�!#�2 'H�(�����*K�� � '*I*&n����) w�'H#�&cz"��'�)�7"#�I*'Ì Ì Í �|¸ Í ¸8¸ Í �.·Í "®�ª» Ì Í » ��.¸.¸.¸ "®� " Ì " � � Ì � Ì ¸ ?" "®�4 Ì · � � Ì � Ì ¸ �

� "®�4 Ì » � � Ì � Ì ¸ +F;· "®� Í Ì � � � Ì � Ì ¸ + �» "®�4 Ì � ¸ � Ì � Ì ¸ + ?>T#.C%2p' Í �BE�'*)cz%2}&n)l��C%&�#��! %'*/�wi���L/"�8&�7"���%/Zz%I*&()B��wZ7"�.2p�" %��34�!#�2p)l��w � #.���}�%z%)�/"'*u���'*'.�Æ>@J%'� %z"3�C�'H��()�z"7"7�����& � '*I*&(����)$�4�p)�#$34'H#. � #�2!z%'�7"'H�@I*2!#�)()(�|{Z'H�*���w�&�J%'o7"����C%2p'H3 #�&�J"#. %/B�o>@J%'$'��B'*I*&���w�)�3~���.&cJ%�: %u<��w�&cJ%�})x/Z#�&�#.C"#�)('$#�)�#17"��'cDF7"����I*'*)()n�: %uwi���5)cz"7"7"����&�D � '*I*&(���, %'*&�s��%�(9.),s5#�)5�! � '*)(&(�pu�#�&n'*/ �! vº�*¼��-¹®���@��z"�5'*�"7"'H���:3~'H .&()xs�'�I*J%�.)('4&�J%')�34�%�.&�J%�! %u$9�'H�( %'*2B#�)�#���#.z%)()n�:#8 $s5�p&�J�)(&�#8 %/Z#.��/$/"' � �!#�&(�p�� � Ê ¸ Õ »����! R#�u���'*'H34'H .&�s5�p&�J<º4X¼F��� x&�J%'6'*�"7"'H���!34'H 8&()�s5�p&�Jx&�J%�p)l/Z#�&�#.C"#�)n'6s�'6I*�� %)n&��(z%I*&('*/$7"�82}�" %��34�!#�2"�! %/"�pIH#�&(�%�¤w�z" %I*&(�p�� %)C"#�)('*/x�� �/"�8&�DF7"���%/Zz%I*&()���w�&�J%'æw��%�(3�¿ � à �B>5J%'��! "7"z%&l/"�:3~'H %)(�p�� "#�2p�p&F�4s5#�) Í �.·"KH#. %/x&�J%'6�%��/"'H���w�&cJ%'�7"�.2p�" %��34�!#�2��(#8 %u.'*/�w�����3 Ì &(�$»"�l>�#8C%2}' Í /"'*)(IH���!C"'*),&�J%'���'*)�z%2p&()@��wÆ&cJ%'-'*�"7"'H���!34'H 8&()H�>5J%'�&��n#��! %�: %u�/Z#�&�#�#.��'� %�.&52p�! %'H#.��2p�<)('H7"#.�(#.C%2p'.�� �.&(�pI*'.K,&�J"#�&�&�J%'� %z"34C"'H����w�)�z"7"7�����& � '*I*&n����)R�! %IH��'H#�)(' � 'H����)n2}��s52p�.� >5J%'�»�/"'*u���'*'7"�82}�" %��34�!#�2lJ"#�)5�% %2}� .¸���34����',)�z"7"7"�%��& � '*I*&(����)�&�J"#. �&cJ%'&.��/R/"'*u���'*'�7��.2p�� %�%34�!#�2 � #. %/' � 'H 2}'*)n)R&�J"#. �&�J%'�{N��)(&�/"'*u���'*' 7��.2p�� %�%34�!#�2i� >@J%'1/"�!34'H %)(�p�� "#�2p�p&�����w�&cJ%'Rwi'H#�&�z"��'<)�7"#�I*'wi����#R»�/"'*u���'*'o7��.2p�� %�%34�!#�26�p)�J%��s�' � 'H� Ì ¸ + � &(�!34'*)x2!#.��u.'H�,&�J"#8 �&�J%'4/"�!34'H %)(�p�� "#�2p�p&�����w�&�J%'wi'H#�&�z"��'5)c7"#�I*'�wi���l#&. %/�/"'*u���'*'�7"�.2p�" %��34�!#�2ZI*2!#�)n)(�|{Z'H�*� � �.&('@&�J"#�&67"'H�Fw����n3$#. %I*'-#�2!34�8)(&6/"�%'*) %�.&�I�J"#8 %u.'�s5�p&�Jv�! %IH��'H#�)n�: %u�/"�!34'H %)(�p�� "#�2p�p&�����w�&�J%'<)�7"#�I*' � �: %/"�pIH#�&n�: %u� %��� � 'H�FDi{Z&(&(�! %u7"����C%2p'H34)X�


Figure 7: Labeled examples of errors on the training set for the 2nd degree polynomial support-vector classifier (labels: 4 4 8 5).

The relatively high number of support vectors for the linear separator is due to non-separability: the number 200 includes both support vectors and training vectors with a non-zero ξ-value. If ξ > 1 the training vector is misclassified; the number of mis-classifications on the training set averages to 34 per classifier for the linear case. For a 2nd degree classifier the total number of mis-classifications on the training set is down to 4. These 4 patterns are shown in Figure 7.

It is remarkable that in all our experiments the bound for the generalization ability holds when we consider the number of obtained support vectors instead of the expectation value of this number. In all cases the upper bound on the error probability for the single classifier does not exceed 3% (on the test data the actual error does not exceed 1.5% for the single classifier).

The training time for the construction of polynomial classifiers does not depend on the degree of the polynomial, only on the number of support vectors. Even in the worst case it is faster than the best performing neural network, constructed specially for the task, LeNet 1 [9]. The performance of this neural network is 5.1% raw error. Polynomials with degree 2 or higher outperform LeNet 1.

Experiments with the NIST Database

The NIST database was used for benchmark studies conducted over just 2 weeks. The limited time frame enabled only the construction of one type of classifier, for which we chose a 4th degree polynomial with no pre-processing. Our choice was based on our experience with the US Postal database.

Table 3 lists the number of support vectors for each of the 10 classifiers and gives the performance of the classifiers on the training and test sets. Notice that even polynomials of degree 4 (that have more than 10^8 free parameters) commit errors on this training set. The average frequency of training errors is 0.02%, about 12 per class. The 14 misclassified test patterns for classifier 1 are shown in Figure 8. Notice again how the upper bound holds for the obtained number of support vectors.

The combined performance of the ten classifiers on the test set is 1.1% error. This result should be compared to that of the other classifiers participating in the benchmark study.


Table 3: Results obtained for a 4th degree polynomial on the NIST database: for each of the ten classifiers (Cl. 0 - Cl. 9) the number of support patterns and the number of errors on the training and on the test set. The size of the training set is 60,000, and the size of the test set is 10,000 patterns.

Figure 8: The 14 misclassified test patterns, with labels, for classifier 1 (labels: 1 6 1 9 6 6 1 9 1 1 1 1 1 1). Patterns with label "1" are false negative; patterns with other labels are false positive.


Figure 9: Results from the benchmark study. Test errors: linear classifier 8.4%, k = 3-nearest neighbor 2.4%, LeNet 1 1.7%, LeNet 4 1.1%, SVN 1.1%.

These other classifiers include a linear classifier, a k = 3-nearest neighbor classifier with 60,000 prototypes, and two neural networks specially constructed for digit recognition (LeNet 1 and LeNet 4). The authors only contributed results for support-vector networks. The results of the benchmark are given in Figure 9.

We conclude this section by citing the paper [4] describing the results of the benchmark:

"For quite a long time LeNet 1 was considered state of the art. ... Through a series of experiments in architecture, combined with an analysis of the characteristics of recognition error, LeNet 4 was crafted."

"The support-vector network has excellent accuracy, which is most remarkable, because unlike the other high performance classifiers, it does not include knowledge about the geometry of the problem. In fact the classifier would do as well if the image pixels were encrypted, e.g. by a fixed, random permutation."

The last remark suggests that further improvement of the performance of the support-vector network can be expected from the construction of functions for the dot-product K(u, v) that reflect a priori information about the problem at hand.
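The quoted remark about encrypted pixels can be made concrete: a support-vector classifier built purely from dot products is unaffected by any fixed permutation of the input coordinates, because u · v is invariant under permuting both vectors identically. A short check (an illustration, not code from the benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 784))     # two "images" as 784-dimensional vectors
perm = rng.permutation(784)          # a fixed, random pixel permutation

k = lambda a, b: (a @ b + 1.0) ** 4  # 4th degree polynomial dot product
print(np.isclose(k(u, v), k(u[perm], v[perm])))  # True: the Gram matrix, and hence
# the trained support-vector classifier, is unchanged by the "encryption".
```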

Conclusion

This paper introduces the support-vector network as a new learning machine for two-group classification problems.


The support-vector network combines three ideas: the solution technique from optimal hyperplanes (which allows for an expansion of the solution vector on support vectors), the idea of convolution of the dot-product (which extends the solution surfaces from linear to non-linear), and the notion of soft margins (to allow for errors on the training set). The algorithm has been tested and compared to the performance of other classical algorithms. Despite the simplicity of the design of its decision surface, the new algorithm exhibits a very fine performance in the comparison study. Other characteristics, like capacity control and ease of changing the implemented decision surface, render the support-vector network an extremely powerful and universal learning machine.


Appendix: Constructing Separating Hyperplanes

In this appendix we derive both the method for constructing optimal hyperplanes and soft margin hyperplanes.

A.1 Optimal Hyperplane Algorithm

It was shown in Section 2 that to construct the optimal hyperplane

    w_0 · x + b_0 = 0,                                                      (40)

which separates a set of training data

    (y_1, x_1), ..., (y_ℓ, x_ℓ),

one has to minimize a functional

    Φ = w · w,

subject to the constraints

    y_i (x_i · w + b) ≥ 1,   i = 1, ..., ℓ.                                  (41)

To do this we use a standard optimization technique. We construct a Lagrangian

    L(w, b, Λ) = ½ w · w − Σ_{i=1}^{ℓ} α_i [ y_i (x_i · w + b) − 1 ],        (42)

where Λ^T = (α_1, ..., α_ℓ) is the vector of non-negative Lagrange multipliers corresponding to the constraints (41).

It is known that the solution to the optimization problem is determined by the saddle point of this Lagrangian in the (2ℓ + 1)-dimensional space of w, Λ, and b, where the minimum should be taken with respect to the parameters w and b, and the maximum should be taken with respect to the Lagrange multipliers Λ.

At the point of the minimum (with respect to w and b) one obtains

    ∂L(w, b, Λ)/∂w |_{w = w_0} = w_0 − Σ_{i=1}^{ℓ} α_i y_i x_i = 0,          (43)

    ∂L(w, b, Λ)/∂b |_{b = b_0} = Σ_{i=1}^{ℓ} α_i y_i = 0.                    (44)

From equality (43) we derive

    w_0 = Σ_{i=1}^{ℓ} α_i y_i x_i,                                           (45)

which expresses that the optimal hyperplane solution can be written as a linear combination of training vectors. Note that only training vectors x_i with α_i > 0 have an effective contribution to the sum (45).

Substituting (45) and (44) into (42) we obtain

    W(Λ) = Σ_{i=1}^{ℓ} α_i − ½ w_0 · w_0                                     (46)

         = Σ_{i=1}^{ℓ} α_i − ½ Σ_{i=1}^{ℓ} Σ_{j=1}^{ℓ} α_i α_j y_i y_j x_i · x_j.   (47)

In vector notation this can be rewritten as

    W(Λ) = Λ^T 1 − ½ Λ^T D Λ,                                                (48)

where 1 is an ℓ-dimensional unit vector and D is a symmetric ℓ x ℓ matrix with elements

    D_ij = y_i y_j x_i · x_j.

To find the desired saddle point it remains to locate the maximum of (48) under the constraints (from (44))

    Λ^T Y = 0,

where Y^T = (y_1, ..., y_ℓ), and

    Λ ≥ 0.

The Kuhn-Tucker theorem plays an important part in the theory of optimization. According to this theorem, at our saddle point in w_0, b_0, Λ_0, any Lagrange multiplier α_i^0 and its corresponding constraint are connected by an equality

    α_i^0 [ y_i (x_i · w_0 + b_0) − 1 ] = 0,   i = 1, ..., ℓ.

From this equality it follows that non-zero values α_i^0 are only achieved in the cases where

    y_i (x_i · w_0 + b_0) − 1 = 0.

In other words: α_i^0 ≠ 0 only for the cases where the inequality (41) is met as an equality. We call vectors x_i for which

    y_i (x_i · w_0 + b_0) = 1

support vectors. Note that in this terminology equation (45) states that the solution vector w_0 can be expanded on support vectors.

Another observation, based on the Kuhn-Tucker equations (43) and (44) for the optimal solution, is the relationship between the maximal value W(Λ_0) and the separation distance ρ_0:

    w_0 · w_0 = Σ_i α_i^0 y_i x_i · w_0 = Σ_i α_i^0 (1 − y_i b_0) = Σ_i α_i^0.

Substituting this equality into the expression (46) for W(Λ_0) we obtain

    W(Λ_0) = Σ_i α_i^0 − ½ w_0 · w_0 = (w_0 · w_0) / 2.

Taking into account the expression for the margin from Section 2 we obtain

    W(Λ_0) = 2 / ρ_0²,

where ρ_0 is the margin for the optimal hyperplane.

A.2 Soft Margin Hyperplane Algorithm

Below we first consider the case F(u) = u^k. Then we describe the general result for a monotonic convex function F(u).

To construct a soft margin separating hyperplane we minimize the functional

    Φ = ½ w · w + C ( Σ_{i=1}^{ℓ} ξ_i )^k,   k > 1,

under the constraints

    y_i (x_i · w + b) ≥ 1 − ξ_i,   i = 1, ..., ℓ,                            (49)

    ξ_i ≥ 0,   i = 1, ..., ℓ.                                                (50)

The Lagrange functional for this problem is

    L(w, ξ, b, Λ, R) = ½ w · w + C ( Σ_i ξ_i )^k
                       − Σ_i α_i [ y_i (x_i · w + b) − 1 + ξ_i ] − Σ_i r_i ξ_i,   (51)

where the non-negative multipliers Λ^T = (α_1, ..., α_ℓ) arise from the constraints (49), and the multipliers R^T = (r_1, ..., r_ℓ) enforce the constraints (50).

We have to find the saddle point of this functional (the minimum with respect to the variables w, b, and ξ_i, and the maximum with respect to the variables α_i and r_i). Let us use the conditions for the minimum of this functional at the extremum point:

    ∂L/∂w |_{w = w_0} = w_0 − Σ_{i=1}^{ℓ} α_i y_i x_i = 0,                   (52)

    ∂L/∂b |_{b = b_0} = Σ_{i=1}^{ℓ} α_i y_i = 0,                             (53)

    ∂L/∂ξ_i |_{ξ = ξ^0} = kC ( Σ_{j=1}^{ℓ} ξ_j^0 )^{k−1} − α_i − r_i = 0.    (54)

If we denote

    δ = kC ( Σ_{j=1}^{ℓ} ξ_j^0 )^{k−1},                                      (55)

we can rewrite equation (54) as

    δ − α_i − r_i = 0.                                                        (56)

From the equalities (52)-(56) we find

    w_0 = Σ_{i=1}^{ℓ} α_i y_i x_i,   Σ_{i=1}^{ℓ} α_i y_i = 0,                (57)

    δ = α_i + r_i.                                                            (58)

Substituting the expressions for w_0, ξ_i, and r_i into the Lagrange functional (51) we obtain

    W(Λ, δ) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i · x_j
              − δ^{k/(k−1)} / (kC)^{1/(k−1)} · (1 − 1/k).                     (59)

To find the soft margin hyperplane solution one has to maximize the functional (59) under the constraints (56)-(58) with respect to the non-negative variables α_i, δ, i = 1, ..., ℓ. In vector notation (59) can be rewritten as

    W(Λ, δ) = Λ^T 1 − [ ½ Λ^T D Λ + δ^{k/(k−1)} / (kC)^{1/(k−1)} · (1 − 1/k) ],   (60)

where 1 and D are as defined above. To find the desired saddle point one therefore has to find the maximum of (60) under the constraints

    Λ^T Y = 0,                                                                (61)

    Λ + R = δ 1,                                                              (62)

    Λ ≥ 0,                                                                    (63)

    R ≥ 0.                                                                    (64)

From (62) and (64) one obtains that the vector Λ should satisfy the conditions

    0 ≤ α_i ≤ δ.                                                              (65)

From conditions (62) and (64) one can also conclude that to maximize (60)

    δ = α_max = max(α_1, ..., α_ℓ).

Substituting this value of δ into (60) we obtain

    W(Λ) = Λ^T 1 − [ ½ Λ^T D Λ + α_max^{k/(k−1)} / (kC)^{1/(k−1)} · (1 − 1/k) ].   (66)

To find the soft margin hyperplane one can therefore either find the maximum of the quadratic form (66) under the constraints (61) and (65), or one has to find the maximum of the convex function (60) under the constraints (61)-(64). For the experiments reported in this paper we used k = 2 and solved the corresponding quadratic programming problem.

For the case F(u) = u the same technique brings us to the problem of solving the following quadratic optimization problem: maximize the functional

    W(Λ) = Λ^T 1 − ½ Λ^T D Λ,

under the constraints

    0 ≤ α_i ≤ C   and   Λ^T Y = 0.

The general solution for the case of a monotone convex function F(u) can also be obtained from this technique. The soft margin hyperplane has the form

    w = Σ_{i=1}^{ℓ} α_i^0 y_i x_i,

where Λ_0^T = (α_1^0, ..., α_ℓ^0) is the solution of the following dual convex programming problem: maximize the functional

    W(Λ) = Λ^T 1 − [ ½ Λ^T D Λ + α_max f^{−1}(α_max / C) − C F( f^{−1}(α_max / C) ) ],

under the constraints

    Λ^T Y = 0,
    Λ ≥ 0,

where we denote f(u) = F'(u). For convex monotone functions F(u) with F(0) = 0 the following inequality is valid:

    u F'(u) ≥ F(u).

Therefore the second term in square brackets is positive and goes to infinity when α_max goes to infinity.

Finally, we can consider the hyperplane that minimizes the form

    ½ γ w · w + Σ_{i=1}^{ℓ} ξ_i²

subject to the constraints (49)-(50), where the second term minimizes the least square value of the errors. This leads to the following quadratic programming problem: maximize the functional

    W(Λ) = Λ^T 1 − ½ [ Λ^T D Λ + (1/γ) Λ^T Λ ]                                (67)

in the non-negative quadrant Λ ≥ 0 subject to the constraint Λ^T Y = 0.
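For the case F(u) = u the dual problem derived above is a box-constrained quadratic programme with a single linear equality constraint, which for small training sets can be handed to a general-purpose solver. The following sketch is only an illustration of that formulation; it uses SciPy's SLSQP routine and is not the solver used for the experiments in the paper:

```python
import numpy as np
from scipy.optimize import minimize

def soft_margin_dual(X, y, C):
    """Maximize W(L) = L.1 - 0.5 L^T D L  subject to  0 <= a_i <= C and L.y = 0,
    with D_ij = y_i y_j x_i.x_j (linear dot product, F(u) = u soft margin)."""
    n = len(y)
    D = (y[:, None] * y[None, :]) * (X @ X.T)

    def neg_W(a):  # minimize the negative dual objective
        return -(a.sum() - 0.5 * a @ D @ a)

    res = minimize(neg_W, np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x
    w = (alpha * y) @ X                      # w_0 = sum_i a_i y_i x_i
    margin_sv = (alpha > 1e-6) & (alpha < C - 1e-6)
    i = int(np.argmax(margin_sv))            # a support vector strictly inside the box
    b = y[i] - w @ X[i]                      # y_i (w.x_i + b) = 1  =>  b = y_i - w.x_i
    return w, b, alpha

# Toy two-class problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
w, b, alpha = soft_margin_dual(X, y, C=10.0)
print(w, b, int((alpha > 1e-6).sum()), "support vectors")
```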

References

[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.

[2] T. W. Anderson and R. R. Bahadur. Classification into two multivariate normal distributions with different covariance matrices. Annals of Mathematical Statistics, 33:420-431, 1962.

[3] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, volume 5, pages 144-152, Pittsburgh, 1992. ACM.

[4] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, E. Säckinger, P. Simard, V. Vapnik, and U. A. Müller. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12th International Conference on Pattern Recognition and Neural Networks, 1994.

[5] J. Bromley and E. Säckinger. Neural-network and k-nearest-neighbor classifiers. Technical Report 11359-910819-16TM, AT&T, 1991.

[6] R. Courant and D. Hilbert. Methods of Mathematical Physics. Interscience, New York, 1953.

[7] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:111-132, 1936.

[8] Y. LeCun. Une procédure d'apprentissage pour réseau à seuil assymétrique. In Cognitiva 85: À la Frontière de l'Intelligence Artificielle, des Sciences de la Connaissance, des Neurosciences, pages 599-604, Paris, 1985.

[9] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. J. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, volume 2, pages 396-404. Morgan Kaufmann, 1990.

[10] D. B. Parker. Learning logic. Technical Report TR-47, Center for Computational Research in Economics and Management Science, Massachusetts Institute of Technology, Cambridge, MA, 1985.

[11] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, New York, 1962.

[12] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by back-propagating errors. Nature, 323:533-536, 1986.

[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In James L. McClelland and David E. Rumelhart, editors, Parallel Distributed Processing, Volume 1, pages 318-362. MIT Press, 1986.

[14] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Addendum 1. Springer-Verlag, New York, 1982.


Support Vector Learning

submitted by
Diplom-Physiker, M.Sc. (Mathematics) Bernhard Schölkopf
of Stuttgart

Dissertation approved by the Department of Computer Science (Fachbereich 13) of the Technische Universität Berlin for the award of the academic degree Doktor der Naturwissenschaften (Dr. rer. nat.)

Doctoral committee:
Chair: Prof. Dr. K. Obermayer
Referee: Prof. Dr. S. Jähnichen
Referee: Prof. Dr. V. Vapnik

Date of the scientific defence: 30 September 1997

Berlin 1997
D 83

REF [4]

The thesis was published by Oldenbourg Verlag, Munich, 1997.


Support Vector Learning
Bernhard Schölkopf
Dissertation for the degree Dr. rer. nat. - summary of the main results

The subject of this work is learning pattern recognition as a statistical problem. A learning machine extracts, from a set of training patterns, structures that allow it to classify new examples. The thesis addresses the following questions:

- Which "features" should be extracted from the individual training patterns? To study this question, a new form of nonlinear principal component analysis ("Kernel PCA") was developed. By using integral operator kernels, a linear principal component analysis can be carried out in feature spaces of very high dimensionality (e.g. in the 10^10-dimensional space of all products of 5 pixels in 16 x 16 images). Viewed in the input space, this leads to nonlinear feature extractors. The algorithm consists in solving an eigenvalue problem, in which the choice of different kernels permits the use of a large class of different nonlinearities.

- Which of the training patterns contain the most information about the decision function to be constructed? This question, like the following one, was investigated using the "Support Vector algorithm" proposed a few years ago by Vapnik, within the statistical paradigm of learning from examples developed by Vapnik and Chervonenkis. Through the choice of different integral operator kernels, this algorithm enables the construction of a class of decision rules which contains neural networks, polynomial classifiers and radial basis function networks as special cases. For images of 3-D object models and of handwritten digits it could be shown that the different decision rules match the best previously known methods in classification accuracy, and that their construction uses only a small subset of the training set (1% - 10% in the examples considered) which is largely independent of the particular choice of kernel.

- How can one best use "a priori" information that is available in addition to the training patterns (for example, information about the invariance of a class of images under translations)? The thesis proposes three methods, all of which lead to clear improvements in classification accuracy. Two of the methods consist in constructing special integral operator kernels adapted to the problem. The third method uses invariance transformations to generate additional artificial training examples from the above-mentioned subset (the "Support Vector set") of all training patterns.

Approved: Prof. Jähnichen
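The Kernel PCA procedure described in the first item reduces to an eigenvalue problem on the kernel (Gram) matrix of the training patterns. The sketch below is an illustration of that formulation, not code from the thesis; it centres the Gram matrix in feature space, solves the eigenvalue problem, and returns projections onto the leading nonlinear components, with a polynomial kernel as one possible integral operator kernel:

```python
import numpy as np

def kernel_pca(X, kernel, n_components):
    """Kernel PCA sketch: eigen-decompose the centred Gram matrix."""
    n = X.shape[0]
    K = np.array([[kernel(a, b) for b in X] for a in X])   # K_ij = k(x_i, x_j)
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one             # centring in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)                  # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[idx], eigvecs[:, idx]
    alpha = alpha / np.sqrt(np.maximum(lam, 1e-12))        # unit-length feature-space axes
    return Kc @ alpha                                      # nonlinear principal components

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
poly2 = lambda a, b: (a @ b + 1.0) ** 2                    # degree-2 polynomial kernel
print(kernel_pca(X, poly2, n_components=2).shape)          # (50, 2)
```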


Foreword

The Support Vector Machine has recently been introduced as a new technique for solving various function estimation problems, including the pattern recognition problem. To develop such a technique, it was necessary to first extract factors responsible for future generalization, to obtain bounds on generalization that depend on these factors, and lastly to develop a technique that constructively minimizes these bounds.

The subject of this book is methods based on combining advanced branches of statistics and functional analysis, developing these theories into practical algorithms that perform better than existing heuristic approaches. The book provides a comprehensive analysis of what can be done using Support Vector Machines, achieving record results in real-life pattern recognition problems. In addition, it proposes a new form of nonlinear Principal Component Analysis using Support Vector kernel techniques, which I consider as the most natural and elegant way for generalization of classical Principal Component Analysis.

In many ways the Support Vector machine became so popular thanks to the works of Bernhard Schölkopf. The work, submitted for the title of Doktor der Naturwissenschaften, appears as excellent. It is a substantial contribution to Machine Learning technology.

Vladimir N. Vapnik, Member of Technical Staff, AT&T Labs Research
Professor, Royal Holloway and Bedford College, London


Preface

What makes Herr Schölkopf's work interesting is not only its technical content, but also the varied and very intensive contacts with international research institutions. They show that the author is able both to present and place his results at the top level of the field, and to develop his results out of the work of the community. This, too, reflects the technical quality of the thesis.

Herr Schölkopf investigates two fundamental problems in the classification of large data sets. The first is the extraction of a few, but relevant, strong features to reduce the flood of information; the second is the description of data examples that are characteristic of a given classification problem. Both problems are investigated by Herr Schölkopf extensively and exhaustively, both theoretically and in experiments. Both the very elegant method of nonlinear feature extraction developed in the thesis (kernel PCA) and the proposed extensions of the Support Vector machine use weak features, and thus depart conceptually from the philosophy of strong features described above. In this sense the thesis reflects something of a paradigm shift in classification and feature extraction.

Herr Schölkopf was a welcome guest at GMD FIRST Berlin during his doctoral work, and it was a pleasure to read and supervise his thesis. I am particularly pleased that Herr Schölkopf will continue his research in his new position at GMD FIRST.

Stefan Jähnichen, Director, GMD FIRST
Professor, Technische Universität Berlin


Contents

Summary                                                  11

1 Introduction and Preliminaries                         15
  1.1 Learning Pattern Recognition                       15
  1.2 Statistical Learning Theory                        21
  1.3 Feature Space Mathematics                          24

2 Support Vector Machines                                33
  2.1 The Support Vector Algorithm                       33
  2.2 Object Recognition Results                         46
  2.3 Digit Recognition Using Different Kernels          56
  2.4 Universality of the Support Vector Set             57
  2.5 Comparison to Classical RBF Networks               61
  2.6 Model Selection                                    69
  2.7 Why Do SV Machines Work Well?                      76

3 Kernel Principal Component Analysis                    79
  3.1 Introduction                                       79
  3.2 Principal Component Analysis in Feature Spaces     80
  3.3 Kernel Principal Component Analysis                83
  3.4 Feature Extraction Experiments                     89
  3.5 Discussion                                         96

4 Prior Knowledge in Support Vector Machines             99
  4.1 Introduction                                       99
  4.2 Incorporating Transformation Invariances          100
  4.3 Image Locality and Local Feature Extractors       109
  4.4 Experimental Results                              110
  4.5 Discussion                                        120

5 Conclusion                                            125

A Object Databases                                      127
B Object Recognition Results                            137
C Handwritten Character Databases                       149
D Technical Addenda                                     153
  D.1 Feature Space and Kernels                         153
  D.2 Kernel Principal Component Analysis               158
  D.3 On the Tangent Covariance Matrix                  161

Bibliography                                            165


AcknowledgementsFirst of all, I would like to express my gratitude to Prof. H. B�ultho�, Prof. S. J�ahnichen,and Prof. V. Vapnik for supervising the present dissertation, and to Prof. K. Ober-mayer for chairing the committee in the \Wissenschaftliche Aussprache". I am gratefulto Vladimir Vapnik for introducing me to the world of statistical learning theory dur-ing numerous extended discussions in his o�ce at AT&T Bell Laboratories. I havedeep respect for the completeness and depth of the body of theoretical work that heand his co-workers have created over the last 30 years. To Heinrich B�ultho�, I amgrateful for introducing me to the world of biological information processing, duringmy work on the Diplom and the doctoral dissertation. He created a unique researchatmosphere in his group at the Max-Planck-Institut f�ur biologische Kybernetik, andprovided excellent facilities without which the present work would not have been pos-sible. I would like to thank Stefan J�ahnichen for his advice, and for hosting me at theGMD Berlin during several research visits. A signi�cant amount of the reported workwas in uenced and carried out during these stays, where I closely collaborated withA. Smola and K.-R. M�uller.Thanks for �nancial support in the form of grants go to the Studienstiftung desdeutschen Volkes and the Max-Planck-Gesellschaft. In addition, it was the Studien-stiftung that made it possible in the �rst place that I got to know Vladimir Vapnik atAT&T in 1994, and that helped in getting A. Smola join the team in 1995.A number of people contributed to this dissertation in one way or another. Let mestart with V. Blanz, C. Burges, M. Franz, D. Herrmann, K.-R. M�uller, and A. Smola,who helped at the very end, in proofreading the manuscript, leading to many improve-ments of the exposition.The work for this thesis was done at several places, and each of the groups that Iwas working in deserves substantial credit. More than half of the time was spent at theMax-Planck-Institut f�ur biologische Kybernetik, and I would like to thank all membersof the group for providing a stimulating interdisciplinary research atmosphere, and forbearing with me when I maltreated their computers at night with my simulations.Special thanks go to the people in the Object Recognition group for a number of livelydiscussions, and to the small group of theoreticians at the MPI, who helped me invarious ways over the years.Almost one year of the time of my thesis work was spend in the adaptive systemsgroup at AT&T and Bell Laboratories. I learnt a lot about machine learning as appliedto real-world problems from all members of this excellent group. In particular, I would9


like to express my thanks to C. Burges, L. Bottou, C. Cortes, and I. Guyon for helpingme understand Support Vectors, and to L. Jackel, Y. LeCun, and C. Nohl for makingmy stays possible in the �rst place. In addition to their scienti�c advice, E. Cosatto,P. Ha�ner, E. S�ackinger, P. Simard and C. Watkins have helped me through theirfriendship. Finally, I want to express my gratitude for the possibility to use codeand databases developed and assembled by these people and their co-workers. Asubstantial part of this thesis would not have been possible without this.During my time in the USA, I also had the opportunity to spend a month atthe Center for Biological and Computational Learning (Massachusetts Institute ofTechnology), hosted by T. Poggio. I would like to thank him, as well as G. Geiger,F. Girosi, P. Niyogi, P. Sinha, and K. Sung, for hospitality and fruitful discussions.At the GMD, I had the possibility to interact with the local connectionists group,which (in addition to those mentioned already) included J. Kohlmorgen, N. Murata,and G. R�atsch. The present work pro�ted a great deal from my stays in Berlin.When starting to do research on one's own, one cannot help but noticing that themore specialized the �eld of work is, the more international and widespread seems to bethe group of people interested in it. Out of the scientists working on machine learningand perception, I want to thank J. Buhmann, S. Canu, A. Gammerman, J. Held,D. Kersten, J. Lemm, D. Leopold, P. Mamassian, G. Roth, S. Solla, F. Wichmann,and A. Yuille for stimulating discussions and advice.Without �rst studying science, it is hard to become a scientist. Studying sciencepredominantly means arguing about scienti�c problems. During my education, I wasin the favourable position to have enough people for scienti�c discussions. With manyof these friends and teachers, there is still contact and exchange of ideas. I wouldlike to thank all of them, and in particular G. Alli, C. Becker, V. Blanz, D. Cor�eld,H. Fischer, M. Franz, D. Henke, D. Herrmann, D. Janzing, U. Kappler, D. K�opf,F. Lutz, A. Rieckers, M. Schramm, and G. Sewell.Finally, without my parents, I would not even have studied anything in the �rstplace. Many thanks to them.References


Summary

Learning how to recognize patterns from examples gives rise to challenging theoretical problems: given a set of observations,

- which of the observations should be used to construct the decision boundary?
- which features should be extracted from each observation?
- how can additional information about the decision function be incorporated in the learning process?

The present work is devoted to the above issues, studying Support Vectors in high-dimensional feature spaces, and Kernel PCA feature extraction.

The material is organized as follows. We start with an introduction to the problem of pattern recognition, to concepts of statistical learning theory, and to feature spaces nonlinearly related to input space (Chapter 1). The paradigm for learning from examples which is studied in this thesis, the Support Vector algorithm, is described in Chapter 2, including empirical results obtained on realistic pattern recognition problems. The latter in particular includes the finding that the set of Support Vectors extracted, i.e. those examples crucial for solving a given task, is largely independent of the type of Support Vector machine used. One specific topic in the development of Support Vector learning, the incorporation of prior knowledge, is studied in some detail in Chapter 4: we describe three methods for improving classifier accuracies by making use of transformation invariances and the local structure of images. Intertwined between these two chapters, we propose a novel method for nonlinear feature extraction (Chapter 3), which works in the same types of feature spaces as Support Vector machines, and which forms the basis of some developments of Chapter 4. Finally, Chapter 5 gives a conclusion. As such, it partly reiterates what has just been said, and the reader who still remembers the present summary when arriving at Chapter 5 may find it amusing to contemplate whether the conclusion coincides with what had been evoked in their mind by the summary that they have just finished reading.


Disclaimer. This thesis was written in an interdisciplinary research environment, and it was supervised by a statistician, a biologist, and a computer scientist. Accordingly, it attempts to be of interest for rather different audiences. If your interests fall into one of these categories exclusively, please bear with me: whenever you encounter a section which you find utterly useless, boring, or incomprehensible, there is the theoretical possibility that it is of interest to somebody else. Accordingly, please feel free to ignore all these parts.

Copyright Notice. Sections 2.4 and 2.6.1 are based on Schölkopf, Burges, and Vapnik (1995), AAAI Press. Section 2.5 is based on Schölkopf, Sung, Burges, Girosi, Niyogi, Poggio, and Vapnik (1996c), IEEE. Chapter 3 is based on Schölkopf, Smola, and Müller (1997b), MIT Press. Section 4.2.1 and figures 2.5 and 4.1 are based on Schölkopf, Burges, and Vapnik (1996a), Springer Verlag. The author reserves for himself the non-exclusive right to republish all other material.


"To see a thing one has to comprehend it. An armchair presupposes the human body, its joints and limbs; a pair of scissors, the act of cutting. What can be said of a lamp or a car? The savage cannot comprehend the missionary's Bible; the passenger does not see the same rigging as the sailors. If we really saw the world, maybe we would understand it."

J. L. Borges, There are more things. In: The Book of Sand, 1979, Penguin, London.


Chapter 1Introduction and PreliminariesThe present work studies visual recognition problems from the point of view of learningtheory. This �rst chapter sets the scene for the main part of the thesis. It gives a briefintroduction of the problem of Learning Pattern Recognition from examples. The twomain contributions of this thesis are motivated in the conceptual part of the chapter.Section 1.1 discusses prior knowledge that might be available in addition to theset of training examples, and introduces the problem of extracting useful featuresfrom individual examples. The technical part of the chapter, Sec. 1.2, gives a concisedescription of some mathematical concepts of Statistical Learning Theory (Vapnik,1995b). This theory describes learning from examples as a problem of limited samplesize statistics and provides the basis for the Support Vector algorithm. Finally,Sec. 1.3 introduces mathematical concepts of feature spaces, which will be of centralimportance to all following chapters.1.1 Learning Pattern RecognitionLet us think of a pattern as an abstraction, de�ned by a collection of possible instancessuch as sample images. When trying to learn how to recognize a pattern, we face theproblem that we will often be unable to see all instances during learning, yet we want tobe able to recognize as many as possible. The extensive notion of a pattern that we justintroduced already suggests a speci�c approach to the problem of pattern recognition:a statistician tries to collect a large number of instances, and use inductive methodsto learn how to recognize them.For an alternative point of view, consider a pattern as something observable whichis generated by an underlying physical entity, as for instance the 2-D views of a 3-Dobject. To recognize a pattern of this nature, a physicist would try to understand thelaws governing the entity, and the mechanisms by which the pattern is brought about.In this process, it may turn out that di�erent observables, or functions thereof,contain di�erent amounts of information for understanding the underlying entity, i.e.it may be the case that from the initial raw observations, we have to extract usefulfeatures ourselves.The current work is located in the intersection of the aspects sketched in the15


16 CHAPTER 1. INTRODUCTION AND PRELIMINARIESabove three paragraphs. It studies an inductive learning algorithm which has beendeveloped in the framework of statistical learning theory, and it tries to enhance it byincorporating prior knowledge about a recognition task at hand. Finally, it studiesthe extraction of features for the purpose of recognition.Even though pattern recognition is not limited to the visual domain, we shall focuson visual recognition. Much of what is said in this thesis, however, would equallyapply to the recognition of acoustic patterns, say.In the remainder of this section, we introduce the terminology which is used indiscussing di�erent aspects of visual recognition problems: these are, in turn, thedata, the tasks, and the methods for recognition.Data. Di�erent types of pattern recognition problems make di�erent types of as-sumptions about the underlying causes generating the patterns. Nevertheless, it ispossible to discuss them in a common framework which we try to describe presently.It draws from machine learning terminology; as such, it will di�er from psychologicalusage of the relevant terms in some respects.1Observers visually perceive views. Sets of views constitute classes. Sometimes,classes have a structure that goes beyond being mere collections of views. For instance,the class of all views of rainbows has the property that if a speci�c view belongs toit, then so do all views which are generated by translating it, parallel to the horizon.Objects are speci�c classes, with a rich class structure, containing for instance all viewtransformations corresponding to rigid 3-D transformations of a speci�c underlyingphysical entity. Some of these transformations are shared by all objects, for instancetranslations; others, like deformations, are object-speci�c.More radically, and fundamentally view-based, we could give up the notion of pri-ority of the underlying physical entities, and think of an object only as a collectionof views, with a speci�c class structure. On a practical level, this is the approachpursued in the current work. The distinction between objects and other classes thenbecomes a distinction between di�erent types of transformation invariances. For in-stance, a rainbow would not be an object, as we cannot possibly see it from above, noteven with a spacecraft. The class of handwritten digits '6' would not be an object forsimilar reasons; in fact, an image plane rotation by 180� would even take us into theclass '9'. As an aside, we note that mathematics and physics have already undergonea paradigm shift away from the notion of objects as \things in the world", towardsstudying their transformation properties. In mathematics, this is exempli�ed in FelixKlein's Erlanger Programm (Klein, 1872) which shifts geometry away from points andlines towards transformation groups; in physics, an example is the modern de�nitionof elementary particles as transformation group representations (e.g. Primas, 1983).Kac and Ulam (1968) refer to this as\[...] the immensely powerful and fruitful idea that much can be learned1The ideas put forward in the following were in uenced by discussions with people in the MPI'sobject recognition group, in particular with V. Blanz.


1.1. LEARNING PATTERN RECOGNITION 17about the structure of certain objects by merely studying their behaviourunder the action of certain groups."Later in the thesis, the reader will encounter methods for improving visual recognitionsystems by taking into account transformation properties of handwritten charactersand 3-D objects (Sec. 4.2).Prior Knowledge. The statistical approach of learning from examples in its pure formneglects the additional knowledge of class structure described above. However, thelatter, referred to as Prior Knowledge, can be of great practical relevance in recognitionproblems.Suppose we were given temporal sequences of detailed observations (including spec-tra) of double star systems, and we would like to predict whether, eventually, one ofthe stars will collapse into a black hole. Given a small set of observations of di�erentdouble star systems, including target values indicating the eventual outcome (sup-posing these were available), a purely statistical approach of learning from exampleswould probably have di�culties extracting the desired dependency. A physicist, onthe other hand, would infer the star masses from the spectra's periodicity and Dopplershifts, and use the theory of general relativity to predict the eventual fate of the stars.Of course, one could argue that the physicist's model of the situation is based ona huge body of prior examples of situations and phenomena which are related to theabove in one way or another. This, however, is exactly how the term prior knowledgeshould be understood in the present context. It does not refer to a Kantian a priori,as prior to all experience, but to what is prior to a given problem of learning fromexamples.What do we do, however, if we do not have a dynamical model of what is happeningbehind the scenes? In this case, which for instance applies whenever the underlyingdynamics is too complicated, the strengths of the purely statistical approach becomeapparent. Let us consider the case of handwritten character recognition. When ahuman writer decides to write the letter 'A', the actual outcome is the result of a seriesof complicated processes, which in their entirety cannot be modelled comprehensively.The intensity of the lines depends on chemical properties of ink and paper, their shapeon the friction between pencil and the paper, on the dynamics of the writer's joints,and on motor programmes initiated in the brain, these in turn are based on what thewriter has learnt at school | the chain could be continued ad in�nitum. Accordingly,nobody tries to recognize characters by completely modelling their generation.However, the lack of a complete dynamical model does not mean that there isno prior knowledge in handwritten digit recognition. For instance, we know thathandwritten digits do not change their class membership if they are translated on apage, or if their line thickness is slightly changed. This type of knowledge can be usedto augment the purely statistical approach (Sec. 4.2). More abstract prior knowledgein many visual recognition tasks includes the fact that the correlations between nearbyimage locations are often more reliable features for recognition than those with largerdistances (Sec. 4.3).


18 CHAPTER 1. INTRODUCTION AND PRELIMINARIESFeatures. Before we proceed to the tasks that can be performed, depending on theavailable data, we need to introduce a concept widely used in both statistics and inthe analysis of human perception. In its general form, a feature detector or featureextractor is a function which assigns a (typically scalar) value to each raw observation.Often, a number of di�erent such functions are applied to the observations in a featureextraction process, leading to a preprocessed vector representation of the data. Thegoal of extracting features is to improve subsequent stages of processing, be it byimproving accuracies in a recognition task, or by reducing storage requirements orprocessing time.The feature functions serving this purpose can either be speci�ed in advance, forinstance in a way such that they incorporate prior knowledge about a problem at hand,or computed from the set of observations. Both approaches, as well as combinationsthereof, shall be addressed in this thesis (Chapter 3, Sec. 4.2.2, Sec. 4.3).The actual term feature is used with di�erent meanings. In vision research andpsychophysics, it is mainly used for the optimal stimulus of a corresponding featuredetector. However, note that given a nonlinear feature detector, it may be practicallyimpossible to determine this optimal stimulus. In statistics, on the other hand, theterm feature mostly refers to the feature values, i.e. the outputs of feature detectors, orto the feature detector itself. Possibly, this ambiguity arose from the fact that, in somecases, the di�erent meanings coincide: in the case where the feature detector consistsof a dot product with a weight vector, as in linear neural network model receptive�elds, the optimal stimulus is aligned with the weight vector, and thus the two can beidenti�ed.Tasks. Suppose we are only given solitary views, and neither nontrivial classes norobjects (which were structured collections of views). Then out of the tasks of dis-crimination, classi�cation, and identi�cation, only discrimination can be carried outon these views. This does, however, not prevent the term discrimination from beingused also in the context of classes and objects. Discrimination, the mere detection ofa di�erence, can be preceded by feature extraction processes; in these cases, resultswill depend on the extraction process used.Classi�cation consists of attributing views to classes, and thus requires the existenceof classes. These can be speci�ed abstractly | by describing features, or Gibsoniana�ordances (\something to sit on"), e.g. | or provided (approximately) by a sampleof training views. This de�nition of classes by training sets is widespread in machinelearning; it will also be the paradigm that we are going to use in this thesis. One talksabout yes-no and old-new classi�cation tasks (one speci�ed class) or naming tasks(several classes). Pattern recognition problems like Handwritten Digit Recognitionare examples of classi�cation.Similarly, identi�cation consists in determining to which object a presented viewbelongs. As objects are special types of classes, we again have the possibility for theabove tasks. Identi�cation makes sense only for objects: for instance, it is meaninglessto ask whether the rainbow we see today is a view of the same (object as a) rainbow


1.1. LEARNING PATTERN RECOGNITION 19we saw last year.In this thesis, we study classi�cation and identi�cation. Often, both of these tasksare referred to as recognition, the term which we shall mostly employ. Indeed, whenclasses are given by training sets, the question whether there is an underlying objectproducing the observed views becomes secondary. It is then only of relevance insofaras it determines the type of prior knowledge available.Human Object Recognition. The position that object recognition is not about re-covering physical 3-D entities, but about learning their views, and potentially also theirtransformation properties, can be supported by biological and psychological evidence.B�ultho� and Edelman (1992) have shown that when recognizing unfamiliar objects,observers exhibit viewpoint e�ects which suggest that they do not recover the 3-Dshape of objects, but rather build a representation based on the actual training views(cf. also Logothetis, Pauls, and Poggio, 1995). They thought of this representationas an interpolation mechanism (cf. Poggio and Girosi, 1990), but one could of courseconceive of more sophisticated mechanisms for combining information contained inthe training views. In the above terminology, one might argue that due to their un-familiarity, the wire frame objects of B�ultho� and Edelman (1992) make it very hardto use the transformations which form the structure of the underlying class of views.Ullman (1996) has put forward a multiple-view variant of his theory of \recognitionby alignment", where objects are recognized by aligning them with stored view tem-plates. The alignment process can make use of certain transformations speci�c to theobject in question. The results of Troje and B�ultho� (1996) have shown that thesetransformations in some cases directly operate on 2-D views, and that they are muchsimpler than transformations using an underlying 3-D model: in experiments probingface recognition under varying poses, observers performed better on views which wereobtained by simply applying a mirror reversal transformation to a previously seenview, rather than by rotating the head in depth to generate the true view of the otherside. Rao and Ballard (1997) recently proposed a model in which the \what" and the\where" pathway (Mishkin, Ungerleider, and Macko, 1983) in the visual system areconceived of as estimating object identity and transformations, respectively. Using acollection of patches taken from natural images, they construct a generative model forthe data which learns, transforms and linearly combines simple basis functions. Theirmodel, however, does not directly make use of the valuable information contained inthe temporal stream of visual data: comparing subsequent images, e.g. by optic owtechniques, would give a more direct means of constructing processing elements en-coding transformations. Indeed, in the dorsal stream (the \where" pathway), neuronshave been found coding for various types of large-�eld transformations (Du�y andWurtz, 1991). Of somewhat related interest are the large-�eld neurons in the y's vi-sual system, coding for speci�c ow �elds which are generated by the y's movementin the environment (Krapp and Hengstenberg, 1996).

Representations and Processes. The above illustrates that the question of how, given a recognition problem, the actual processing can be performed, is intimately related to underlying representations (of classes or objects), computed by some feature extraction process. A representation should satisfy certain constraints in terms of storage cost, computational cost and accuracy.

General classes without structure are not compressible (except for a separate compression of the individual images). Classes with some internal structure can be compressed, hence a smaller representation is possible, which in turn makes generalization to novel views possible (cf. Kolmogorov, 1965; Rissanen, 1978). This is the underlying computational reason for the constructive nature of perception.

If we can generate a class (e.g. an object) from some prototype views using a specified set of transformations, we can represent it as the set of prototypes plus transformations. The more prototypes we store, the less complex are the transformations that we need to remember. In this sense, there is a continuum of different view-based approaches. In principle, further compression is conceivable if we allow for the construction of a suitable underlying representation. E.g., Ullman's approach of storing a 3-D model plus the set of 3-D transformations (Ullman, 1989) is cheap in terms of storage: storing these transformations is almost for free, and storing one 3-D model is reasonably cheap. Constructing this representation, however, may be computationally quite expensive. Reading out and matching 2-D views, on the other hand, is computationally rather cheap if done in parallel neural architecture. The type of representation to be used should thus depend on the task, e.g. on speed constraints. Indeed, proponents of view-based object recognition theories are mainly concerned with fast recognition tasks (Bülthoff and Edelman, 1992). Moreover, the storage cost strongly depends on the task and the type of feature extraction applied to the raw data.

In some cases, we can extract features from views which allow reasonably high recognition accuracies while enabling us to work with much simpler sets of transformations. For instance, if there exists a diagnostic object feature which is visible from all viewpoints, we only need to store the feature (e.g. the colour), the extraction process (which can be thought of as a specific image transformation which needs to be stored), and the fact that it may occur anywhere in the view (i.e. the set of all image plane translations).

Clearly, the set of features which are extracted from views influences all further processing. Applied to our setting, constructing a feature representation consists of two parts: the features have to be extracted from a possibly large set of views, and the transformations which connect features belonging to views of the same class have to be computed. This may require a trade-off: for some feature representations, the extraction process is difficult (e.g. using correspondence methods, Beymer and Poggio, 1996; Vetter and Troje, 1997), whereas the computations of transformations might be simple. A similar trade-off exists in utilizing such a representation: for a recognition task done by matching, e.g., we would have to extract features from test views, and transform them to match stored ones.

Put in machine learning language, features should be used which allow solving a given task within specified limits on training time, testing speed, error rate, and memory requirements.

Implementations. So far, not much has been said about actual implementations of recognition systems. The present work focuses on algorithmic questions rather than on questions of implementation, both with respect to the computational side and with respect to the biological side of the recognition problem. The former normally need not be justified: in statistics, scientific studies of mere algorithms, without discussion of implementation details, are abundant. In biology, which is the main focus of interest in the group where much of the present work was carried out, the type of abstraction presented here is much less common. Indeed, the relevance of this thesis to biological pattern recognition is on the level of statistical properties of problems and algorithms, not more, and not less. In our hope that this type of theoretical work should be of interest to people studying the brain, we concur with Barlow (1995):

"If artificial neural nets, designed to imitate cognitive functions of the brain, are truly performing tasks that are best formulated in statistical terms, then is this not likely also to be true of cognitive function in general? The idea that the brain is an accomplished statistical decision-making organ agrees well with notions to be sketched in the last section of this [Barlow's, the author] article."

To study object recognition from a statistical point of view, we shall in the following section briefly review some of the basic concepts and results of statistical learning theory.

1.2 Statistical Learning Theory

Out of the considerable body of theory that has been developed in statistical learning theory by Vapnik and others (e.g. Vapnik and Chervonenkis, 1968, 1974; Vapnik, 1979, 1995a,b), we briefly review a few concepts and results which are necessary in order to be able to appreciate the Support Vector learning algorithm, which will be used in a substantial part of the thesis.[2]

For the case of two-class pattern recognition, the task of learning from examples can be formulated in the following way: we are given a set of functions

\{ f_\alpha : \alpha \in \Lambda \}, \quad f_\alpha : R^N \to \{ \pm 1 \}    (1.1)

and a set of examples, i.e. pairs of patterns x_i and labels y_i,

(x_1, y_1), \ldots, (x_\ell, y_\ell) \in R^N \times \{ \pm 1 \},    (1.2)

[Footnote 2] A high-level summary is given in (Schölkopf, 1996).

each one of them generated from an unknown probability distribution P(x, y) containing the underlying dependency. (Here and below, bold face characters denote vectors.) We want to learn a function f_{\alpha^*} which provides the smallest possible value for the average error committed on independent examples randomly drawn from the same distribution P, called the risk

R(\alpha) = \int \frac{1}{2} |f_\alpha(x) - y| \, dP(x, y).    (1.3)

The problem is that R(\alpha) is unknown, since P(x, y) is unknown. Therefore an induction principle for risk minimization is necessary.

The straightforward approach to minimize the empirical risk

R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} \frac{1}{2} |f_\alpha(x_i) - y_i|    (1.4)

turns out not to guarantee a small actual risk, if the number \ell of training examples is limited. In other words: a small error on the training set does not necessarily imply a high generalization ability (i.e. a small error on an independent test set). This phenomenon is often referred to as overfitting (e.g. Bishop, 1995). To make the most out of a limited amount of data, novel statistical techniques have been developed during the last 30 years. The Structural Risk Minimization principle (Vapnik, 1979) is based on the fact that for the above learning problem, for any \alpha \in \Lambda and \ell > h, with a probability of at least 1 - \eta, the bound

R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \Phi\!\left( \frac{h}{\ell}, \frac{\log(\eta)}{\ell} \right)    (1.5)

holds, where the confidence term \Phi is defined as

\Phi\!\left( \frac{h}{\ell}, \frac{\log(\eta)}{\ell} \right) = \sqrt{ \frac{ h \left( \log\frac{2\ell}{h} + 1 \right) - \log(\eta/4) }{ \ell } }.    (1.6)

The parameter h is called the VC (Vapnik-Chervonenkis) dimension of a set of functions. It describes the capacity of a set of functions. For binary classification, h is the maximal number of points which can be separated into two classes in all possible 2^h ways by using functions of the learning machine; i.e. for each possible separation there exists a function which takes the value 1 on one class and -1 on the other class.

A learning machine can be thought of as a set of functions (that the machine has at its disposal), an induction principle, and an algorithmic procedure for implementing the induction principle on the given set of functions. Often, the term learning machine is used to refer to its set of functions; in this sense, we talk about the capacity or VC-dimension of learning machines.
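The interplay between the empirical risk and the confidence term in (1.5)-(1.6) is easy to explore numerically. The following short Python sketch is my own illustration, not code from the thesis; the symbols h, ell and eta mirror the quantities defined above.

```python
import numpy as np

def vc_confidence(h, ell, eta):
    """Confidence term Phi(h/ell, log(eta)/ell) of (1.6)."""
    return np.sqrt((h * (np.log(2.0 * ell / h) + 1.0) - np.log(eta / 4.0)) / ell)

def guaranteed_risk(emp_risk, h, ell, eta=0.05):
    """Right-hand side of the bound (1.5), holding with probability at least 1 - eta."""
    return emp_risk + vc_confidence(h, ell, eta)

# Same empirical risk, increasing VC-dimension: the bound eventually becomes void (> 1).
for h in (10, 100, 1000, 4000):
    print(h, guaranteed_risk(emp_risk=0.05, h=h, ell=5000))
```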

The bound (1.5), which forms part of the theoretical basis for Support Vector learning, deserves some further explanatory remarks.

Suppose we wanted to learn a "dependency" where P(x, y) = P(x) \cdot P(y), i.e. where the pattern x contains no information about the label y, with uniform P(y). Given a training sample of fixed size, we can then surely come up with a learning machine which achieves zero training error. However, in order to reproduce the random labellings, this machine will necessarily require a VC-dimension which is large compared to the sample size. Thus, the confidence term (1.6), increasing monotonically with h/\ell, will be large, and the bound (1.5) will not support possible hopes that due to the small training error, we should expect a small test error. This makes it understandable how (1.5) can hold independent of assumptions about the underlying distribution P(x, y): it always holds, but it does not always make a nontrivial prediction; a bound on an error rate becomes void if it is larger than the maximum error rate. In order to get nontrivial predictions from (1.5), the function space must be restricted such that the VC-dimension is small enough (in relation to the available amount of data).[3]

According to (1.5), given a fixed number \ell of training examples, one can control the risk by controlling two quantities: R_{\mathrm{emp}}(\alpha) and h(\{ f_\alpha : \alpha \in \Lambda' \}), \Lambda' denoting some subset of the index set \Lambda. The empirical risk depends on the function chosen by the learning machine (i.e. on \alpha), and it can be controlled by picking the right \alpha. The VC-dimension h depends on the set of functions \{ f_\alpha : \alpha \in \Lambda' \} which the learning machine can implement. To control h, one introduces a structure of nested subsets S_n := \{ f_\alpha : \alpha \in \Lambda_n \} of \{ f_\alpha : \alpha \in \Lambda \},

S_1 \subset S_2 \subset \ldots \subset S_n \subset \ldots,    (1.8)

whose VC-dimensions, as a result, satisfy

h_1 \le h_2 \le \ldots \le h_n \le \ldots    (1.9)

For a given set of observations (x_1, y_1), \ldots, (x_\ell, y_\ell), the Structural Risk Minimization principle chooses the function f_{\alpha_n} in the subset \{ f_\alpha : \alpha \in \Lambda_n \} for which the guaranteed risk bound (the right hand side of (1.5)) is minimal (cf. Fig. 1.1).

[Footnote 3] The bound (1.5), formulated in terms of the VC-dimension, is only the last element of a series of tighter bounds which are formulated in terms of other concepts. This is due to the inequalities

H^\Lambda(\ell) \le H^\Lambda_{\mathrm{ann}}(\ell) \le G^\Lambda(\ell) \le h \left( \log\frac{2\ell}{h} + 1 \right), \quad (\ell > h).    (1.7)

The VC-dimension h is probably the most-used and best-known concept in this row. However, the other ones lead to tighter bounds, and also play important roles in the conceptual part of statistical learning theory: the VC-entropy H^\Lambda and the Annealed VC-entropy H^\Lambda_{\mathrm{ann}} are used to formulate conditions for the consistency of the empirical risk minimization principle, and for a fast rate of convergence, respectively. The Growth function G^\Lambda provides both of the above, independently of the underlying probability measure P, i.e. independently of the data. The VC-dimension h, finally, provides a constructive upper bound on the Growth function, which can be used to design learning machines (for details, see Vapnik, 1995b).

[Figure 1.1 is not reproduced; it plots the error against the nested sets S_{n-1} \subset S_n \subset S_{n+1} of the structure (equivalently, against h), with curves for the training error, the confidence term, and the resulting bound on the test error R(f_{\alpha_n}).]

FIGURE 1.1: Graphical depiction of (1.5), for fixed \ell. A learning machine with larger complexity, i.e. a larger set of functions S_n, allows for a smaller training error; a less complex learning machine, with a smaller S_i, has smaller VC-dimension and thus provides a smaller confidence term \Phi (cf. (1.6)). Structural Risk Minimization picks a trade-off in between these two cases by choosing the function of the learning machine f_{\alpha_n} such that the risk bound (1.5) is minimal.

The procedure of selecting the right subset for a given amount of observations is referred to as capacity control.

We conclude this section by noting that analyses in other branches of learning theory have led to similar insights in the trade-off between reducing the training error and limiting model complexity, for instance as described by regularization theory (Tikhonov and Arsenin, 1977), Minimum Description Length (Rissanen, 1978; Kolmogorov, 1965), or the Bias-Variance Dilemma (Geman, Bienenstock, and Doursat, 1992). Haykin (1994); Ripley (1996) give overviews in the context of Neural Networks.

1.3 Feature Space Mathematics

The present section summarizes some mathematical preliminaries which are essential for both Support Vector machines (Chapter 2) and nonlinear Kernel Principal Component Analysis (Chapter 3).

1.3.1 Product Features

Suppose we are given patterns x \in R^N where most information is contained in the d-th order products (monomials) of entries x_j of x,

x_{j_1} \cdot \ldots \cdot x_{j_d},    (1.10)

where j_1, \ldots, j_d \in \{1, \ldots, N\}. In that case, we might prefer to extract these product features first, and work in the feature space F of all products of d entries. In visual recognition problems, e.g., this would amount to extracting features which are products of individual pixels.

For instance, in R^2, we can collect all monomial feature extractors of degree 2 in the nonlinear map

\Phi : R^2 \to F = R^3    (1.11)
(x_1, x_2) \mapsto (x_1^2, x_2^2, x_1 x_2).    (1.12)

This approach works fine for small toy examples, but it fails for realistically sized problems: for N-dimensional input patterns, there exist

N_F = \frac{(N + d - 1)!}{d! \, (N - 1)!}    (1.13)

different monomials (1.10), comprising a feature space F of dimensionality N_F. Already 16 x 16 pixel input images and a monomial degree d = 5 yield a dimensionality of 10^10.

In certain cases described below, there exists, however, a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into them: by means of nonlinear kernels in input space R^N. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimensionality.

The following section describes how dot products in polynomial feature spaces can be computed efficiently, followed by a section which discusses more general feature spaces.
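As a quick check of the combinatorics behind (1.13), the following sketch (my own illustration, not part of the thesis) counts the monomials for the two examples mentioned above.

```python
from math import comb

def monomial_feature_dim(N, d):
    """Number of distinct degree-d monomials in N variables, Eq. (1.13):
    N_F = (N + d - 1)! / (d! (N - 1)!) = C(N + d - 1, d)."""
    return comb(N + d - 1, d)

print(monomial_feature_dim(2, 2))     # 3, matching the map (1.11)-(1.12)
print(monomial_feature_dim(256, 5))   # roughly 10^10, for 16x16 pixel images and d = 5
```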

1.3.2 Polynomial Feature Spaces Induced by Kernels

In order to compute dot products of the form (\Phi(x) \cdot \Phi(y)), we employ kernel representations of the form

k(x, y) = (\Phi(x) \cdot \Phi(y)),    (1.14)

which allow us to compute the value of the dot product in F without having to carry out the map \Phi. This method was used by Boser, Guyon, and Vapnik (1992) to extend the Generalized Portrait hyperplane classifier of Vapnik and Chervonenkis (1974) to nonlinear Support Vector machines (Sec. 2.1). Aizerman, Braverman, and Rozonoer (1964) call F the linearization space, and use it in the context of the potential function classification method to express the dot product between elements of F in terms of elements of the input space. They also consider the possibility of choosing k a priori, without being directly concerned with the corresponding mapping \Phi into F. A specific choice of k might then correspond to a dot product between patterns mapped with a suitable \Phi.

What does k look like for the case of polynomial features? We start by giving an example (Vapnik, 1995b) for N = d = 2. For the map

C_2 : (x_1, x_2) \mapsto (x_1^2, x_2^2, x_1 x_2, x_2 x_1),    (1.15)

dot products in F take the form

(C_2(x) \cdot C_2(y)) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (x \cdot y)^2,    (1.16)

i.e. the desired kernel k is simply the square of the dot product in input space. Boser, Guyon, and Vapnik (1992) note that the same works for arbitrary N, d: as a straightforward generalization of a result proved in the context of polynomial approximation (Poggio, 1975, Lemma 2.1), we have:

Proposition 1.3.1 Define C_d to map x \in R^N to the vector C_d(x) whose entries are all possible d-th degree ordered products of the entries of x. Then the corresponding kernel computing the dot product of vectors mapped by C_d is

k(x, y) = (C_d(x) \cdot C_d(y)) = (x \cdot y)^d.    (1.17)

Proof. We directly compute

(C_d(x) \cdot C_d(y)) = \sum_{j_1, \ldots, j_d = 1}^{N} x_{j_1} \cdot \ldots \cdot x_{j_d} \cdot y_{j_1} \cdot \ldots \cdot y_{j_d}    (1.18)
= \left( \sum_{j=1}^{N} x_j \cdot y_j \right)^d = (x \cdot y)^d.    (1.19)
□

Instead of ordered products, we can use unordered ones to obtain a map \Phi_d which yields the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in C_d by scaling the respective monomial entries of \Phi_d with the square roots of their numbers of occurrence. Then, by this definition of \Phi_d, and (1.17),

(\Phi_d(x) \cdot \Phi_d(y)) = (C_d(x) \cdot C_d(y)) = (x \cdot y)^d.    (1.20)

For instance, if n of the j_i in (1.10) are equal, and the remaining ones are different, then the coefficient in the corresponding component of \Phi_d is \sqrt{(d - n + 1)!} (for the general case, cf. Smola, Schölkopf, and Müller, 1997). For \Phi_2, this simply means that (Vapnik, 1995b)

\Phi_2(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2).    (1.21)

If x represents an image with the entries being pixel values, we can use the kernel (x \cdot y)^d to work in the space spanned by products of any d pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern \Phi_d(x). Using kernels of the form (1.17), we take into account higher-order statistics without the combinatorial explosion (cf. (1.13)) of time and memory complexity which goes along already with moderately high N and d.

To conclude this section, note that it is possible to modify (1.17) such that it maps into the space of all monomials up to degree d, defining (Vapnik, 1995b)

k(x, y) = (x \cdot y + 1)^d.    (1.22)
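The identities (1.17) and (1.20)-(1.21) can be verified numerically. The sketch below is my own illustration (function names of my choosing, not code from the thesis): it builds the ordered-product map C_d explicitly for a small example and compares it against the kernel value.

```python
import itertools
import numpy as np

def C_d(x, d):
    """Ordered-product map of Proposition 1.3.1: all d-th degree ordered products."""
    return np.array([np.prod(x[list(j)])
                     for j in itertools.product(range(len(x)), repeat=d)])

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), rng.normal(size=4)

# (C_d(x) . C_d(y)) = (x . y)^d, cf. (1.17).
assert np.isclose(C_d(x, 3) @ C_d(y, 3), (x @ y) ** 3)

# The unordered map (1.21) for N = d = 2 gives the same kernel value, cf. (1.20).
phi2 = lambda v: np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2.0) * v[0] * v[1]])
u, w = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(phi2(u) @ phi2(w), (u @ w) ** 2)
```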

1.3.3 Feature Spaces Induced by Mercer Kernels

The question of which functions k correspond to a dot product in some space F has been discussed by Boser, Guyon, and Vapnik (1992); Vapnik (1995b). To construct a map \Phi induced by a kernel k, i.e. a map \Phi such that k computes the dot product in the space that \Phi maps to, they use Mercer's theorem of functional analysis (Courant and Hilbert, 1953):

Proposition 1.3.2 If k is a continuous symmetric kernel of a positive[4] integral operator K, i.e.

(K f)(y) = \int_C k(x, y) f(x) \, dx    (1.23)

with

\int_{C \times C} k(x, y) f(x) f(y) \, dx \, dy \ge 0    (1.24)

for all f \in L_2(C) (C being a compact subset of R^N), it can be expanded in a uniformly convergent series (on C \times C) in terms of Eigenfunctions \psi_j and positive Eigenvalues \lambda_j,

k(x, y) = \sum_{j=1}^{N_F} \lambda_j \psi_j(x) \psi_j(y),    (1.25)

where N_F \le \infty.

[Footnote 4] When referring to operators, the term positive is always meant in the sense stated here. If we talk about positive definite operators, we will express this explicitly.

Note that although originally proven for the case where C = [a, b], this Proposition also holds true for general compact spaces (Dunford and Schwartz, 1963). For the converse of Proposition 1.3.2, cf. Appendix D.1.

From (1.25), it is straightforward to construct a map \Phi, mapping into a potentially infinite-dimensional l_2 space, which does the job. For instance, we may use

\Phi : x \mapsto (\sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \ldots).    (1.26)

We thus have the following result (Boser, Guyon, and Vapnik, 1992):[5]

Proposition 1.3.3 If k is a continuous kernel of a positive integral operator (conditions as in Proposition 1.3.2), one can construct a mapping \Phi into a space where k acts as a dot product,

(\Phi(x) \cdot \Phi(y)) = k(x, y).    (1.27)

Besides (1.17), Boser, Guyon, and Vapnik (1992) and Vapnik (1995b) suggest the usage of Gaussian radial basis function kernels (Aizerman, Braverman, and Rozonoer, 1964)

k(x, y) = \exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)    (1.28)

and sigmoid kernels

k(x, y) = \tanh(\kappa (x \cdot y) + \Theta).    (1.29)

Note that all these kernels have the convenient property of unitary invariance, i.e. k(x, y) = k(Ux, Uy) if U^\top = U^{-1} (if we consider complex numbers, then U^* instead of U^\top has to be used). The radial basis function kernel additionally is translation invariant.

[Footnote 5] In order to identify k with a dot product in another space, it would be sufficient to have pointwise convergence of (1.25). Uniform convergence lets us make an assertion which goes further: given an accuracy level \varepsilon > 0, there exists an n such that even if the range of \Phi is infinite-dimensional, k can be approximated within accuracy \varepsilon as a dot product in R^n, between images of \Phi_n : x \mapsto (\sqrt{\lambda_1}\, \psi_1(x), \ldots, \sqrt{\lambda_n}\, \psi_n(x)).
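For concreteness, here are the three kernels (1.17), (1.28), (1.29) written out in Python, together with a numerical check of the unitary invariance noted above. This is my own sketch; the parameter values are arbitrary placeholders.

```python
import numpy as np

def k_poly(x, y, d=3):
    return (x @ y) ** d                                        # (1.17)

def k_rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))  # (1.28)

def k_sigmoid(x, y, kappa=1.0, theta=-1.0):
    return np.tanh(kappa * (x @ y) + theta)                    # (1.29)

rng = np.random.default_rng(2)
x, y = rng.normal(size=5), rng.normal(size=5)
U, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # random orthogonal matrix, U^T = U^{-1}

for k in (k_poly, k_rbf, k_sigmoid):
    assert np.isclose(k(x, y), k(U @ x, U @ y))   # unitary invariance
```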

1.3.4 The Connection to Reproducing Kernel Hilbert Spaces

The feature space that \Phi maps into is a reproducing kernel Hilbert space (RKHS). To see this, we follow Wahba (1973) and recall that a RKHS is a Hilbert space of functions f on some set C such that all evaluation functionals f \mapsto f(y) (y \in C) are continuous. In that case, by the Riesz representation theorem (e.g. Reed and Simon, 1980), for each y \in C there exists a unique function of x, call it k(x, y), such that

f(y) = \langle f, k(\cdot, y) \rangle    (1.30)

(here, k(\cdot, y) is the function on C obtained by fixing the second argument of k to y, and \langle \cdot, \cdot \rangle is the dot product of the RKHS). In view of this property, k is called a reproducing kernel.

Note that by (1.30), \langle f, k(\cdot, y) \rangle = 0 for all y implies that f is identically zero. Hence the set of functions \{ k(\cdot, y) : y \in C \} spans the whole RKHS. The dot product on the RKHS thus only needs to be defined on \{ k(\cdot, y) : y \in C \} and can then be extended to the whole RKHS by linearity and continuity. From (1.30), it follows that in particular

\langle k(\cdot, x), k(\cdot, y) \rangle = k(y, x)    (1.31)

for all x, y \in C (this implies that k is symmetric). Note that this means that any reproducing kernel k corresponds to a dot product in another space.

To establish a connection to the dot product in a feature space F, we next assume that k is a Mercer kernel (cf. Proposition 1.3.2). First note that it is possible to construct a dot product such that k becomes a reproducing kernel for a Hilbert space of functions

f(x) = \sum_{i=1}^{\infty} a_i k(x, x_i) = \sum_{i=1}^{\infty} a_i \sum_{j=1}^{N_F} \lambda_j \psi_j(x) \psi_j(x_i).    (1.32)

Using only linearity, which holds for any dot product \langle \cdot, \cdot \rangle, we have

\langle f, k(\cdot, y) \rangle = \sum_{i=1}^{\infty} a_i \sum_{j,n=1}^{N_F} \lambda_j \psi_j(x_i) \langle \psi_j, \psi_n \rangle \lambda_n \psi_n(y).    (1.33)

Since k is a symmetric kernel, the \psi_i (i = 1, \ldots, N_F) can be chosen to be orthogonal with respect to the dot product in L_2(C). Hence it is straightforward to construct a dot product \langle \cdot, \cdot \rangle such that

\langle \psi_j, \psi_n \rangle = \delta_{jn} / \lambda_j    (1.34)

(using the Kronecker symbol \delta_{jn}), in which case (1.33) reduces to the reproducing kernel property (1.30) (using (1.32)).

To write \langle \cdot, \cdot \rangle as a dot product of coordinate vectors, we thus only need to express the functions of the RKHS in the basis (\sqrt{\lambda_n}\, \psi_n)_{n=1,\ldots,N_F}, which is orthonormal with respect to \langle \cdot, \cdot \rangle, i.e.

f(x) = \sum_{n=1}^{N_F} \beta_n \sqrt{\lambda_n}\, \psi_n(x).    (1.35)

To obtain the coordinates \beta_n, we compute, using (1.34),

\beta_n = \langle f, \sqrt{\lambda_n}\, \psi_n \rangle = \left\langle \sum_{i=1}^{\infty} a_i \sum_{j=1}^{N_F} \lambda_j \psi_j(x_i) \psi_j, \; \sqrt{\lambda_n}\, \psi_n \right\rangle = \sqrt{\lambda_n} \sum_{i=1}^{\infty} a_i \psi_n(x_i).    (1.36)

Comparing (1.35) and (1.26), we see that F has the structure of a RKHS in the sense that for f and g given by (1.35) and

g(x) = \sum_{j=1}^{N_F} \gamma_j \sqrt{\lambda_j}\, \psi_j(x),    (1.37)

we have

(\beta \cdot \gamma) = \langle f, g \rangle.    (1.38)

Note, moreover, that due to (1.35), we have f(x) = (\beta \cdot \Phi(x)) in F. Comparing to (1.30), this shows that \Phi(x) is nothing but the coordinate representation of the kernel as a function of one argument (cf. also (1.27)).

To conclude the brief detour into RKHS theory, note that in (1.30), k does not have to be linear in its arguments; however, its action as an evaluation functional in Hilbert space is linear. This is the underlying reason why Mercer kernels compute bilinear dot products in Hilbert spaces: the dot product is obtained by combining two evaluations of a possibly nonlinear function in a suitable Hilbert space.

1.3.5 Kernel Values as Pairwise Similarities

In practice, we are given a finite amount of data x_1, \ldots, x_\ell. The following simple observation shows that even if we do not want to (or are unable to) analyse a given kernel k analytically, we can still compute a map \Phi such that k corresponds to a dot product in the linear span of the \Phi(x_i):

Proposition 1.3.4 Suppose the data x_1, \ldots, x_\ell and the kernel k are such that the matrix

K_{ij} = k(x_i, x_j)    (1.39)

is positive. Then it is possible to construct a map \Phi into a feature space F such that

k(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j)).    (1.40)

Conversely, for a map \Phi into some feature space F, the matrix K_{ij} = (\Phi(x_i) \cdot \Phi(x_j)) is positive.

Proof. Being positive, K can be diagonalized as

K = S D S^\top    (1.41)

with an orthogonal matrix S and a diagonal matrix D with nonnegative entries. Then

k(x_i, x_j) = (S D S^\top)_{ij}    (1.42)
= \sum_{k=1}^{\ell} S_{ik} D_{kk} (S^\top)_{kj}    (1.43)
= \sum_{k=1}^{\ell} S_{ik} D_{kk} S_{jk}    (1.44)
= (s_i \cdot D s_j),    (1.45)

where we have defined the s_i as the rows of S (note that the columns of S would be K's Eigenvectors). Therefore, K is the dot product matrix (or Gram matrix) of the vectors D^{1/2} s_i.[6] Hence the map \Phi, defined on the x_i by

\Phi : x_i \mapsto D^{1/2} s_i,    (1.46)

does the job (cf. (1.40)).

Note that if the x_i are linearly dependent, it will typically not be the case that \Phi can be extended to a linear map.

For the converse, assume an arbitrary \alpha \in R^\ell, and compute

\sum_{i,j=1}^{\ell} \alpha_i \alpha_j K_{ij} = \left( \sum_{i=1}^{\ell} \alpha_i \Phi(x_i) \cdot \sum_{j=1}^{\ell} \alpha_j \Phi(x_j) \right) \ge 0.    (1.47)
□

[Footnote 6] The fact that every positive matrix is the Gram matrix of some set of vectors is well-known in linear algebra (see e.g. Bhatia, 1997, Exercise I.5.10).

In particular, this result implies that given data x_1, \ldots, x_\ell, and a kernel k which gives rise to a positive matrix K, it is always possible to construct a feature space F of dimensionality \le \ell that we are implicitly working in when using kernels.

If we perform an algorithm which requires k to correspond to a dot product in some other space (as for instance the Support Vector algorithm to be described below), it could happen that even though k does not satisfy Mercer's conditions in general, it still gives rise to a positive matrix K for the given training data.

In that case, Proposition 1.3.4 tells us that nothing will go wrong during training when we work with these data. Moreover, if a kernel leads to a matrix K with some small negative Eigenvalues, we can add a small multiple of some positive definite kernel to obtain a positive matrix.[7]

Note, finally, that Proposition 1.3.4 does not require the x_1, \ldots, x_\ell to be elements of a vector space. They could be any set of objects which, for some function k (which could be thought of as a similarity measure for the objects), gives rise to a positive matrix (k(x_i, x_j))_{ij}. Methods based on pairwise distances or similarities have recently attracted attention (Hofmann and Buhmann, 1997). They have the advantage of being applicable also in cases where it is hard to come up with a sensible vector representation of the data (e.g. in text clustering).

[Footnote 7] For instance, for the hyperbolic tangent kernel (1.29), Mercer's conditions have not been verified. It does not satisfy them in general: in a series of experiments with 2-D toy data, we noticed that the dot product matrix K had some negative Eigenvalues, for most choices of \kappa that we investigated (except for large negative values). Nevertheless, this kernel has successfully been used in Support Vector learning (cf. Sec. 2.3). To understand the latter, note that by shifting the kernel (i.e. choosing different values of \Theta), one can approximate the shape of the polynomial kernel (which is known to be positive), as a function of (x \cdot y) (within a certain range), up to a vertical offset. This offset is irrelevant in SV learning: due to (2.15), adding a constant to all elements of the dot product matrix does not change the solution.
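Proposition 1.3.4 is constructive, and its proof translates directly into a few lines of code. The following sketch is my own illustration (the Gaussian kernel and the toy data are arbitrary choices): it eigendecomposes a kernel matrix and recovers feature vectors whose dot products reproduce it.

```python
import numpy as np

def empirical_feature_map(K, tol=1e-10):
    """Given a positive matrix K, return vectors phi_i with (phi_i . phi_j) = K_ij,
    following the diagonalization argument (1.41)-(1.46)."""
    eigvals, S = np.linalg.eigh(K)            # columns of S are K's Eigenvectors
    if eigvals.min() < -tol:
        raise ValueError("K has negative Eigenvalues, i.e. it is not positive")
    return S * np.sqrt(np.clip(eigvals, 0.0, None))   # row i corresponds to D^{1/2} s_i

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 2.0)

Phi = empirical_feature_map(K)
assert np.allclose(Phi @ Phi.T, K)   # k(x_i, x_j) = (Phi(x_i) . Phi(x_j)), cf. (1.40)
```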

Chapter 2
Support Vector Machines

This chapter discusses theoretical and empirical issues related to the Support Vector (SV) algorithm. This algorithm, reviewed in Sec. 2.1, is based on the results of learning theory outlined in Sec. 1.2. Via the use of kernel functions (Sec. 1.3), it gives rise to a number of different types of pattern classifiers (Vapnik and Chervonenkis, 1974; Boser, Guyon, and Vapnik, 1992; Cortes and Vapnik, 1995; Vapnik, 1995b).

The original contribution of the present chapter is largely empirical. Using object and digit recognition tasks, we show that the algorithm allows us to construct high-accuracy polynomial classifiers, radial basis function classifiers, and perceptrons (Sections 2.2 and 2.3), relying on almost identical subsets of the training set, their Support Vector sets (Sec. 2.4). These Support Vector sets are shown to contain all the information necessary to solve a given classification task. To understand the relationship between SV methods and classical techniques, we then describe a study comparing SV machines with Gaussian kernels to classical radial basis function networks, with results favouring the SV approach. Following this, Sec. 2.6 shows that one can utilize the error bounds of learning theory to select values for free parameters in the SV algorithm, as for instance the degree of the polynomial kernel which will perform best on a test set (Schölkopf, Burges, and Vapnik, 1995; Blanz, Schölkopf, Bülthoff, Burges, Vapnik, and Vetter, 1996; Schölkopf, Sung, Burges, Girosi, Niyogi, Poggio, and Vapnik, 1996c). Finally, at the end of the chapter, we summarize various ways of understanding and interpreting the high generalization performance of SV machines (Sec. 2.7).

2.1 The Support Vector Algorithm

As a basis for the material in the following section, we first need to describe the SV algorithm in some detail. The original treatments are due to Vapnik and Chervonenkis (1974), Boser, Guyon, and Vapnik (1992), Guyon, Boser, and Vapnik (1993), Cortes and Vapnik (1995), and Vapnik (1995b).

We describe the SV algorithm in four steps. In Sec. 2.1.1, a structure of decision functions is described which is sufficiently simple to admit the formulation of a bound on their VC-dimension. Based on this result, the optimal margin algorithm minimizes the VC-dimension for this class of decision functions (Sec. 2.1.2). This algorithm is then generalized in two steps in order to obtain SV machines: nonseparable classification problems are dealt with in Sec. 2.1.3, and nonlinear decision functions, retaining the VC-dimension bound, are described in Sec. 2.1.4.

To be able to utilize the results of Sec. 1.3, we shall formulate the algorithm in terms of dot products in some space F. Initially, we think of F as the input space. In Sec. 2.1.4, we will substitute kernels for dot products, in which case F becomes a feature space nonlinearly related to input space.

FIGURE 2.1: A separating hyperplane {z | (w \cdot z) + b = 0}, written in terms of a weight vector w and a threshold b. Note that by multiplying both w and b with the same nonzero constant c, we obtain the same hyperplane, represented in terms of different parameters: {z | (cw \cdot z) + cb = 0}. Fig. 2.2 shows how to eliminate this scaling freedom. (Graphic not reproduced.)

2.1.1 A Structure on the Set of Hyperplanes

Each particular choice of a structure (1.8) gives rise to a learning algorithm, consisting of performing Structural Risk Minimization in the given structure of sets of functions. The SV algorithm is based on a structure on the set of separating hyperplanes.

To describe it, first note that given a dot product space F and a set of pattern vectors z_1, \ldots, z_r \in F, any hyperplane can be written as

\{ z \in F : (w \cdot z) + b = 0 \}.    (2.1)

In this formulation, we still have the freedom to multiply w and b with the same nonzero constant (Fig. 2.1). However, the hyperplane corresponds to a canonical pair (w, b) \in F \times R if we additionally require

\min_{i=1,\ldots,r} |(w \cdot z_i) + b| = 1,    (2.2)

i.e. that the scaling of w and b be such that the point closest to the hyperplane has a distance of 1/\|w\| (Fig. 2.2).[1] Thus, the margin between the two classes, measured perpendicular to the hyperplane, is at least 2/\|w\|.

[Footnote 1] The condition (2.2) still allows two such pairs: given a canonical hyperplane (w, b), another one satisfying (2.2) is given by (-w, -b). However, we do not mind this remaining ambiguity: first, the following Proposition only makes use of \|w\|, which coincides in both cases, and second, these two hyperplanes correspond to different decision functions sgn((w \cdot z) + b).

FIGURE 2.2: By requiring the scaling of w and b to be such that the point(s) closest to the hyperplane satisfy |(w \cdot z_i) + b| = 1, we obtain a canonical form (w, b) of a hyperplane (cf. Fig. 2.1). Note that in this case, the margin, measured perpendicularly to the hyperplane, equals 2/\|w\|, which can be seen by considering two opposite points which precisely satisfy |(w \cdot z_i) + b| = 1: from (w \cdot z_1) + b = +1 and (w \cdot z_2) + b = -1 it follows that (w \cdot (z_1 - z_2)) = 2, hence (w/\|w\| \cdot (z_1 - z_2)) = 2/\|w\|. (Graphic not reproduced.)

The possibility of introducing a structure on the set of hyperplanes is based on the following result (Vapnik, 1995b):

Proposition 2.1.1 Let R be the radius of the smallest ball B_R(a) = \{ z \in F : \|z - a\| < R \} (a \in F) containing the points z_1, \ldots, z_r, and let

f_{w,b} = \mathrm{sgn}\,((w \cdot z) + b)    (2.3)

be canonical hyperplane decision functions defined on these points. Then the set \{ f_{w,b} : \|w\| \le A \} has a VC-dimension h satisfying

h < R^2 A^2 + 1.    (2.4)

Note. Dropping the condition \|w\| \le A leads to a set of functions whose VC-dimension equals N_F + 1, where N_F is the dimensionality of F. Due to \|w\| \le A, we can get VC-dimensions which are much smaller than N_F, enabling us to work in very high dimensional spaces; remember that the risk bound (1.5) does not explicitly depend upon N_F, but on the VC-dimension.
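The canonical form (2.2) and the resulting margin 2/\|w\| can be made concrete with a tiny numerical example; the data and the hyperplane below are hypothetical, purely for illustration.

```python
import numpy as np

def canonical_form(w, b, Z):
    """Rescale (w, b) so that min_i |(w . z_i) + b| = 1, cf. (2.2).
    The rows of Z are the pattern vectors z_i."""
    scale = np.min(np.abs(Z @ w + b))
    return w / scale, b / scale

Z = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
w, b = np.array([3.0, 3.0]), 0.0

w_c, b_c = canonical_form(w, b, Z)
print(np.min(np.abs(Z @ w_c + b_c)))   # 1.0, the canonicality condition (2.2)
print(2.0 / np.linalg.norm(w_c))       # the margin of Fig. 2.2
```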

To make Proposition 2.1.1 intuitively plausible, note that due to the inverse proportionality of margin and \|w\|, (2.4) essentially states that by requiring a large lower bound on the margin (i.e. a small A), we obtain a small VC-dimension. Conversely, by allowing for separations with small margin, we can potentially separate a much larger class of problems (i.e. a larger class of possible labellings of the training data, cf. the definition of the VC-dimension, following (1.6)).

Recalling that (1.5) tells us to keep both the training error and the VC-dimension small in order to achieve high generalization ability, we conclude that hyperplane decision functions should be constructed such that they maximize the margin, and at the same time separate the training data with as few exceptions as possible. Sections 2.1.2 and 2.1.3 will deal with these two issues, respectively.

2.1.2 Optimal Margin Hyperplanes

Suppose we are given a set of examples (z_1, y_1), \ldots, (z_\ell, y_\ell), z_i \in F, y_i \in \{\pm 1\}, and we want to find a decision function f_{w,b} = \mathrm{sgn}\,((w \cdot z) + b) with the property

f_{w,b}(z_i) = y_i, \quad i = 1, \ldots, \ell.    (2.5)

If this function exists (the nonseparable case shall be dealt with in the next section), canonicality (2.2) implies

y_i \cdot ((z_i \cdot w) + b) \ge 1, \quad i = 1, \ldots, \ell.    (2.6)

As an aside, note that out of the two canonical forms of the same hyperplane, (w, b) and (-w, -b), only one will satisfy equations (2.5) and (2.6). The existence of class labels thus allows us to distinguish two orientations of a hyperplane.

Following Proposition 2.1.1, a separating hyperplane which generalizes well can thus be found by minimizing

\tau(w) = \frac{1}{2} \|w\|^2    (2.7)

subject to (2.6). To solve this convex optimization problem, one introduces a Lagrangian

L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{\ell} \alpha_i \left( y_i((z_i \cdot w) + b) - 1 \right)    (2.8)

with multipliers \alpha_i \ge 0. The Lagrangian L has to be maximized with respect to the \alpha_i and minimized with respect to w and b.

The condition that at the saddle point, the derivatives of L with respect to the primal variables must vanish,

\frac{\partial}{\partial b} L(w, b, \alpha) = 0, \quad \frac{\partial}{\partial w} L(w, b, \alpha) = 0,    (2.9)

leads to

\sum_{i=1}^{\ell} \alpha_i y_i = 0    (2.10)

and

w = \sum_{i=1}^{\ell} \alpha_i y_i z_i.    (2.11)

The solution vector thus has an expansion in terms of training examples. Note that although the solution w is unique (due to the strict convexity of (2.7), and the convexity of (2.6)), the coefficients \alpha_i need not be.

According to the Kuhn-Tucker theorem of optimization theory (e.g. Bertsekas, 1995), at the saddle point only those Lagrange multipliers \alpha_i can be nonzero which correspond to constraints (2.6) which are precisely met, i.e.

\alpha_i \cdot \left[ y_i((z_i \cdot w) + b) - 1 \right] = 0, \quad i = 1, \ldots, \ell.    (2.12)

The patterns z_i for which \alpha_i > 0 are called Support Vectors.[2] According to (2.12), they lie exactly at the margin.[3] All remaining examples of the training set are irrelevant: their constraint (2.6) is satisfied automatically, and they do not appear in the expansion (2.11).[4]

[Footnote 2] This terminology is related to corresponding terms in the theory of convex sets, relevant to convex optimization (e.g. Luenberger, 1973; Bertsekas, 1995). Given any boundary point of a convex set C, there always exists a hyperplane separating the point from the interior of the set. This is called a supporting hyperplane. SVs do lie on the boundary of the convex hulls of the two classes, thus they possess supporting hyperplanes. The SV optimal hyperplane is the hyperplane which lies in the middle of the two parallel supporting hyperplanes (of the two classes) with maximum distance. Vice versa, from the optimal hyperplane one can obtain supporting hyperplanes for all SVs of both classes by shifting it by 1/\|w\| in both directions.

[Footnote 3] Note that this implies that the solution (w, b), where b is computed using the fact that y_i((w \cdot z_i) + b) = 1 for SVs, is in canonical form with respect to the training data. (This makes use of the reasonable assumption that the training set contains both positive and negative examples.)

[Footnote 4] In a statistical mechanics framework, Anlauf and Biehl (1989) have put forward a similar argument for the optimal stability perceptron, also computed by constrained optimization.

This leads directly to an upper bound on the generalization ability of optimal margin hyperplanes: suppose we use the leave-one-out method to estimate the expected test error (e.g. Vapnik, 1979). If we leave out a pattern z_{i*} and construct the solution from the remaining patterns, there are several possibilities (cf. (2.6)):

1. y_{i*} \cdot ((z_{i*} \cdot w) + b) > 1, i.e. the pattern is classified correctly and does not lie on the margin. These are patterns that would not have become Support Vectors anyway.

2. y_{i*} \cdot ((z_{i*} \cdot w) + b) = 1, i.e. z_{i*} exactly meets the constraint (2.6). In that case, the solution w does not change, even though the coefficients \alpha_i in the dual formulation of the optimization problem might change: namely, z_{i*} might have become a Support Vector (i.e. \alpha_{i*} > 0) had it been kept in the training set. In that case, the fact that the solution is the same no matter whether z_{i*} is in the training set or not means that z_{i*} can be written as \sum_{\mathrm{SVs}} \alpha_i y_i z_i with \alpha_i \ge 0. Note that this is not equivalent to saying that z_{i*} can be written as some linear combination of the remaining Support Vectors: since the sign of the coefficients in the linear combination is determined by the class of the respective pattern, not any linear combination will do. Strictly speaking, z_{i*} must lie in the cone spanned by the y_i z_i, where the z_i are all Support Vectors.[5]

3. 1 > y_{i*} \cdot ((z_{i*} \cdot w) + b) > 0, i.e. z_{i*} lies within the margin, but still on the correct side of the decision boundary. In that case, the solution looks different from the one obtained with z_{i*} in the training set (for, in that case, z_{i*} would satisfy (2.6) after training), but classification is nevertheless correct.

4. 0 > y_{i*} \cdot ((z_{i*} \cdot w) + b). In that case, z_{i*} will be classified incorrectly.

[Footnote 5] Possible non-uniquenesses of the solution's expansion in terms of SVs are related to zero Eigenvalues of K_{ij} = y_i y_j k(x_i, x_j), cf. Proposition 1.3.4. Note, however, the above caveat on the distinction between linear combinations and linear combinations with coefficients of fixed sign.

Note that the cases 3 and 4 necessarily correspond to examples which would have become SVs if kept in the training set; case 2 potentially includes such cases. However, only case 4 leads to an error in the leave-one-out procedure. Consequently, we have the following result on the generalization error of optimal margin classifiers (Vapnik and Chervonenkis, 1974):[6]

Proposition 2.1.2 The expectation of the number of Support Vectors obtained during training on a training set of size \ell, divided by \ell - 1, is an upper bound on the expected probability of test error.

[Footnote 6] It also holds for the generalized versions of optimal margin classifiers explained in the following sections.

A sharper bound can be formulated by making a further distinction in case 2, between SVs that must occur in the solution, and those that can be expressed in terms of the other SVs (Vapnik and Chervonenkis, 1974).

Substituting the conditions for the extremum, (2.10) and (2.11), into the Lagrangian (2.8), one derives the dual form of the optimization problem: maximize

W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j (z_i \cdot z_j)    (2.13)

subject to the constraints

\alpha_i \ge 0, \quad i = 1, \ldots, \ell,    (2.14)

\sum_{i=1}^{\ell} \alpha_i y_i = 0.    (2.15)
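For small, separable toy problems, the dual (2.13)-(2.15) can be solved with a general-purpose optimizer. The sketch below is my own didactic illustration (not the quadratic programming implementation used later in this chapter); it also recovers w via (2.11) and b from the canonicality of the Support Vectors.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_margin_hyperplane(Z, y):
    """Maximize W(alpha) of (2.13) subject to (2.14)-(2.15); separable data assumed."""
    ell = len(y)
    Q = np.outer(y, y) * (Z @ Z.T)                      # y_i y_j (z_i . z_j)
    neg_W = lambda a: 0.5 * a @ Q @ a - a.sum()         # minimize -W(alpha)
    res = minimize(neg_W, np.zeros(ell), method='SLSQP',
                   bounds=[(0.0, None)] * ell,          # (2.14)
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # (2.15)
    alpha = res.x
    w = (alpha * y) @ Z                                 # expansion (2.11)
    sv = alpha > 1e-6                                   # the Support Vectors
    b = np.mean(y[sv] - Z[sv] @ w)                      # from y_i((w . z_i) + b) = 1
    return w, b, alpha

Z = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-1.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = optimal_margin_hyperplane(Z, y)
print(np.sign(Z @ w + b))   # reproduces the labels y
```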

On substitution of the expansion (2.11) into the decision function (2.3), we obtain an expression which can be evaluated in terms of dot products between the pattern to be classified and the Support Vectors,

f(z) = \mathrm{sgn}\!\left( \sum_{i=1}^{\ell} \alpha_i y_i (z \cdot z_i) + b \right).    (2.16)

It is interesting to note that the solution has a simple physical interpretation (Burges and Schölkopf, 1997). If we assume that each Support Vector z_j exerts a perpendicular force of size \alpha_j and sign y_j on a solid plane sheet lying along the hyperplane (w \cdot z) + b = 0, then the solution satisfies the requirements of mechanical stability. The constraint (2.15) translates into the forces on the sheet summing to zero; and (2.11) implies that the torques z_i \times \alpha_i y_i w / \|w\| also sum to zero. This mechanical analogy illustrates the physical meaning of the term Support Vector.

2.1.3 Soft Margin Hyperplanes

In practice, a separating hyperplane often does not exist. To allow for the possibility of examples violating (2.6), Cortes and Vapnik (1995) introduce slack variables

\xi_i \ge 0, \quad i = 1, \ldots, \ell,    (2.17)

and use relaxed separation constraints (cf. (2.6))

y_i((z_i \cdot w) + b) \ge 1 - \xi_i, \quad i = 1, \ldots, \ell.    (2.18)

The SV approach to minimizing the guaranteed risk bound (1.5) consists of the following: minimize

\tau(w, \xi) = \frac{1}{2} \|w\|^2 + \gamma \sum_{i=1}^{\ell} \xi_i    (2.19)

subject to the constraints (2.17) and (2.18) (cf. (2.7)). Due to (2.4), minimizing the first term is related to minimizing the VC-dimension of the considered class of learning machines, thereby minimizing the second term of the bound (1.5) (it also amounts to maximizing the separation margin, cf. the remark following (2.2), and Fig. 2.2). The term \sum_{i=1}^{\ell} \xi_i, on the other hand, is an upper bound on the number of misclassifications on the training set (cf. (2.18)); this controls the empirical risk term in (1.5).

For a suitable positive constant \gamma, this approach therefore constitutes a practical implementation of Structural Risk Minimization on the given set of functions.[7] Note, however, that \sum_{i=1}^{\ell} \xi_i is significantly larger than the number of errors if many of the \xi_i attain large values, i.e. if the classes to be separated strongly overlap, for instance due to noise. In these cases, there is no guarantee that the hyperplane will generalize well.

[Footnote 7] It slightly deviates from the Structural Risk Minimization (SRM) Principle in that (a) it does not use the bound (1.5), but a related quantity (2.19) which can be minimized efficiently, and (b) the SRM Principle strictly speaking requires the structure of sets of functions to be fixed a priori. For more details, cf. Vapnik (1995b); Shawe-Taylor, Bartlett, Williamson, and Anthony (1996).

As in the separable case (2.11), the solution can be shown to have an expansion

w = \sum_{i=1}^{\ell} \alpha_i y_i z_i,    (2.20)

where nonzero coefficients \alpha_i can only occur if the corresponding example (z_i, y_i) precisely meets the constraint (2.18). The coefficients \alpha_i are found by solving the following quadratic programming problem: maximize

W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j (z_i \cdot z_j)    (2.21)

subject to the constraints

0 \le \alpha_i \le \gamma, \quad i = 1, \ldots, \ell,    (2.22)

\sum_{i=1}^{\ell} \alpha_i y_i = 0.    (2.23)

2.1.4 Nonlinear Support Vector Machines

Although we have already introduced the concept of Support Vectors, one crucial ingredient of SV machines in their full generality is still missing: to allow for much more general decision surfaces, one can first nonlinearly transform a set of input vectors x_1, \ldots, x_\ell into a high-dimensional feature space by a map \Phi : x_i \mapsto z_i and then do a linear separation there.

Note that in all of the above, we made no assumptions on the dimensionality of F. We only required F to be equipped with a dot product. The patterns z_i that we talked about in the previous sections thus need not coincide with the input patterns. They can equally well be the results of mapping the original input patterns x_i into a high-dimensional feature space.

Maximizing the target function (2.21) and evaluating the decision function (2.16) then requires the computation of dot products (\Phi(x) \cdot \Phi(x_i)) in a high-dimensional space. Under Mercer's conditions, given in Proposition 1.3.2, these expensive calculations can be reduced significantly by using a suitable function k such that

(\Phi(x) \cdot \Phi(x_i)) = k(x, x_i),    (2.24)

leading to decision functions of the form

f(x) = \mathrm{sgn}\!\left( \sum_{i=1}^{\ell} y_i \alpha_i \, k(x, x_i) + b \right).    (2.25)

FIGURE 2.3: By mapping the input data (top left) nonlinearly (via \Phi) into a higher-dimensional feature space F (here: R^3), and constructing a separating hyperplane there (bottom left), an SV machine (top right) corresponds to a nonlinear decision surface in input space (here: R^2, bottom right). In the depicted example, \Phi : R^2 \to R^3, (x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2), and the decision function reads f(x) = sgn(w_1 x_1^2 + w_2 x_2^2 + w_3 \sqrt{2}\, x_1 x_2 + b). (Graphic not reproduced.)

Consequently, everything that has been said about the linear case also applies to nonlinear cases obtained by using a suitable kernel k instead of the Euclidean dot product (Fig. 2.3). By using different kernel functions, the SV algorithm can construct a variety of learning machines (Fig. 2.4), some of which coincide with classical architectures:

Polynomial classifiers of degree d:
k(x, x_i) = (x \cdot x_i)^d    (2.26)

Radial basis function classifiers:
k(x, x_i) = \exp\!\left( -\|x - x_i\|^2 / c \right)    (2.27)

Neural networks:
k(x, x_i) = \tanh(\kappa \cdot (x \cdot x_i) + \Theta)    (2.28)

FIGURE 2.4: Architecture of SV machines. The kernel function k is chosen a priori; it determines the type of classifier (e.g. polynomial classifier, radial basis function classifier, or neural network). All other parameters (number of hidden units, weights, threshold b) are found during training by solving a quadratic programming problem. The first layer weights x_i are a subset of the training set (the Support Vectors); the second layer weights \lambda_i = y_i \alpha_i are computed from the Lagrange multipliers, and the machine outputs f(x) = sgn(\sum_i \lambda_i k(x, x_i) + b) (cf. (2.25)). (Graphic not reproduced.)

To find the decision function (2.25), we maximize (cf. (2.21))

W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j k(x_i, x_j)    (2.29)

subject to the constraints (2.22) and (2.23).

Since k is required to satisfy Mercer's conditions, it corresponds to a dot product in another space (2.24); thus K_{ij} := (y_i y_j k(x_i, x_j))_{ij} is a positive matrix, providing us with a problem that can be solved efficiently. To see this, note that (cf. Proposition 1.3.4)

\sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j k(x_i, x_j) = \left( \sum_{i=1}^{\ell} \alpha_i y_i \Phi(x_i) \cdot \sum_{j=1}^{\ell} \alpha_j y_j \Phi(x_j) \right) \ge 0    (2.30)

for all \alpha \in R^\ell.

To compute the threshold b, one takes into account that due to (2.18), for Support Vectors x_j for which \xi_j = 0, we have

\sum_{i=1}^{\ell} y_i \alpha_i \, k(x_j, x_i) + b = y_j.    (2.31)

Thus, the threshold can for instance be obtained by averaging

b = y_j - \sum_{i=1}^{\ell} y_i \alpha_i \, k(x_j, x_i)    (2.32)

over all Support Vectors x_j (i.e. 0 < \alpha_j) with \alpha_j < \gamma.

Figure 2.5 shows how a simple binary toy problem is solved by a Support Vector machine with a radial basis function kernel (2.27).

FIGURE 2.5: Example of a Support Vector classifier found by using a radial basis function kernel k(x, y) = exp(-\|x - y\|^2). Both coordinate axes range from -1 to +1. Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (2.6). Note that the Support Vectors found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task (cf. Sec. 2.5). Grey values code the modulus of the argument \sum_{i=1}^{\ell} y_i \alpha_i \, k(x, x_i) + b of the decision function (2.25). (From Schölkopf, Burges, and Vapnik (1996a); graphic not reproduced.)

2.1.5 SV Regression Estimation

This thesis is primarily concerned with pattern recognition. Nevertheless, we briefly mention the case of SV regression (Vapnik, 1995b; Smola, 1996; Vapnik, Golowich, and Smola, 1997). To estimate a linear regression (Fig. 2.6)

f(z) = (w \cdot z) + b    (2.33)

with precision \varepsilon, one minimizes

\tau(w, \xi, \xi^*) = \frac{1}{2} \|w\|^2 + \gamma \sum_{i=1}^{\ell} (\xi_i + \xi_i^*)    (2.34)

subject to

((w \cdot z_i) + b) - y_i \le \varepsilon + \xi_i    (2.35)
y_i - ((w \cdot z_i) + b) \le \varepsilon + \xi_i^*    (2.36)
\xi_i, \xi_i^* \ge 0    (2.37)

for all i = 1, \ldots, \ell.
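The constraints (2.35)-(2.37) say that only deviations beyond the \varepsilon-tube are penalized in (2.34). A minimal sketch of the implied slack variables, with hypothetical targets and predictions of my own choosing:

```python
import numpy as np

def epsilon_insensitive_slacks(f_values, y, eps):
    """Slacks implied by (2.35)-(2.37): how far each prediction lies outside the eps-tube."""
    xi      = np.maximum(0.0, (f_values - y) - eps)   # cf. (2.35)
    xi_star = np.maximum(0.0, (y - f_values) - eps)   # cf. (2.36)
    return xi, xi_star

y_targets   = np.array([0.0, 1.0, 2.0, 3.0])
predictions = np.array([0.1, 1.4, 1.9, 2.0])
print(epsilon_insensitive_slacks(predictions, y_targets, eps=0.2))
# only the two points outside the tube receive nonzero slack
```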

Generalization to nonlinear regression estimation is carried out using kernel functions, in complete analogy to the case of pattern recognition. A suitable choice of the kernel function then allows the construction of multi-dimensional splines (Vapnik, Golowich, and Smola, 1997).

Different types of loss functions can be utilized to cope with different types of noise in the data (Müller, Smola, Rätsch, Schölkopf, Kohlmorgen, and Vapnik, 1997; Smola and Schölkopf, 1997b).

FIGURE 2.6: In SV regression, a desired accuracy \varepsilon is specified a priori. It is then attempted to fit a tube with radius \varepsilon to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables \xi) is determined by minimizing (2.34). (Graphic not reproduced.)

2.1.6 Multi-Class Classification

To get k-class classifiers, we construct a set of binary classifiers f^1, \ldots, f^k, each trained to separate one class from the rest, and combine them by doing the multi-class classification according to the maximal output before applying the sgn function, i.e. by taking

\mathrm{argmax}_{j=1,\ldots,k}\; g^j(x), \quad \text{where } g^j(x) = \sum_{i=1}^{\ell} y_i \alpha_i^j \, k(x, x_i) + b^j    (2.38)

(note that f^j(x) = sgn(g^j(x)), cf. (2.25)).
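A minimal sketch of the one-versus-rest combination (2.38), with stand-in decision functions g^j of my own making (in a real system they would be the trained SV machines); it also returns the gap to the runner-up, anticipating the confidence measure discussed next.

```python
import numpy as np

def one_vs_rest_predict(g_functions, x):
    """Evaluate all real-valued outputs g^j(x) and pick the argmax, cf. (2.38).
    Also return the gap to the second highest output as a simple confidence measure."""
    outputs = np.array([g(x) for g in g_functions])
    order = np.argsort(outputs)[::-1]
    return order[0], outputs[order[0]] - outputs[order[1]]

# Three hypothetical classifiers; here g^j(x) is simply the j-th coordinate of x.
g_functions = [lambda x, c=c: float(x @ c) for c in np.eye(3)]
label, confidence = one_vs_rest_predict(g_functions, np.array([0.2, 0.9, 0.4]))
print(label, confidence)   # class 1, confidence 0.5
```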

The values g^j(x) can also be used for reject decisions (e.g. Bottou et al., 1994), for instance by considering the difference between the maximum and the second highest value as a measure of confidence in the classification.

In the following sections, we shall report experimental results obtained with the SV algorithm. We used the Support Vector algorithm with standard quadratic programming techniques[8] to construct polynomial, radial basis function and neural network classifiers. This was done by choosing the kernels (2.26), (2.27), (2.28) in the decision function (2.25) and in the function (2.29) to be maximized under the constraints (2.22) and (2.23). We shall start with object recognition experiments (Sec. 2.2), and then move to handwritten digit recognition (Sec. 2.3).

[Footnote 8] An existing implementation at AT&T Bell Labs was used, largely programmed by L. Bottou, C. Burges, and C. Cortes.

FIGURE 2.7: Examples from the entry level (top) and animal (bottom) databases. Left: rendered views of two 3-D models; right: 16 x 16 downsampled images, and four 16 x 16 downsampled edge detection patterns. (Graphic not reproduced.)

2.2 Object Recognition Results

2.2.1 Entry-Level and Animal Recognition

For purposes of psychophysical and computational studies, the object recognition group at the Max-Planck-Institut für biologische Kybernetik has compiled three databases of rendered 3D CAD models. The entry level database (see Appendix A for snapshots and further description) comprises views of 25 3-D object models, which in psychophysical experiments were found to belong to different entry level categories (Liter et al., 1997). Objects tend to get identified by humans first at a particular level of abstraction which is neither the most general nor the most specific, e.g. an object might be identified first as an apple, rather than as a piece of fruit or as a cox orange. For a discussion of this concept, referred to as entry (or basic) level, see (Jolicoeur, Gluck, and Kosslyn, 1984; Rosch, Mervis, Gray, Johnson, and Boyes-Braem, 1976). In subordinate level recognition, on the other hand, finer distinctions between objects sharing the same entry level become relevant, as for instance those between different types of birds contained in the second database, the animal database (Appendix A). It should be noted, however, that the animal database does not pose a purely subordinate level recognition task, since many of its animals are also distinct on the entry level. The third MPI database, containing 25 chairs, however, can be considered a subordinate level database. We will use this one in Sec. 2.2.2.

In order to recognize the objects from all orientations of the upper viewing hemisphere, a fairly complex decision surface in high-dimensional space must be learnt. The objects were realistically rendered and then downsampled. Compared to many real-world databases, these databases should be considered as containing relatively little noise; in particular, they do not contain wrongly labeled patterns.

Under these circumstances, we reasoned that it should be possible to separate the data with zero training error even with moderate classifier complexity, and we decided to determine the value of the constant \gamma (cf. (2.19)) by the following heuristic: out of all values 10^n, with integer n, we chose the smallest one which made the problem separable. On the entry level databases, this led to \gamma = 1000; on the animal databases, to \gamma = 100.

Of both databases, we used 12 variants, obtained by

- choosing one of three database sizes: 25 or 89 (both regularly spaced), or 100 (random, uniformly distributed) views per object;
- choosing either grey-scale images or binarized silhouette images (both in downsampled versions); and
- using just 16 x 16 resolution images, obtained from the original images by downsampling, or additionally four more 16 x 16 patterns, containing downsampled versions of edge detection results obtained from the original images. Note that in the latter case, the resulting 1280-dimensional vectors contain information which is not contained in the 16 x 16 images, since the edge detection, involving a (nonlinear) modulus operation, is done before downsampling (cf. Blanz, Schölkopf, Bülthoff, Burges, Vapnik, and Vetter, 1996).

For more details on the databases, see (Liter et al., 1997), and Appendix A. Example images of the original models, and of the downsampled images and edge detection patterns for the entry level and the animal databases, are given in Fig. 2.7.

We trained polynomial SV machines on these 25-class recognition tasks, and obtained accuracies which in some cases exceeded 99% (see Table 2.1). A few aspects of the results deserve being pointed out:

Performance. The highest recognition rates were obtained using polynomial SV classifiers of degrees around 20; however, we found no pronounced minimum. Generally, all of the higher degrees afforded high accuracies. The regularly spaced 89-view-per-object set led to higher accuracies than the random 100-view-per-object set. This suggests that regular spacing of the views on the viewing sphere corresponds to a useful spacing of the knots (or centers) of the approximating functions in R^N. Edge detection information significantly cuts error rates, in many cases by a factor of two or more. Generally, accuracies were higher for grey-scale images than they were for silhouettes. The differences, however, were not large: high accuracies were also obtained for silhouettes. To understand this, we have to note that the thresholding operation used to produce silhouettes was applied to the original high-resolution images, and not to the downsampled versions. After downsampling, this yields grey-scale images whose grey values do not code grey values in the original image; however, they do still code useful information on the high-resolution object silhouettes.


TABLE 2.1: Object recognition test error rates on different databases of 25 objects, using polynomial SV classifiers of varying degrees. The training sets containing 25 and 89 views per object were regularly spaced; those with 100 views were distributed uniformly. Testing was done on an independent test set of 100 random views per object. All views were taken from the upper viewing hemisphere. For further discussion, see Sec. 2.2.1.

degree:                      1     3     6     9    12    15    20    25
entry level:
  25 grey scale           26.0  17.7  15.4  13.9  13.1  13.0  13.0  14.6
  89 grey scale           14.5   3.4   2.4   2.0   1.8   1.8   1.8   2.1
  100 grey scale          17.1   5.6   4.2   3.5   3.2   2.8   2.4   2.8
  25 silhouettes          27.1  19.6  17.9  16.7  16.2  15.6  15.4  16.3
  89 silhouettes          17.2   4.3   3.3   2.7   2.5   2.2   2.2   2.8
  100 silhouettes         18.2   6.9   5.4   4.8   4.2   4.0   4.0   4.7
entry level with edge detection:
  25 grey scale            9.0   8.0   6.7   5.8   5.5   5.3   4.9   5.6
  89 grey scale            1.9   1.2   0.8   0.7   0.6   0.5   0.4   0.4
  100 grey scale           3.5   2.3   1.8   1.5   1.3   1.1   1.1   1.0
  25 silhouettes           9.4   8.2   7.6   7.0   6.6   6.5   6.1   6.0
  89 silhouettes           2.4   1.7   1.2   0.8   0.6   0.5   0.5   0.4
  100 silhouettes          3.8   3.0   2.6   2.5   2.5   2.4   2.3   2.2
animals:
  25 grey scale           31.6  20.4  15.9  14.8  13.8  13.4  13.0  13.8
  89 grey scale           21.8   5.6   3.2   2.5   2.0   1.7   1.7   2.0
  100 grey scale          24.5   8.8   5.8   5.2   5.0   4.7   4.8   4.4
  25 silhouettes          34.4  22.4  18.2  17.0  16.4  15.6  15.8  16.4
  89 silhouettes          27.0   7.4   3.8   2.8   2.5   2.5   2.2   2.8
  100 silhouettes         29.1  11.0   7.4   6.3   5.8   5.4   5.2   5.7
animals with edge detection:
  25 grey scale           11.8   9.0   7.9   7.2   6.9   6.8   6.4   6.4
  89 grey scale            3.2   1.5   1.1   1.0   0.9   0.9   0.8   0.8
  100 grey scale           4.7   3.3   2.7   2.2   2.2   2.0   2.0   2.0
  25 silhouettes          12.1   9.9   8.8   8.0   7.6   7.5   7.0   7.1
  89 silhouettes           3.7   2.0   1.3   1.2   1.1   1.2   1.1   1.1
  100 silhouettes          5.4   4.0   3.2   3.1   3.0   2.9   2.7   2.6
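The error rates in Table 2.1 come from 25 one-against-the-rest binary SV classifiers whose outputs are combined by taking the strongest response (cf. Sec. 2.1.6). A minimal sketch of that arbitration scheme, using scikit-learn's SVC as a stand-in for the original implementation and assuming preloaded training and test arrays:

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, degree, C):
    """Train one binary polynomial SV classifier per object class."""
    classifiers = {}
    for label in np.unique(y):
        clf = SVC(C=C, kernel="poly", degree=degree, gamma=1.0 / X.shape[1], coef0=0.0)
        clf.fit(X, (y == label).astype(int))   # 1 for the object, 0 for the rest
        classifiers[label] = clf
    return classifiers

def predict_strongest_response(classifiers, X):
    """Arbitrate between the binary recognizers by the strongest real-valued output."""
    labels = sorted(classifiers)
    scores = np.column_stack([classifiers[l].decision_function(X) for l in labels])
    return np.asarray(labels)[np.argmax(scores, axis=1)]

# Hypothetical usage on a held-out test set:
# clfs = train_one_vs_rest(X_train, y_train, degree=20, C=1000.0)
# test_error = np.mean(predict_strongest_response(clfs, X_test) != y_test)
```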


TABLE 2.2: Numbers of SVs for the object recognition systems of Table 2.1, on different databases of 25 objects, using polynomial SV classifiers of varying degrees. The training sets containing 25 and 89 views per object were regularly spaced; the ones with 100 views were distributed uniformly. The numbers of SVs are averages over all 25 binary classifiers separating one object from the rest; they should be seen in relation to the size of the training set, which for the above numbers of views per object was 625, 2225, and 2500, respectively. The given numbers of SVs thus amount to roughly 10% of the database sizes. For the silhouette databases, the numbers (not shown here) are very similar, only slightly bigger.

degree:                   1    3    6    9   12   15   20   25
entry level:
  25 grey scale          86   74   71   70   72   74   79   92
  89 grey scale         219  148  132  128  128  133  144  165
  100 grey scale        206  139  121  117  119  122  135  158
entry level with edge detection:
  25 grey scale          73   74   77   79   84   87   91   99
  89 grey scale         126  119  125  130  137  145  151  161
  100 grey scale        123  115  120  125  129  133  143  153
animals:
  25 grey scale         108   96   89   90   91   95  100  112
  89 grey scale         231  196  180  177  178  183  193  208
  100 grey scale        235  196  176  169  169  174  185  199
animals with edge detection:
  25 grey scale         101   92   93   99  103  107  117  128
  89 grey scale         183  170  172  180  188  198  212  227
  100 grey scale        187  171  172  177  182  191  201  215

Support Vectors. The numbers of SVs (Table 2.2) of the individual recognizers for each object make up about 5%-15% of the whole databases. The fraction decreases with increasing database size.

For polynomial machines of degree 1 (i.e. separating hyperplanes in input space), the problem is not separable. In that case, all training errors show up as SVs (cf. (2.18)), causing a fairly large number of SVs. For degrees higher than 1, the number of SVs slightly increases with increasing polynomial degree. However, the increase is rather moderate, compared with the increase of the dimensionality of the feature space that we are implicitly working in (cf. Sec. 1.3). Interestingly, the number of SVs does


not change much if we add edge detection information, even though this increases the input dimensionality by a factor of 5.

FIGURE 2.8: Angular distribution of the viewing angles of those training views which became SVs, for a polynomial SV machine of degree 20 on the animal (left, n = 4632) and entry level (right, n = 3385) databases (100 grey level views per object, without edge detection). The plotted distributions for azimuth (top) and elevation (bottom) have been normalized by the corresponding distributions in the training set (see Fig. A.1). It can be seen that SVs tend to occur more often for top, front and back views. In this and the following plots, views which become SVs for more than one of the 25 binary recognizers are counted according to their frequency of occurrence. Consequently, there is no contradiction in the overall number of SVs n exceeding the database size (2500).

As each of the training examples is associated with two viewing angles (azimuth and elevation; cf. Appendix A), we can look at the angular distribution of SVs and errors. It is shown in figures 2.8-2.10, and, in more detail, in figures B.2-B.9 in the appendix (there, we also give an example of a full SV set of one of the binary recognizers, in Fig. B.1). The density of SVs is increased at high polar angles, i.e. for viewing the objects from the top. Also, SVs tend to be found more often for frontal and back views than for views closer to the side. Top, frontal and back views typically are harder to classify than views from more generic points of view (Blanz, 1995). We can thus interpret our


finding as an indication that the density of SVs is related to the local complexity of the classification surface, i.e. the local difficulty of the classification task. Indeed, the same qualitative behaviour is found for the distribution of recognition errors (figures 2.9 and 2.10).

There are several factors contributing to the difficulty in classifying top, frontal and back views. First note that since most objects in our databases are bilaterally symmetric, top, frontal and back views contain a large amount of redundancy. In contrast, side views of symmetric objects contain a maximal amount of information. Moreover, many objects in the databases have their main axis of elongation roughly aligned with the direction of the zero view (both viewing angles equal to zero). Consequently, frontal and back views suffer the drawback of showing a projection of a comparably small area.

As an aside, note that although the SVs live in a high-dimensional space, the particular setup of the presented object recognition experiments made it possible to discuss the relationship between the difficulty of the task, the distribution of SVs, and the distribution of errors. This is due to the low-dimensional parametrization of the examples, arising from the procedure of generating the examples by taking snapshots at well-defined viewing positions.

The hope that Support Vectors are a useful means of analysing recognition tasks will receive further support in Sec. 2.4, where we shall present results which show that different types of SV machines, obtained using different kernel functions, lead to largely the same Support Vectors if trained on the same task.

Comparison with Neural Networks. To evaluate the performance of SV classifiers on this task, benchmark comparisons with other classifiers need to be carried out. We conducted a set of experiments using perceptrons with one hidden layer, endowed with 400 hidden neurons, and hyperbolic tangent activation functions in hidden and output neurons. The networks were trained by back-propagation of mean squared error (Rumelhart, Hinton, and Williams, 1986; LeCun, 1985). We used on-line (stochastic) gradient descent, i.e. the weights were updated after each pattern; training was stopped when the training error dropped below 0.1%, or after 600 learning epochs, whichever occurred earlier. Neither this procedure nor the network design was carefully optimized for the task at hand; thus the results reported in the following should be seen as baseline comparisons solely intended to facilitate assessing the reported SV results.[9]

[9] By observing the dependency of the test error on the number of learning epochs, we were able to see that the networks were not overtrained. In addition, experiments with smaller numbers of hidden units gave worse performance (larger networks were not used, for reasons of excessive training times), hence the network capacities did not seem too large. A full-fledged comparison between SV machines and perceptrons would take into account the following issues in order to obtain optimized network designs: instead of one fully connected hidden layer, more sophisticated architectures use several layers with shared weights, extracting features of increasing complexity and invariance, while still limiting the number of free parameters. Other regularization techniques useful for improving generalization include weight decay and pruning. Similarly, early stopping can be used to deal with issues of overtraining. The training procedure can be optimized by using different error functions and output functions (e.g. softmax). Finally, for small databases with little redundancy it is sometimes advantageous to use batch updates with conjugate gradient descent, or to use higher order derivatives of the error function. For details, see LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel (1989); Bishop (1995); Amari, Murata, Müller, Finke, and Yang (1997).
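For concreteness, a rough sketch of such a baseline perceptron (one hidden layer of 400 tanh units, tanh outputs, on-line backpropagation of the mean squared error, and the stopping rule given above) is shown below. PyTorch is used here purely as a stand-in for the original implementation; the target coding (+1 for the correct class, -1 otherwise) and the learning rate are assumptions of this sketch.

```python
import torch
from torch import nn

def train_baseline_mlp(X, Y, n_hidden=400, max_epochs=600, target_error=0.001, lr=0.01):
    """X: (n, d) float tensor; Y: (n, n_classes) targets, assumed coded as +1/-1."""
    net = nn.Sequential(
        nn.Linear(X.shape[1], n_hidden), nn.Tanh(),    # one hidden layer of 400 tanh units
        nn.Linear(n_hidden, Y.shape[1]), nn.Tanh(),    # tanh output units
    )
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(max_epochs):
        for i in torch.randperm(X.shape[0]).tolist():  # on-line updates, one pattern at a time
            opt.zero_grad()
            loss_fn(net(X[i:i + 1]), Y[i:i + 1]).backward()
            opt.step()
        with torch.no_grad():
            train_error = (net(X).argmax(1) != Y.argmax(1)).float().mean().item()
        if train_error < target_error:                 # stop at 0.1% training error or 600 epochs
            break
    return net
```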


FIGURE 2.11: Left: rendered view of a 3-D model from the chair database; right: 16×16 downsampled image, and four 16×16 downsampled edge detection patterns.

For the following two reasons, we chose the small training set, with 25 views per object. First, the error rates reported above for the large sets were already very low, and differences in performance are thus more likely to be significant for the smaller training sets. Second, training times of the neural networks were very long (in the cases reported in the following, they were longer than for SV machines by more than an order of magnitude).

On the 25-view-per-object training sets, we obtained error rates of 17.3% and 21.4% on the entry level and animal databases, respectively. Adding edge detection information, the error rates dropped to 6.8% and 11.2%, respectively. Comparing with the results in Table 2.1, we note that SV machines in almost all cases performed better. Further performance comparisons between SV machines and other classifiers are reported in the following section.

2.2.2 Chair Recognition Benchmark

In a set of experiments using the MPI chair database (Fig. 2.11, Appendix A), different view-based recognition algorithms were compared (Blanz et al., 1996). The SV analysis for this case is less detailed than the one given in Sec. 2.2.1; however, we decided also to report these experiments, since they include further benchmark results obtained with other classifiers. The first one used oriented filters to extract features which are robust with respect to small rigid transformations of the underlying 3-D objects, followed by a decision stage based on comparisons with stored templates (for details, see Blanz, 1995; Vetter, 1994; Blanz et al., 1996). The second one, run as a baseline benchmark, was a perceptron with one hidden layer, trained by error back-propagation to minimize the mean squared error (for further details, see Sec. 2.2.1). The third system was a polynomial Support Vector machine (cf. (2.26)) with degree d = 15 and a regularization constant of 100.[10] In addition, we report results of Kressel (1996), who utilized a fully quadratic polynomial classifier (Schürmann, 1996) trained on the first 50 principal components of the images.

[10] The latter was chosen as in Sec. 2.2.1. Note that these values differ from those used in (Blanz et al., 1996), which in some cases leads to different results.


TABLE 2.3: Recognition test error (in %) for the MPI chair database (Appendix A) on 25 × 100 random test views from the upper viewing hemisphere, for different training sets (viewing angles either regularly spaced, or uniformly distributed, on the upper viewing hemisphere; views were either just 16×16 images, or images plus edge detection data), and different classifiers. SV: Support Vector machine; MLP: fully connected perceptron with one hidden layer of 400 neurons; OF: oriented filter invariant feature extraction, see text; PC: quadratic polynomial classifier trained on the first 50 principal components (Kressel, 1996). Where marked with '-', results are not available.

input          distribution    views per obj.    SV     MLP    OF     PC
images+e.d.    regul. spaced   25                5.0    8.8    5.4    -
images+e.d.    regul. spaced   89                1.0    1.3    4.7    1.7
images+e.d.    random          100               1.4    2.6    -      -
images+e.d.    random          400               0.3    -      -      0.8
images         regul. spaced   25                13.2   25.4   26.0   -
images         regul. spaced   89                2.0    7.2    21.0   -
images         random          100               4.5    7.5    -      -
images         random          400               0.6    -      -      -

In all experiments, the Support Vector machine exhibits the highest generalization ability (Table 2.3). Considering that the images of a single object can change drastically with viewpoint (cf. Appendix A), it seems that the Support Vector machine is best in constructing a decision surface sufficiently complex to separate the 25 classes of chairs. This, in turn, can be related to the fact that SV machines use kernel functions to construct hyperplanes in very high-dimensional feature spaces without overfitting. Note, moreover, that this was achieved with an SV machine which does not utilize prior information about the problem at hand. The oriented filter approach, in contrast, does use prior information about the process by which the images arose from underlying 3-D objects. This knowledge was used to handcraft the robust features used for recognition. The SV machine has to extract all information from the given training data, making it understandable that its advantage over the oriented filter system gets smaller for smaller training set sizes (Table 2.3). In Chapter 4, we try to deal with this shortcoming by proposing methods to incorporate prior knowledge into SV machines.

2.2.3 Discussion

Realistically rendered computer graphics images of objects provide a useful basis for evaluating object recognition algorithms. This setup enabled us to study shape


recognition under controlled conditions. Real-world recognition systems, however, face additional problems. For instance, segmentation of objects in cluttered scenes is a problem not addressed in the above experiments. Partly, these additional problems can be outweighed by additional sources of information. Objects with different albedo and color would facilitate segmentation and recognition significantly.

The impact of noise, characteristic of many real-life problems, should not be too big, at least in the case where we trained our systems on the image data only: in that case, all the processing is done in the low spatial frequency domain.

On all three databases, high recognition accuracies were reported. The highest accuracies were obtained using the regularly spaced 89-view-per-object training sets, and the edge detection data.

As the number of classes was 25 in all cases, we can compare the performance of the SV systems across tasks. It correlates with the intuitive difficulty of the tasks: accuracies are highest for the entry level database, where the objects have the largest differences, followed by the animal database, and by the subordinate level chair database.

2.3 Digit Recognition Using Different Kernels

Handwritten digit recognition has long served as a test bed for evaluating and benchmarking classifiers (e.g. LeCun et al., 1989; Bottou et al., 1994; LeCun et al., 1995). Thus, it is imperative to evaluate the SV method on some widely used digit recognition task. In the present chapter, we use the US Postal Service (USPS) database for this purpose (Appendix C). We put particular emphasis on comparing different types of SV classifiers obtained by choosing different kernels. We report results for polynomial kernels (2.26), radial basis function kernels (2.27), and sigmoid kernels (2.28); all of them were obtained with the regularization constant set to 10 (our default choice, used wherever not stated otherwise; cf. (2.19)).

Results for the three different kernels are summarized in Table 2.4. In all three cases, error rates around 4% can be achieved. They should be compared with values achieved on the same database with a five-layer neural net (LeNet1; LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel, 1989), 5.0%, a neural net with one hidden layer, 5.9%, and the human performance, 2.5% (Bromley and Säckinger, 1991). Results of classical RBF machines, along with further reference results, are quoted in Sec. 2.5.3.

The results show that the Support Vector algorithm allows the construction of various learning machines, all of which are performing well. The similar performance for the three different functions k suggests that among these cases, the choice of the set of decision functions is less important than capacity control in the chosen type of structure. This phenomenon is well known for the Parzen density estimator in R^N,

p(\mathbf{x}) = \frac{1}{\ell} \sum_{i=1}^{\ell} \frac{1}{\omega^N}\, k\!\left( \frac{\mathbf{x} - \mathbf{x}_i}{\omega} \right).    (2.39)

There, it is of great importance to choose an appropriate value of the bandwidth parameter ω for a given amount of data (e.g. Härdle, 1990; Bishop, 1995). Similar parallels can be drawn to the solution of ill-posed problems (for a complete discussion, see Vapnik, 1995b).
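A minimal sketch of such a kernel comparison, using scikit-learn's SVC with the kernel parametrizations of Table 2.4 (the 1/256 scaling corresponds to the 16×16 input dimensionality). The data arrays, the parameter defaults, and the omission of the 1.04 prefactor in the sigmoid case are assumptions of this sketch.

```python
from sklearn.svm import SVC

def kernel_suite(dim=256, d=3, c=0.8, theta=-1.0):
    """Three SV classifiers analogous to the polynomial, RBF and sigmoid machines of Table 2.4."""
    return {
        # polynomial: ((x . y)/dim)^d
        "poly": SVC(C=10.0, kernel="poly", degree=d, gamma=1.0 / dim, coef0=0.0),
        # RBF: exp(-||x - y||^2 / (dim * c))
        "rbf": SVC(C=10.0, kernel="rbf", gamma=1.0 / (dim * c)),
        # sigmoid: tanh(2 (x . y)/dim + theta)  (scikit-learn omits the 1.04 prefactor)
        "sigmoid": SVC(C=10.0, kernel="sigmoid", gamma=2.0 / dim, coef0=theta),
    }

# Hypothetical usage on a USPS-like digit set (X_* of shape (n, 256), y_* digit labels):
# for name, clf in kernel_suite().items():
#     clf.fit(X_train, y_train)
#     print(name, "raw test error:", 1.0 - clf.score(X_test, y_test))
```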


TABLE 2.4: Performance on the USPS set, for three different types of classifiers, constructed with the Support Vector algorithm by choosing different functions k in (2.25) and (2.29). Given are raw errors (i.e. no rejections allowed) on the test set. The normalization factor c = 1.04 in the sigmoid case is chosen such that c · tanh(2) = 1. For each of the ten-class-classifiers, we also show the average number of Support Vectors of the ten two-class-classifiers. The normalization factors of 256 are tailored to the dimensionality of the data, which is 16×16.

polynomial: k(x, y) = ((x · y)/256)^d
  d               1     2     3     4     5     6     7
  raw error/%     8.9   4.7   4.0   4.2   4.5   4.5   4.7
  av. # of SVs    282   237   274   321   374   422   491

RBF: k(x, y) = exp(-||x - y||^2 / (256 c))
  c               4.0   2.0   1.2   0.8   0.5   0.2   0.1
  raw error/%     5.3   5.0   4.9   4.3   4.4   4.4   4.5
  av. # of SVs    266   240   233   235   251   366   722

sigmoid: k(x, y) = 1.04 tanh(2 (x · y)/256 + Θ)
  -Θ              0.8   0.9   1.0   1.1   1.2   1.3   1.4
  raw error/%     6.3   4.8   4.1   4.3   4.3   4.4   4.8
  av. # of SVs    206   242   254   267   278   289   296

2.4 Universality of the Support Vector Set[11]

In the present section, we report empirical evidence that the SV set provides a novel possibility for extracting a small subset of a database which contains all the information necessary to solve a given classification task: using the Support Vector algorithm to train three different types of handwritten digit classifiers, we observed that these types of classifiers construct their decision surface from strongly overlapping yet small subsets of the database.

Overlap of SV Sets. To study the Support Vector sets for three different types of SV classifiers, we used the optimal parameters on the USPS set according to Table 2.4.

[11] Copyright notice: the material in this section is based on the article "Extracting support data for a given task" by B. Schölkopf, C. Burges and V. Vapnik, which appeared in: Proceedings, First International Conference on Knowledge Discovery & Data Mining, pp. 252-257, 1995. AAAI Press.
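The overlap analysis that follows can be reproduced in outline as below: train one SV machine per kernel, read off the training-set indices of its Support Vectors, and compare the index sets. The sketch assumes scikit-learn's SVC (whose support_ attribute holds those indices), reuses the hypothetical kernel_suite from the previous sketch, and takes the data arrays as given.

```python
from itertools import permutations
from sklearn.svm import SVC

def sv_index_sets(X, y, classifiers):
    """Fit each classifier and return {name: set of training indices that became SVs}."""
    sets = {}
    for name, clf in classifiers.items():
        clf.fit(X, y)
        sets[name] = set(clf.support_.tolist())
    return sets

def overlap_table(sv_sets):
    """Percentage of the SV set of `b` contained in the SV set of `a` (cf. Table 2.6)."""
    return {
        (a, b): 100.0 * len(sv_sets[a] & sv_sets[b]) / len(sv_sets[b])
        for a, b in permutations(sv_sets, 2)
    }

# Hypothetical usage with the three kernels of Table 2.4 on a binary digit task:
# sets = sv_index_sets(X_train, y_train, kernel_suite())
# print(overlap_table(sets))
```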


TABLE 2.5: First row: total number of different Support Vectors of three different ten-class-classifiers (i.e. number of elements of the union of the ten two-class-classifier Support Vector sets) obtained by choosing different functions k in (2.25) and (2.29); second row: average number of Support Vectors per two-class-classifier (USPS database size: 7291).

                      Polynomial   RBF    Sigmoid
  total # of SVs      1677         1498   1611
  average # of SVs    274          235    254

TABLE 2.6: Percentage of the Support Vector set of [column] contained in the Support Vector set of [row]; for ten-class classifiers (top) and binary recognizers for digit class 7 (bottom) (USPS set).

  (ten-class)    Polynomial   RBF   Sigmoid
  Polynomial     100          93    94
  RBF            83           100   87
  Sigmoid        90           93    100

  (digit 7)      Polynomial   RBF   Sigmoid
  Polynomial     100          84    93
  RBF            89           100   92
  Sigmoid        93           86    100

TABLE 2.7: Comparison of all three Support Vector sets at a time (USPS set). For each of the (ten-class) classifiers, "% intersection" gives the fraction of its Support Vector set shared with both the other two classifiers. Out of a total of 1834 different Support Vectors, 1355 are shared by all three classifiers; an additional 242 are common to two of the classifiers.

                    Poly   RBF    tanh   intersection   shared by 2   union
  no. of SVs        1677   1498   1611   1355           242           1834
  % intersection    81     90     84     100            -             -

Table 2.5 shows that all three classifiers use around 250 Support Vectors per two-class-classifier (less than 4% of the training set). The total number of different Support Vectors of the ten-class-classifiers is around 1600. The reason why it is less than 2500 (ten times the above 250) is the following: a particular vector that has been used as a positive SV (i.e. y_i = +1 in (2.25)) for digit 7 might at the same time be a negative SV (y_i = -1) for digit 1, say.

Tables 2.6 and 2.7 show that the Support Vector sets of the different classifiers have


TABLE 2.8: SV set overlap experiments on the MNIST set (Fig. C.2), using the binary recognizer for digit 0. Top three tables: performances (on the 60000 element test set) and numbers of SVs for three different kernels and various parameter choices. The numbers of SVs, which should be compared to the database size, 60000, were used to select the parameters for the SV set comparison: to get a balanced comparison of the different SV sets, we decided to select parameter values such that the respective SV sets have approximately equal size (polynomial degree d = 4, radial basis function width c = 0.6, and sigmoid threshold Θ = -1.5). Bottom: SV set comparison. For each of the binary classifiers, "% intersection" gives the fraction of its Support Vector set shared with both the other two classifiers. The scaling factor 784 in the kernels stems from the dimensionality of the data; it ensures that the values of the kernels lie in similar ranges for different polynomial degrees.

polynomial: k(x, y) = ((x · y)/784)^d
  d                  2      3      4      5      6      7
  # of test errors   163    147    135    131    127    127
  # of SVs           994    1083   1187   1292   1401   1537

RBF: k(x, y) = exp(-||x - y||^2 / (784 c))
  c                  1      0.75   0.6    0.5    0.4    0.3
  # of test errors   147    145    145    141    137    134
  # of SVs           1061   1118   1179   1264   1308   1460

sigmoid: k(x, y) = 1.04 tanh(2 (x · y)/784 + Θ)
  -Θ                 1.3    1.4    1.5    1.6    1.7    1.8
  # of test errors   139    138    138    141    145    144
  # of SVs           1137   1162   1194   1211   1223   1217

                    Polyn   RBF    tanh   intersection   shared by 2   union
  no. of SVs        1187    1179   1194   1054           124           1328
  % intersection    89      89     88     100            -             -

about 90% overlap. This surprising result, first published in (Schölkopf, Burges, and Vapnik, 1995), has meanwhile been reproduced on the MNIST character recognition set (Table 2.8), with SV sets which amounted to just 2% of the whole database. Together with K. Sung at MIT, we have reproduced this result also on a face detection task (binary classification, faces vs. non-faces).

As mentioned previously, the Support Vector expansion (2.11) need not be unique. Depending on the way the quadratic programming problem is solved, one can potentially get different expansions and therefore different Support Vector sets. It is possible to conceive of problems where all patterns do lie on the decision boundary, yet only


TABLE 2.9: Percentage of the Support Vector set of [column] contained in the Support Vector set of [row], for the binary recognizers for digit class 7 (USPS set). The training sets for the classifiers in [row] and [column] were permuted with respect to each other (control experiment for Table 2.6); still, the overlap between the SV sets persists.

                Polynomial   RBF   Sigmoid
  Polynomial    92           82    90
  RBF           88           92    84
  Sigmoid       91           84    93

a few of them are necessary at a time for expressing the decision function. In such a case, the actual SV set extracted could strongly depend on the ordering of the training set, especially if the quadratic programming algorithm processes the data in chunks. In our experiments, we did use the same ordering of the training set in all three cases. To exclude the possibility that it is this ordering that causes the reported overlaps, we ran a control experiment where a classifier with the same kernel was trained twice: on the original training set, and on a permuted version of it, respectively. We found that the two cases produced highly overlapping (to around 90%) SV sets, which means that the training set ordering hardly has an effect on the SV sets extracted; it only changes around 10% of the SV sets. In addition, repeating the experiments of Table 2.6 on permuted training sets gave results consistent with this finding: Table 2.9 shows that the overlap between SV sets of different classifiers is hardly changed when one of the training sets is permuted. We may also add that the overlap is not due to SVs corresponding to errors on the training set (cf. (2.18), with slack variables exceeding 1): the considered classifiers had very few training errors.

Using a leave-one-out procedure similar to Proposition 2.1.2, Vapnik and Watkins have subsequently put forward a theoretical argument for shared SVs. We state it in the following form: if the SV sets of three SV classifiers had no overlap, we could obtain a fourth classifier which has zero test error.

To see why this is the case, note that if a pattern is left out of the training set, it will always be classified correctly by voting between the three SV classifiers trained on the remaining examples: otherwise, it would have been an SV of at least two of them, if kept in the training set. The expectation of the number of patterns which are SVs of at least two of the three classifiers, divided by the training set size, thus forms an upper bound on the expected test error of the voting system.

Training on SV Sets. As described in Sec. 2.1.2, the Support Vector set contains all the information a given classifier needs for constructing the decision function. Due to the overlap in the Support Vector sets of different classifiers, one can even train classifiers on Support Vector sets of another classifier. Table 2.10 shows that this leads to results comparable to those after training on the whole database. In Sec. 4.2.1, we will use this finding as a motivation for a method to make SV machines transformation invariant.
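A sketch of the "training on SV sets" experiment: extract the Support Vector set of one kernel's classifier and retrain a classifier with a different kernel on that subset only. scikit-learn's SVC, the data arrays, and the particular kernel parameters in the usage comment are stand-ins, not the original setup.

```python
import numpy as np
from sklearn.svm import SVC

def retrain_on_sv_set(X, y, X_test, y_test, donor, recipient):
    """Train `donor`, keep only its Support Vectors, and fit `recipient` on that subset."""
    donor.fit(X, y)
    sv = donor.support_                               # training-set indices of the donor's SVs
    recipient.fit(X[sv], y[sv])
    full = SVC(**recipient.get_params()).fit(X, y)    # reference: same kernel, full database
    return {
        "sv set size": len(sv),
        "errors (trained on SV set)": int(np.sum(recipient.predict(X_test) != y_test)),
        "errors (trained on full set)": int(np.sum(full.predict(X_test) != y_test)),
    }

# Hypothetical usage, e.g. polynomial SVs feeding an RBF machine (cf. Table 2.10):
# print(retrain_on_sv_set(X_train, y_train, X_test, y_test,
#                         donor=SVC(C=10.0, kernel="poly", degree=3, gamma=1/256),
#                         recipient=SVC(C=10.0, kernel="rbf", gamma=1/(256*0.8))))
```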


TABLE 2.10: Training classifiers on the Support Vector sets of other classifiers leads to performances on the test set which are as good as the results for training on the full database (shown are numbers of errors on the 2007-element test set, for two-class classifiers separating digit 7 from the rest). Additionally, the results for training on a random subset of the database of size 200 are displayed.

  trained on:    poly-SVs   rbf-SVs   tanh-SVs   full db   rnd. subs.
  set size:      178        189       177        7291      200
  Poly           13         13        12         13        23
  RBF            17         13        17         15        27
  tanh           15         13        13         15        25

Discussion. Learning can be viewed as inferring regularities from a set of training examples. Much research has been devoted to the study of various learning algorithms which allow the extraction of these underlying regularities. No matter how different the outward appearance of these algorithms is, they all must rely on intrinsic regularities of the data. If the learning has been successful, these intrinsic regularities are captured in the values of some parameters of a learning machine; for a polynomial classifier, these parameters are the coefficients of a polynomial, for a neural net they are weights and biases, and for a radial basis function classifier they are weights and centers. This variety of different representations of the intrinsic regularities, however, conceals the fact that they all stem from a common root.

The Support Vector algorithm enables us to view these algorithms in a unified theoretical framework. The presented empirical results show that different types of SV classifiers construct their decision functions from highly overlapping subsets of the training set, and thus extract a very similar structure from the observations, which can in this sense be viewed as a characteristic of the data: the set of Support Vectors. This finding may lead to methods for compressing databases significantly by disposing of the data which is not important for the solution of a given task (cf. also Guyon, Matić, and Vapnik, 1996).

In the next section, we will take a closer look at one of the types of learning machines implementable by the SV algorithm.

2.5 Comparison to Classical RBF Networks

By using Gaussian kernels (2.27), the SV algorithm can construct learning machines with a Radial Basis Function (RBF) architecture. In contrast to classical approaches for training RBF networks, the SV algorithm automatically determines centers, weights


and threshold that minimize an upper bound on the expected test error. The present section is devoted to an experimental comparison of these machines with a classical approach, where the centers are determined by k-means clustering and the weights are computed using error backpropagation. We consider three machines, namely a classical RBF machine, an SV machine with Gaussian kernel, and a hybrid system with the centers determined by the SV method and the weights trained by error backpropagation. Our results show that on the US Postal Service database of handwritten digits, the SV machine achieves the highest recognition accuracy, followed by the hybrid system.

Copyright notice: the material in this section is based on the article "Comparing support vector machines with Gaussian kernels to radial basis function classifiers" by B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio and V. Vapnik, which appeared in IEEE Transactions on Signal Processing, 45(11): 2758-2765, November 1997. IEEE.

2.5.1 Different Ways of Training RBF Classifiers

Consider Fig. 2.12. Suppose we want to construct a radial basis function classifier

f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{\ell} w_i \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{c_i} \right) + b \right)    (2.40)

(b and the c_i being constants, the latter positive) separating balls from circles, i.e. taking different values on balls and circles. How do we choose the centers x_i? Two extreme cases are conceivable:

The first approach consists of choosing the centers for the two classes separately, irrespective of the classification task to be solved. The classical technique of finding the centers by some clustering technique (before tackling the classification problem) is such an approach. The weights w_i are then usually found by either error backpropagation (Rumelhart, Hinton, and Williams, 1986) or the pseudo-inverse method (Poggio and Girosi, 1990).

An alternative approach (Fig. 2.13) consists of choosing as centers points which are critical for the classification task at hand. The Support Vector algorithm implements the latter idea. By simply choosing a suitable kernel function (2.27), it allows the construction of radial basis function classifiers. The algorithm automatically computes the number and location of the above centers, the weights w_i, and the threshold b. By the kernel function, the patterns are mapped nonlinearly into a high-dimensional space. There, an optimal separating hyperplane is constructed, expressed in terms of those examples which are closest to the decision boundary. These are the Support Vectors, which correspond to the centers in input space.

The goal of the present section is to compare real-world results obtained with k-means clustering and classical RBF training to those obtained when the centers, weights and threshold are automatically chosen by the Support Vector algorithm. To this end, we decided to undertake a performance study by combining expertise on the Support Vector algorithm (AT&T Bell Laboratories) and on the classical radial basis function networks (Massachusetts Institute of Technology).
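To fix ideas, here is a direct rendering of the decision function (2.40) in NumPy, with centers, weights, widths and bias passed in explicitly; whether these come from clustering or from the SV algorithm is precisely the question discussed in the comparison below.

```python
import numpy as np

def rbf_decision(X, centers, weights, widths, b):
    """Evaluate f(x) = sgn( sum_i w_i * exp(-||x - x_i||^2 / c_i) + b ) for each row of X."""
    # Squared distances between every input and every center: shape (n_samples, n_centers).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    activations = np.exp(-d2 / widths)   # Gaussian bump of width c_i around each center
    return np.sign(activations @ weights + b)
```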


FIGURE 2.12: A simple 2-dimensional classification problem: find a decision function separating balls from circles. The box, as in all following pictures, depicts the region [-1, 1]^2.

Three different RBF systems took part in the performance comparison:

- SV system. A standard SV machine with Gaussian kernel function was constructed (cf. (2.27)).

- Classical RBF system. The MIT side of the performance comparison constructed networks of the form

g(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{K} w_i G_i(\mathbf{x}) + b \right) = \mathrm{sgn}\left( \sum_{i=1}^{K} w_i \frac{1}{(2\pi)^{N/2} \sigma_i^{N}} \exp\left( -\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\sigma_i^2} \right) + b \right),

with the number of centers K identical to the one automatically found by the SV algorithm. The centers c_i were computed by k-means clustering (e.g. Duda and Hart, 1973), and the weights w_i were trained by on-line mean squared error backpropagation. The training procedure constructs ten binary recognizers for the digit classes, with RBF hidden units and logistic outputs, trained to produce the target values


1 and 0 for positive and negative examples, respectively. The networks were trained without weight decay; however, a bootstrap procedure was used to limit their complexity. The final RBF network for each class contains every Gaussian kernel from its target class, but only several kernels from the other 9 classes, selected such that no false positive mistakes are made. For further details, see (Sung, 1996; Moody and Darken, 1989).

- Hybrid system. To assess the relative influence of the automatic SV center choice and the SV weight optimization, respectively, another RBF system was built, constructed with centers that are simply the Support Vectors arising from the SV optimization, and with the weights trained separately using mean squared error backpropagation.

FIGURE 2.13: RBF centers automatically computed by the Support Vector algorithm (indicated by extra circles), using c_i = 1 for all i (cf. (2.27), (2.40)). The number of SV centers accidentally coincides with the number of identifiable clusters (indicated by crosses found by k-means clustering with k = 2 and k = 3 for balls and circles, respectively), but the naive correspondence between clusters and centers is lost; indeed, 3 of the SV centers are circles, and only 2 of them are balls. Note that the SV centers are chosen with respect to the classification task to be solved.

Computational Complexity. By construction, the resulting classifiers after training will have the same architecture and comparable sizes. Thus the three machines are comparable in classification speed and memory requirements.

Differences were, however, noticeable in training. Regarding training time, the SV


machine was faster than the RBF system by about an order of magnitude.[12] The optimization, however, requires working with potentially large matrices. In the implementation that we used, the training data is processed in chunks, and matrix sizes were of the order 500×500. For problems with very large numbers of SVs, a modified training algorithm has recently been proposed by Osuna, Freund, and Girosi (1997).

FIGURE 2.14: Two-class classification problem solved by the Support Vector algorithm (c_i = 1 for all i; cf. Eq. 2.40).

Error Functions. Due to the constraints (2.18) and the target function (2.19), the SV algorithm puts emphasis on correctly separating the training data. In this respect, it is different from the classical RBF approach of training in the least-squares metric, which is more concerned with the general problem of estimating posterior probabilities than with directly solving a classification task at hand. There exist, however, studies investigating the question of how to select RBF centers or exemplars to minimize the number of misclassifications; see for instance (Chang and Lippmann, 1993; Duda and Hart, 1973; Reilly, Cooper, and Elbaum, 1982; Barron, 1984). A classical RBF system could also be made more discriminant by using moving centers (e.g. Poggio and Girosi, 1990), or a different cost function, such as the classification figure of merit (Hampshire and Waibel, 1990). In fact, it can be shown that Gaussian RBF regularization networks are equivalent to SV machines if the regularization operator and the cost function are chosen appropriately (Smola and Schölkopf, 1997b).

[12] For noisy regression problems, on the other hand, Support Vector machines can be slower (Müller et al., 1997).


FIGURE 2.15: A simple two-class classification problem as solved by the SV algorithm (c_i = 1 for all i; cf. Eq. 2.40). Note that the RBF centers (indicated by extra circles) are closest to the decision boundary. Interestingly, the decision boundary is a straight line, even though a nonlinear Gaussian RBF kernel was used. This is due to the fact that only two SVs are required to solve the problem. The translational and unitary invariance of the RBF kernel then renders the situation completely symmetric.

It is important to stress that the SV machine does not minimize the empirical risk (misclassification error on the training set) alone. Instead, it minimizes the sum of an upper bound on the empirical risk and a penalty term that depends on the complexity of the classifier used.

2.5.2 Toy Examples: What are the Support Vectors?

Support Vectors are elements of the data set that are "important" in separating the two classes from each other. In general, the SVs with zero slack variables (2.17) lie on the boundary of the decision surface, as they precisely satisfy the inequality (2.18) in the high-dimensional space. Figures 2.15 and 2.14 illustrate that for the Gaussian kernel used here, this is also the case in input space. This raises an interesting question from the point of view of interpreting the structure of trained RBF networks. The traditional view of RBF networks has been one where the centers were regarded as "templates" or stereotypical patterns. It is this point of view that leads to the clustering heuristic for training RBF networks. In contrast, the SV machine posits an alternate point of view, with the centers being those examples which are critical for a given classification task.
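The toy comparison of figures 2.12-2.15 can be imitated with a few lines of scikit-learn: an SV machine with Gaussian kernel on one hand, and an RBF network whose centers come from k-means and whose output weights are fit by least squares on the other. The data, the number of centers, and the use of a pseudo-inverse instead of backpropagation for the classical weights are simplifications of this sketch, not the original setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def fit_svm_rbf(X, y, c=1.0, C=10.0):
    """SV machine with Gaussian kernel exp(-||x - x'||^2 / c); its SVs are the centers."""
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / c).fit(X, y)
    return clf, clf.support_vectors_

def fit_classical_rbf(X, y, n_centers, c=1.0):
    """Classical RBF net: k-means centers, output weights by least squares (pseudo-inverse)."""
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_

    def features(Z):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.hstack([np.exp(-d2 / c), np.ones((len(Z), 1))])   # Gaussians plus a bias term

    w = np.linalg.pinv(features(X)) @ np.where(y > 0, 1.0, -1.0)
    return lambda Z: np.sign(features(Z) @ w), centers

# Hypothetical usage on 2-D toy data X (n x 2) with labels y in {-1, +1}:
# svm, sv_centers = fit_svm_rbf(X, y)
# rbf_predict, km_centers = fit_classical_rbf(X, y, n_centers=len(sv_centers))
```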


TABLE 2.11: Numbers of centers (Support Vectors) automatically extracted by the Support Vector machine. The first row gives the total number for each binary classifier, including both positive and negative examples; in the second row, we only counted the positive SVs. The latter number was used in the initialization of the k-means algorithm, cf. Sec. 2.5.

  digit class      0     1     2     3     4     5     6     7     8     9
  # of SVs         274   104   377   361   334   388   236   235   342   263
  # of pos. SVs    172   77    217   179   211   231   147   133   194   166

TABLE 2.12: Two-class classification: numbers of test errors (out of 2007 test patterns) for the three systems described in Sec. 2.5.

  digit class            0    1    2    3    4    5    6    7    8    9
  classical RBF          20   16   43   38   46   31   15   18   37   26
  RBF with SV centers    9    12   27   24   32   24   19   16   26   16
  full SV machine        16   8    25   19   29   23   14   12   25   16

2.5.3 Handwritten Digit Recognition

We used the USPS database of handwritten digits (Appendix C). The SV machine results reported in the following were obtained with our default choice of 10 for the regularization constant (cf. (2.19), Sec. 2.3), and c = 0.3·N (cf. (2.27)), where N = 256 is the dimensionality of input space.[13]

[13] The SV machine is rather insensitive to different choices of c. For all values in 0.1, 0.2, ..., 1.0, the performance is about the same (in the area of 4%-4.5%).

Two-class classification. Table 2.11 shows the numbers of Support Vectors, i.e. RBF centers, extracted by the SV algorithm. Table 2.12 gives the results of binary classifiers separating single digits from the rest, for the systems described in Sec. 2.5.

Ten-class classification. For each test pattern, the arbitration procedure in all three systems simply returns the digit class whose recognizer gives the strongest response (cf. (2.38)). Table 2.13 shows the 10-class digit recognition error rates for our original system and the two RBF-based systems.

The fully automatic SV machine exhibits the highest test accuracy of the three systems.[14] Using the SV algorithm to choose the centers for the RBF network is also better than the baseline procedure of choosing the centers by a clustering heuristic as described above. It can be seen that, in contrast to the k-means cluster centers, the centers chosen by the SV algorithm allow zero training error rates.

[14] An analysis of the errors showed that about 85% of the errors committed by the SV machine were also made by the other systems. This makes the differences in error rates very reliable.

The considered recognition task is known to be rather hard: the human error rate


is 2.5% (Bromley and Säckinger, 1991), almost matched by a memory-based Tangent-distance classifier (2.6%; Simard, LeCun, and Denker, 1993). Other results on this database include a Euclidean distance nearest neighbour classifier (5.9%; Simard, LeCun, and Denker, 1993), a perceptron with one hidden layer, 5.9%, and a convolutional neural network (5.0%; LeCun et al., 1989). By incorporating translational and rotational invariance using the Virtual SV technique (see below, Sec. 4.2.1), we were able to improve the performance of the considered Gaussian kernel SV machine (same values of the regularization constant and c) from 4.2% to 3.2% error.

TABLE 2.13: 10-class digit recognition error rates for three RBF classifiers constructed with different algorithms. The first system is a classical one, choosing its centers by k-means clustering. In the second system, the Support Vectors were used as centers, and in the third one, the entire network was trained using the Support Vector algorithm.

  Classification error rate (USPS DB)
              classical RBF   RBF with SV centers   full SV machine
  Training    1.7%            0.0%                  0.0%
  Test        6.7%            4.9%                  4.2%

2.5.4 Summary and Discussion

The Support Vector algorithm provides a principled way of choosing the number and the locations of RBF centers. Our experiments on a real-world pattern recognition problem have shown that, in contrast to a corresponding number of centers chosen by k-means, the centers chosen by the Support Vector algorithm allowed a training error of zero, even if the weights were trained by classical RBF methods. The interpretation of this finding is that the Support Vector centers are specifically chosen for the classification task at hand, whereas k-means does not care about picking those centers which will make a problem separable.

In addition, the SV centers yielded lower test error rates than k-means. It is interesting to note that using SV centers, while sticking to the classical procedure for training the weights, improved training and test error rates by approximately the same amount (2%). In view of the guaranteed risk bound (1.5), this can be understood in the following way: the improvement in test error (risk) was solely due to the lower value of the training error (empirical risk); the confidence term (the second term on the right hand side of (1.5)), depending on the VC-dimension and thus on the norm of the weight vector, did not change, as we stuck to the classical weight training procedure. However, when we also trained the weights with the Support Vector algorithm, we minimized the norm of the weight vector (see (2.19), (2.4)) in feature space, and thus the confidence term, while still keeping the training error zero. Thus, consistent with (1.5), the Support Vector machine achieved the highest test accuracy of the three


systems.[15]

[15] Two remarks on the interpretation of our findings are in order. The first result, comparing the error rates of the classical and the hybrid system, does not necessarily rule out the possibility of reducing the training error also for k-means centers by using different cost functions or codings of the output units. It should be considered as a statement comparing two sets of centers, using the same weight training algorithm to build RBF networks from them. Along similar lines, the second result, indicating the superior performance of the full SV RBF system, refers to the systems as described in this study. It does not rule out the possibility of improving classical RBF systems by suitable methods of complexity control. Indeed, the results for the SV RBF system do show that using the same architecture, but different weight training procedures, can significantly improve results.

2.6 Model Selection

2.6.1 Choosing Polynomial Degrees[16]

[16] Copyright notice: the material in this section is based on the article "Extracting support data for a given task" by B. Schölkopf, C. Burges and V. Vapnik, which appeared in: Proceedings, First International Conference on Knowledge Discovery & Data Mining, pp. 252-257, 1995. AAAI Press.

In the case where the available amount of training data is limited, it is important to have a means for achieving the best possible generalization by controlling characteristics of the learning machine, without having to put aside parts of the training set for validation purposes. One of the strengths of SV machines consists in the automatic capacity tuning, which was related to the fact that the minimization of (2.19) is connected to structural risk minimization, based on the bound (1.5). This capacity tuning takes place within a set of functions specified a priori by the choice of a kernel function. In the following, we go one step further and use the bound (1.5) also to predict the kernel degree which yields the best generalization for polynomial classifiers (Schölkopf, Burges, and Vapnik, 1995).

Since for SV machines we have an upper bound on the VC-dimension (Proposition 2.1.1), we can use (1.5) to get an upper bound on the expected error on an independent test set in terms of the training error and the value of ||w|| (or, equivalently, the margin 2/||w||). This bound can then be used to try to choose parameters of the learning machines such that the test error gets minimal, without actually looking at the test set.

We consider polynomial classifiers with the kernel (2.26), varying their degree d, and make the assumption that the bound (2.4) gives a reliable indication of the actual VC-dimension, i.e. that the VC-dimension can be estimated by

h \approx c_1 h_{est}, \qquad h_{est} := R^2 \|\mathbf{w}\|^2,    (2.41)

with some c_1 < 1 which is independent of the polynomial degree.

For the USPS digit recognition problem, training errors are very small. In that case, the right hand side of the bound (1.5) is dominated by the confidence term, which is minimized when the VC-dimension is minimized. For the latter, we use (2.41), with


||w|| determined by the Support Vector algorithm (note that ||w|| is computed in feature space, using the kernel). Thus, in order to compute h_est, we need to compute R, the radius of the smallest sphere enclosing the training data in feature space. This can be reduced to a quadratic programming problem similar to the one used in constructing the optimal hyperplane.[17] We formulate the problem as follows:

\text{Minimize } R^2 \text{ subject to } \|\mathbf{z}_i - \mathbf{z}^*\|^2 \le R^2,    (2.42)

where z* is the (to be determined) center of the sphere. We use the Lagrangian

R^2 - \sum_i \lambda_i \left( R^2 - (\mathbf{z}_i - \mathbf{z}^*)^2 \right),    (2.43)

and compute the derivatives with respect to z* and R to get

\mathbf{z}^* = \sum_i \lambda_i \mathbf{z}_i    (2.44)

and the Wolfe dual problem: maximize

\sum_i \lambda_i (\mathbf{z}_i \cdot \mathbf{z}_i) - \sum_{i,j} \lambda_i \lambda_j (\mathbf{z}_i \cdot \mathbf{z}_j)    (2.45)

subject to

\sum_i \lambda_i = 1, \quad \lambda_i \ge 0,    (2.46)

where the λ_i are Lagrange multipliers.

[17] The following derivation is due to Chris Burges.

FIGURE 2.16: Average VC-dimension (solid) and total number of test errors of the ten two-class-classifiers (dotted) for polynomial degrees 2 through 7 (for degree 1, R_emp is comparably big, so the VC-dimension alone is not sufficient for predicting R, cf. (1.5)). The baseline on the error scale, 174, corresponds to the total number of test errors of the ten best binary classifiers out of the degrees 2 through 7. The graph shows that the VC-dimension allows us to predict that degree 4 yields the best overall performance of the two-class-classifiers on the test set. This is not necessarily the case for the performances of the ten-class classifiers, which are built from the two-class-classifier outputs before applying the sgn functions (cf. Sec. 2.1.6).
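The dual (2.45)-(2.46) is a small quadratic program over the simplex and can be solved with any off-the-shelf optimizer. The sketch below uses SciPy's SLSQP on a precomputed kernel (Gram) matrix, so that the dot products (z_i · z_j) become k(x_i, x_j), and returns R^2; the choice of solver and the starting point are assumptions of this sketch, not the original implementation.

```python
import numpy as np
from scipy.optimize import minimize

def enclosing_sphere_radius_sq(K):
    """Squared radius of the smallest sphere enclosing the data in feature space.

    K is the kernel matrix, K[i, j] = k(x_i, x_j) = (z_i . z_j). Solves
    max  sum_i lam_i K_ii - sum_ij lam_i lam_j K_ij
    s.t. sum_i lam_i = 1, lam_i >= 0                      (cf. (2.45), (2.46)).
    """
    n = K.shape[0]
    diag = np.diag(K)
    objective = lambda lam: -(lam @ diag - lam @ K @ lam)   # negate: SciPy minimizes
    jac = lambda lam: -(diag - 2.0 * K @ lam)
    res = minimize(objective, np.full(n, 1.0 / n), jac=jac, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
    lam = res.x
    return float(lam @ diag - lam @ K @ lam)                # R^2 at the optimum

# Hypothetical usage with a polynomial kernel of degree d on training data X:
# K = ((X @ X.T) / X.shape[1]) ** d
# R2 = enclosing_sphere_radius_sq(K)
```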


As in the Support Vector algorithm, this problem has the property that the z_i appear only in dot products, so as before one can compute the dot products in feature space, replacing (z_i · z_j) by k(x_i, x_j) (where the x_i live in input space, and the z_i in feature space).

In this way, we compute the radius of the minimal enclosing sphere for all the USPS training data for polynomial classifiers of degrees 1 through 7. For the same degrees, we then train a binary polynomial classifier for each digit. Using the obtained values for h_est, we can predict, for each digit, which degree polynomial will give the best generalization performance. Clearly, this procedure is contingent upon the validity of the assumption that c_1 is approximately the same for all degrees. We can then compare this prediction with the actual polynomial degree which gives the best performance on the test set. The results are shown in Table 2.14; cf. also Fig. 2.16.

TABLE 2.14: Performance of the classifiers with degree predicted by the VC-bound. Each row describes one two-class-classifier separating one digit (stated in the first column) from the rest. The remaining columns contain: deg: the degree of the best polynomial as predicted by the described procedure; param.: the dimensionality of the high-dimensional space, which is also the VC-dimension for the set of all separating hyperplanes in that space; h_est: the VC-dimension estimate for the actual classifiers, which is much smaller than the number of free parameters of linear classifiers in that space; 1-7: the numbers of errors on the test set for polynomial classifiers of degrees 1 through 7. The table shows that the described procedure chooses polynomial degrees which are optimal or close to optimal.

          chosen classifier            errors on the test set for degrees 1-7
  digit   deg   param.      h_est      1    2    3    4    5    6    7
  0       3     2.8×10^6    547        36   14   11   11   11   12   17
  1       7     1.5×10^13   95         17   15   14   11   10   9    10
  2       3     2.8×10^6    832        53   32   28   26   28   27   32
  3       3     2.8×10^6    1130       57   25   22   22   22   22   23
  4       4     1.8×10^8    977        50   32   32   30   30   29   33
  5       3     2.8×10^6    1117       37   20   22   24   24   26   28
  6       4     1.8×10^8    615        23   12   12   15   17   17   19
  7       5     9.5×10^9    526        25   15   12   10   11   13   14
  8       4     1.8×10^8    1466       71   33   28   24   28   32   34
  9       5     9.5×10^9    1210       51   18   15   11   11   12   15
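Putting the pieces together, the degree-selection procedure amounts to computing h_est = R^2 ||w||^2 for each candidate degree and picking the smallest. The sketch below reuses enclosing_sphere_radius_sq from the previous sketch; it assumes scikit-learn's SVC, whose dual_coef_ attribute stores y_i alpha_i for the Support Vectors, so that ||w||^2 in feature space is dual_coef_ K dual_coef_^T restricted to the SV indices. The factor c_1 is dropped, since it is assumed constant across degrees.

```python
import numpy as np
from sklearn.svm import SVC

def predict_best_degree(X, y, degrees=range(1, 8), C=10.0):
    """Pick the polynomial degree minimizing the VC-dimension estimate h_est = R^2 ||w||^2."""
    h_est = {}
    for d in degrees:
        K = ((X @ X.T) / X.shape[1]) ** d                 # kernel ((x . y)/N)^d
        clf = SVC(C=C, kernel="precomputed").fit(K, y)
        sv = clf.support_
        # ||w||^2 = sum_ij (y_i alpha_i)(y_j alpha_j) k(x_i, x_j)
        w2 = (clf.dual_coef_ @ K[np.ix_(sv, sv)] @ clf.dual_coef_.T).item()
        h_est[d] = enclosing_sphere_radius_sq(K) * w2     # cf. (2.41), up to the factor c_1
    return min(h_est, key=h_est.get), h_est

# Hypothetical usage on one binary digit task (y in {-1, +1}):
# best_d, estimates = predict_best_degree(X_train, y_train)
```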


The above method for predicting the optimal classifier functions gives good results. In four cases, the theory predicted the correct degree; in the other cases, the predicted degree gave performance close to the best possible one.

2.6.2 The Choice of the Regularization Parameter

In addition to kernel parameters such as the polynomial degree, there is another parameter whose value needs to be set for SV training: the regularization constant γ, determining the trade-off between minimizing training error and controlling complexity (cf. (2.19)). The optimal value of γ should depend both on characteristics of the problem at hand and on the sample size. Although our experience suggests that for problems with little noise, the results are reasonably insensitive with respect to changes of γ, it would still be desirable to have a principled method for choosing γ. The remainder of this section is an attempt at developing such a method.

As in the last section, the starting point is the risk bound (1.5). The idea is to adjust γ such that minimization of the SV objective function (2.19) amounts to minimizing (1.5). As the solution w depends on the value of γ chosen (in (2.19)), we cannot use (1.5) and (2.4) to determine the value of γ a priori. Instead, we will resort to an iterative strategy.

Following (2.4), we write

h = c_1 R^2 \|\mathbf{w}\|^2    (2.47)

with some c_1 < 1. Substituting this into the bound (1.5), (1.6), we obtain

\sqrt{ \frac{ c_1 R^2 \|\mathbf{w}\|^2 \left( \log\frac{2\ell}{c_1 R^2 \|\mathbf{w}\|^2} + 1 \right) - \log(\delta/4) }{ \ell } } + R_{emp}(\alpha)    (2.48)

as an upper bound on the risk that we want to minimize.

Remembering that Σ_i ξ_i is an upper bound on the number of training errors, we additionally write

\sum_i \xi_i = c_2 \ell R_{emp}(\mathbf{w}),    (2.49)

with some c_2 ≥ 1, where ℓ is the number of training examples. Minimizing the objective function (2.19) then amounts to minimizing

\frac{1}{2 \gamma c_2 \ell} \|\mathbf{w}\|^2 + R_{emp}(\mathbf{w}).    (2.50)

Identifying w with the function index α in (2.48), we now have a formulation where the second terms of (2.50) and (2.48) are identical. The first terms cannot coincide in general: unlike the first term of (2.48), ||w||^2/(2γc_2ℓ) is proportional to ||w||^2. However, a necessary condition to ensure that the minimum of the function that we


are minimizing, (2.50), is close to that of the one that we would like to minimize, (2.48), is that the gradients of the first terms with respect to w coincide at the minimum. Hence, we have to choose γ such that

\frac{1}{\gamma c_2 \ell}\, \mathbf{w} = \frac{ c_1 R^2 \left( \log\frac{2\ell}{c_1 R^2 \|\mathbf{w}\|^2} - 1 \right) }{ \sqrt{ \ell \left( c_1 R^2 \|\mathbf{w}\|^2 \log\frac{2\ell}{c_1 R^2 \|\mathbf{w}\|^2} - \log(\delta/4) \right) } }\, \mathbf{w}.    (2.51)

For w ≠ 0, we thus obtain

\gamma = \frac{ \sqrt{ \ell \left( c_1 R^2 \|\mathbf{w}\|^2 \log\frac{2\ell}{c_1 R^2 \|\mathbf{w}\|^2} - \log(\delta/4) \right) } }{ c_1 c_2 \ell R^2 \left( \log\frac{2\ell}{c_1 R^2 \|\mathbf{w}\|^2} - 1 \right) }.    (2.52)

Eq. (2.52) establishes a relationship between γ and w at the point of the solution. If we start with a non-optimal value of γ, however, we will obtain a non-optimal w and thus (2.52) will not tell us exactly how to adjust γ. For instance, suppose we start with a γ which is too big. Then too much weight will be put on minimizing the empirical risk (cf. (2.19)), and the margin will become too small, i.e. w will become too big. We will resort to the following method: we use (2.52) to determine a new value γ', and iterate the procedure.

The value of ||w||^2 is obtained by solving the SV quadratic programming problem (in feature space, we have ||w||^2 = Σ_ij y_i y_j α_i α_j k(x_i, x_j)); R^2 is computed as in Sec. 2.6.1, and c_2 is obtained from Σ_i ξ_i and the training error using (2.49). The values of c_1 and δ must be chosen by the user. The constant c_1 characterizes the tightness of the VC-dimension bound (cf. (2.47)), and 1 - δ is a lower bound on the probability with which the risk bound (2.48) holds. As long as δ is not too close to 0, it hardly affects our procedure. The value of c_1 is more difficult to choose correctly; however, reasonable results can already be obtained with the default choice c_1 = 1, as we shall see below.

Statements on the convergence of this procedure are hard to obtain: to compute the mapping from γ to γ', we have to train an SV machine and then evaluate (2.52), thus one cannot compute its derivative in closed form. In an empirical study to be described presently, the procedure exhibited well-behaved convergence behaviour. In the experiment, we used the small MNIST database (Appendix C). We found that the iteration converged no matter whether we started with a very large or with a tiny γ. In the following, we report results obtained when starting with a huge value of γ, which effectively leads to the construction of an optimal margin classifier (i.e. with zero training error; cf. Sec. 2.1.2: for γ = ∞, (2.22) reduces to (2.14)).

Table 2.15 shows partly encouraging results. For all 10 binary digit recognizers, the iteration converges very fast (about two steps were required to reach a value very close to the final one). In seven cases, the number of test errors decreased, in only two cases did it increase. By combining the resulting binary classifiers, we obtained a 10-class classifier with an error rate (on the 10000-element small MNIST test set) of 3.9%, slightly better than the error rate obtained both with the starting value used in the iteration, γ = 10^10, and with our default choice γ = 10: in these cases, we obtained 4.0% error (cf. below, in Table 4.6).
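The iteration can be sketched as follows, reusing enclosing_sphere_radius_sq from the sketch above. The symbols γ, c_1, c_2 and δ follow the reconstruction of (2.47)-(2.52), and the concrete update, the slack computation, and the use of scikit-learn's SVC are this sketch's reading of the procedure, not the original code.

```python
import numpy as np
from sklearn.svm import SVC

def iterate_regularization_constant(K, y, gamma0=1e10, c1=1.0, delta=0.2, n_iter=5):
    """Iteratively adjust the regularization constant according to (2.52)."""
    ell = len(y)
    R2 = enclosing_sphere_radius_sq(K)
    gamma = gamma0
    for _ in range(n_iter):
        clf = SVC(C=gamma, kernel="precomputed").fit(K, y)
        sv = clf.support_
        w2 = (clf.dual_coef_ @ K[np.ix_(sv, sv)] @ clf.dual_coef_.T).item()   # ||w||^2
        train_err = np.mean(clf.predict(K) != y)
        # c_2 from (2.49): sum of slacks = c_2 * ell * R_emp(w); set c_2 = 1 if R_emp = 0.
        slacks = np.maximum(0.0, 1.0 - y * clf.decision_function(K)).sum()
        c2 = slacks / (ell * train_err) if train_err > 0 else 1.0
        log_term = np.log(2.0 * ell / (c1 * R2 * w2))
        gamma = (np.sqrt(ell * (c1 * R2 * w2 * log_term - np.log(delta / 4.0)))
                 / (c1 * c2 * ell * R2 * (log_term - 1.0)))                    # (2.52)
    return gamma

# Hypothetical usage with a degree-5 polynomial kernel on a small-MNIST-style binary task:
# K = ((X @ X.T) / X.shape[1]) ** 5
# gamma = iterate_regularization_constant(K, y)
```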

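The iterative procedure just described can be summarized in a short program. The following is only a minimal sketch under several assumptions: it presumes a hypothetical routine train_svm(gamma) that returns the dual coefficients and slack variables of a soft-margin SV machine trained with constant gamma, it estimates the number of training errors from the slacks, and it applies the reconstructed update rule (2.52) directly.

```python
import numpy as np

def choose_gamma(train_svm, K, y, R2, gamma0=1e10, c1=1.0, eta=0.2, n_iter=5):
    """Iterative choice of the regularization constant (sketch of Sec. 2.6.2).

    train_svm(gamma) -> (alpha, xi) is an assumed interface returning dual
    coefficients and slack variables; K is the kernel matrix, y the +1/-1
    labels, R2 an estimate of R^2 (computed as in Sec. 2.6.1).
    """
    ell = len(y)
    gamma = gamma0
    for _ in range(n_iter):
        alpha, xi = train_svm(gamma)
        w2 = (alpha * y) @ K @ (alpha * y)       # ||w||^2 = sum_ij y_i y_j a_i a_j k(x_i, x_j)
        n_err = int(np.sum(xi >= 1.0))           # assumption: slack >= 1 counted as a training error
        Remp = n_err / ell
        c2 = xi.sum() / (ell * Remp) if Remp > 0 else 1.0   # (2.49); c2 := 1 if Remp = 0
        h = c1 * R2 * w2                         # (2.47); the bound is only meaningful for h < 2*ell
        log_term = np.log(2 * ell / h)
        gamma = np.sqrt(ell * (h * log_term - np.log(eta / 4))) / (
            c1 * c2 * ell * R2 * (log_term - 1))            # update according to (2.52)
    return gamma
```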
Clearly, further experiments are necessary to validate or improve this method. In particular, it would be interesting to study a noisy classification problem, where the choice of \gamma should potentially have a greater impact on the quality of the solution.

We conclude this section with a note on the relationship of the model selection methods described in Sections 2.6.1 and 2.6.2. Both of the proposed methods are based on the bound (1.5). In principle, we could also apply the method of 2.6.1 for choosing \gamma. In that case, we would try out a series of values of \gamma, and pick the one which minimizes (1.5). The advantage of the present method, however, is that it does not require scanning a whole range of \gamma values. Instead, \gamma is chosen such that, with the help of a few iterations, the SV optimization automatically minimizes (1.5) over \gamma, in addition to the built-in minimization over the weights of the SV machine (cf. the remarks at the beginning of Sec. 2.6.1).^18

TABLE 2.15: Iterative choice of the regularization constant \gamma (cf. Sec. 2.6.2) for all ten digit recognizers on the small MNIST database. Each table shows SV machine parameters and results for the starting point (\gamma = 10^{10}) and for five subsequent iteration steps. In all cases, we used c_1 = 1, \eta = 0.2 (corresponding to a risk bound holding with probability of at least 0.8), and a polynomial classifier of degree 5. The constant c_2 is undefined before the first run of the algorithm. After each run, it is computed using (2.49); if R_{emp} is 0, we set c_2 = 1. For \gamma = 10^{10}, we are effectively computing an optimal separating hyperplane, with zero training errors. The iteration converges very fast; moreover, in seven of the ten cases, it reduced the number of test errors (in two cases, the opposite happened).

digit 0
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               38            177        36.156
    0.723763    1      3               32            187        29.460
    0.052130    10.0   3               32            194        29.288
    0.050580    10.3   3               31            194        29.312
    0.047947    10.8   3               30            192        29.416
    0.051618    10.1   3               31            188        29.316

digit 1
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               33            141        48.998
    1.248717    1      3               30            153        34.286
    0.047532    14.0   3               31            160        34.063
    0.045677    14.4   3               30            161        33.988
    0.042151    15.5   3               29            157        34.054
    0.046176    14.2   3               31            154        34.038

^18 We performed another set of experiments to find out whether the leave-one-out generalization bound (Proposition 2.1.2) could be used for selecting \gamma. On the small MNIST database, the results were negative, leading to values of \gamma which were too large (in the range of 10^5 to 10^{10}).
digit 2
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               104           340        58.816
    1.853855    1      4               88            355        49.055
    0.100126    12.5   4               87            354        48.810
    0.094182    13.2   4               87            352        48.757
    0.092732    13.3   4               87            352        48.833
    0.095662    13.0   4               88            351        48.906

digit 3
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               96            377        70.455
    3.047037    1      5               93            392        57.387
    0.139778    12.5   5               93            397        56.860
    0.122975    13.9   5               94            387        57.143
    0.142765    12.1   5               96            385        56.868
    0.123462    13.9   5               93            383        57.272

digit 4
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               74            282        66.786
    2.586497    1      5               79            312        52.781
    0.134502    10.8   6               77            313        52.409
    0.150066    9.6    6               77            312        52.374
    0.152267    9.4    6               77            313        52.324
    0.147880    9.7    6               77            311        52.338

digit 5
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               87            339        65.051
    2.400509    1      4               101           358        53.578
    0.116545    12.9   5               99            353        53.182
    0.126663    11.7   5               99            362        53.263
    0.129597    11.5   5               98            358        53.250
    0.126985    11.7   5               101           363        53.272

digit 6
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               80            231        46.857
    1.144632    1      2               80            260        37.923
    0.037990    20.6   2               80            264        37.623
    0.037135    20.8   2               78            256        37.773
    0.039888    19.5   2               78            253        37.834
    0.041169    19.0   2               79            258        37.771

digit 7
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               122           253        69.716
    2.945887    1      9               109           272        48.730
    0.187035    6.6    10              109           271        48.263
    0.196960    6.2    10              108           278        48.104
    0.189135    6.4    10              108           270        48.241
    0.197877    6.1    10              108           272        48.258
digit 8
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               127           440        77.167
    4.246777    1      3               126           473        63.982
    0.102221    22.4   5               126           463        63.590
    0.156304    14.4   5               122           464        63.703
    0.165320    13.7   5               126           457        63.605
    0.166110    13.6   5               127           462        63.492

digit 9
    gamma       c2     train. errors   test errors   # of SVs   ||w||
    10^10       -      0               146           412        94.997
    21.34817    1      2               146           446        74.464
    0.103062    35.8   11              140           437        71.822
    0.417387    7.8    10              137           434        71.893
    0.383747    8.5    11              141           435        71.857
    0.439163    7.4    11              139           440        71.900

2.7 Why Do SV Machines Work Well?

The presented experimental results show that Support Vector machines obtain high accuracies which are competitive with state-of-the-art techniques. This was true for several visual recognition tasks. Care should be exercised, however, when generalizing this statement to other types of pattern recognition tasks. There, empirical studies have yet to be carried out, in particular since the tasks that we considered were all characterized by relatively low overlap of the different classes (for instance, in the USPS task, the human error rate is around 2.5%). In any case, the results obtained here are encouraging, in particular when taking into account that the SV algorithm was developed only recently. Below, we summarize different aspects providing insight into why SV machines generalize well:

Capacity Control. The kernel method allows us to reduce a large class of learning machines to separating hyperplanes in some space. For those, an upper bound on the VC-dimension can be given (Proposition 2.1.1). As argued in Sec. 2.1.3, minimizing the SV objective function (2.19) corresponds to trying to separate the data with a classifier of low VC-dimension, thereby approximately performing structural risk minimization. The problem of constructing the decision function requires minimizing a positive quadratic form subject to box constraints and can thus be solved efficiently.

As we saw, low VC-dimension is related to a large separation margin. Thus, analyses of the generalization performance in terms of separation margins and fat shattering dimension also bear relevance to SV machines (e.g. Schapire, Freund, Bartlett, and Lee, 1997; Shawe-Taylor, Bartlett, Williamson, and Anthony, 1996; Bartlett, 1997).

Compression. The leave-one-out bound (Proposition 2.1.2) relates SV generalization ability to the fact that the decision function is expressed in terms of a (possibly small) subset of the data.
This can be viewed in the context of Algorithmic Complexity and Minimum Description Length (Vapnik, 1995b, Chapter 5, footnote 6).

Regularization. In (Smola and Schölkopf, 1997b), a regularization framework is proposed which contains the SV algorithm as a special case. For kernel-based function expansions, it is shown that, given a regularization operator P (Tikhonov and Arsenin, 1977) mapping the functions of the learning machine into some dot product space D, the problem of minimizing the regularized risk

    R_{reg}[f] = R_{emp}[f] + \lambda \|Pf\|^2,                              (2.53)

(with a regularization parameter \lambda \ge 0) can be written as a constrained optimization problem. For particular choices of the cost function, it further reduces to an SV type quadratic programming problem. The latter thus is not specific to SV machines, but is common to a much wider class of approaches. What gets lost in this case, however, is the fact that the solution can usually be expressed in terms of a small number of SVs. This specific feature of SV machines is due to the fact that the type of regularization and the class of functions which are considered as admissible solutions are intimately related (cf. Poggio and Girosi, 1990; Girosi, Jones, and Poggio, 1993; Smola and Schölkopf, 1997a; Smola, Schölkopf, and Müller, 1997): the SV algorithm is equivalent to minimizing the regularized risk on the set of functions

    f(x) = \sum_i \alpha_i k(x_i, x) + b,                                    (2.54)

provided that k and P are interrelated by

    k(x_i, x_j) = ((Pk)(x_i, \cdot) \cdot (Pk)(x_j, \cdot)).                 (2.55)

(Here, (Pk)(x_i, \cdot) denotes the result of applying P to the function obtained by fixing k's first argument to x_i.) To this end, k is chosen as the Green's function of P^*P. For instance, an RBF kernel thus corresponds to regularization with a functional containing a specific differential operator (Yuille and Grzywacz, 1988).
Chapter 3

Kernel Principal Component Analysis

In the last chapter, we tried to show that the idea of implicitly mapping the data into a high-dimensional feature space has been a very fruitful one in the context of SV machines. Indeed, it is mainly this feature which distinguishes them from the Generalized Portrait algorithm, which has been known for more than 20 years (Vapnik and Chervonenkis, 1974), and which makes them applicable to complex real-world problems that are not linearly separable. Thus, it was natural to ask whether the same idea could prove fruitful in other domains of learning.

The present chapter proposes a new method for performing a nonlinear form of Principal Component Analysis. By the use of Mercer kernels, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition (Schölkopf, Smola, and Müller, 1996b, 1997b).

Copyright notice: the material in this chapter is based on the article "Nonlinear component analysis as a kernel Eigenvalue problem" by B. Schölkopf, A. Smola and K.-R. Müller, which will appear in: Neural Computation, 1997. MIT Press.

3.1 Introduction

Principal Component Analysis (PCA) is a powerful technique for extracting structure from possibly high-dimensional data sets. It is readily performed by solving an Eigenvalue problem, or by using iterative algorithms which estimate principal components. For reviews of the existing literature, see Jolliffe (1986) and Diamantaras & Kung (1996); some of the classical papers are due to Pearson (1901), Hotelling (1933), and Karhunen (1946). PCA is an orthogonal transformation of the coordinate system in which we describe our data. The new coordinate values by which we represent the data are called principal components. It is often the case that a small number of principal components is sufficient to account for most of the structure in the data. These are sometimes called factors or latent variables of the data.

The present work studies PCA in the case where we are not interested in principal components in input space, but rather in principal components of variables, or
features, which are nonlinearly related to the input variables. Among these are, for instance, variables obtained by taking arbitrary higher-order correlations between input variables. In the case of image analysis, this amounts to finding principal components in the space of products of input pixels.

To this end, we are computing dot products in feature space by means of kernel functions in input space (cf. Sec. 1.3). Given any algorithm which can be expressed solely in terms of dot products, i.e. without explicit usage of the variables themselves, this kernel method enables us to construct different nonlinear versions of it. Even though this general fact was known (Burges, private communication), the machine learning community has made little use of it, the exception being Vapnik's Support Vector machines (Chapter 2). In this chapter, we give an example of applying this method in the domain of unsupervised learning, to obtain a nonlinear form of PCA.

In the next section, we will first review the standard PCA algorithm. In order to be able to generalize it to the nonlinear case, we formulate it in a way which uses exclusively dot products. Using kernel representations of dot products (Sec. 1.3), Sec. 3.3 presents a kernel-based algorithm for nonlinear PCA and explains some of the differences to previous generalizations of PCA. First experimental results on kernel-based feature extraction for pattern recognition are given in Sec. 3.4. We conclude with a discussion (Sec. 3.5). Some technical material which is not essential for the main thread of the argument has been relegated to Appendix D.2.

3.2 Principal Component Analysis in Feature Spaces

Given a set of centered observations x_k \in R^N, k = 1, ..., M, \sum_{k=1}^M x_k = 0, PCA diagonalizes the covariance matrix^1

    C = \frac{1}{M}\sum_{j=1}^M x_j x_j^\top.                                (3.1)

To do this, one has to solve the Eigenvalue equation

    \lambda v = Cv                                                           (3.2)

for Eigenvalues \lambda \ge 0 and Eigenvectors v \in R^N \setminus \{0\}. As

    \lambda v = Cv = \frac{1}{M}\sum_{j=1}^M (x_j \cdot v)\, x_j,            (3.3)

all solutions v with \lambda \ne 0 must lie in the span of x_1, ..., x_M; hence (3.2) is in that case equivalent to

    \lambda (x_k \cdot v) = (x_k \cdot Cv)   for all k = 1, ..., M.          (3.4)

^1 More precisely, the covariance matrix is defined as the expectation of xx^\top; for convenience, we shall use the same term to refer to the estimate (3.1) of the covariance matrix from a finite sample.
In the remainder of this section, we describe the same computation in another dot product space F, which is related to the input space by a possibly nonlinear map

    \Phi : R^N \to F, \quad x \mapsto X.                                     (3.5)

Note that the feature space F could have an arbitrarily large, possibly infinite, dimensionality. Here and in the following, upper case characters are used for elements of F, while lower case characters denote elements of R^N.

Again, we assume that we are dealing with centered data, \sum_{k=1}^M \Phi(x_k) = 0; we shall return to this point later. In F, the covariance matrix takes the form

    \bar{C} = \frac{1}{M}\sum_{j=1}^M \Phi(x_j)\Phi(x_j)^\top.               (3.6)

Note that if F is infinite-dimensional, we think of \Phi(x_j)\Phi(x_j)^\top as a linear operator on F, mapping

    X \mapsto \Phi(x_j)(\Phi(x_j) \cdot X).                                  (3.7)

We now have to find Eigenvalues \lambda \ge 0 and Eigenvectors V \in F \setminus \{0\} satisfying

    \lambda V = \bar{C} V.                                                   (3.8)

Again, all solutions V with \lambda \ne 0 lie in the span of \Phi(x_1), ..., \Phi(x_M). For us, this has two useful consequences: first, we may instead consider the set of equations

    \lambda (\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot \bar{C} V)   for all k = 1, ..., M,     (3.9)

and second, there exist coefficients \alpha_i (i = 1, ..., M) such that

    V = \sum_{i=1}^M \alpha_i \Phi(x_i).                                     (3.10)

Combining (3.9) and (3.10), we get

    \lambda \sum_{i=1}^M \alpha_i (\Phi(x_k) \cdot \Phi(x_i)) = \frac{1}{M}\sum_{i=1}^M \alpha_i \Big(\Phi(x_k) \cdot \sum_{j=1}^M \Phi(x_j)(\Phi(x_j) \cdot \Phi(x_i))\Big)   for all k = 1, ..., M.     (3.11)

Defining an M x M matrix K by

    K_{ij} := (\Phi(x_i) \cdot \Phi(x_j)),                                   (3.12)

this reads

    M\lambda K\alpha = K^2\alpha,                                            (3.13)
where \alpha denotes the column vector with entries \alpha_1, ..., \alpha_M. To find solutions of (3.13), we solve the Eigenvalue problem

    M\lambda\alpha = K\alpha                                                 (3.14)

for nonzero Eigenvalues. In Appendix D.2.1, we show that this gives us all solutions of (3.13) which are of interest for us.

Let \lambda_1 \le \lambda_2 \le ... \le \lambda_M denote the Eigenvalues of K (i.e. the solutions M\lambda of (3.14)), and \alpha^1, ..., \alpha^M the corresponding complete set of Eigenvectors, with \lambda_p being the first nonzero Eigenvalue (assuming that \Phi is not identically 0). We normalize \alpha^p, ..., \alpha^M by requiring that the corresponding vectors in F be normalized, i.e.

    (V^k \cdot V^k) = 1   for all k = p, ..., M.                             (3.15)

By virtue of (3.10) and (3.14), this translates into a normalization condition for \alpha^p, ..., \alpha^M:

    1 = \sum_{i,j=1}^M \alpha_i^k \alpha_j^k (\Phi(x_i) \cdot \Phi(x_j)) = \sum_{i,j=1}^M \alpha_i^k \alpha_j^k K_{ij} = (\alpha^k \cdot K\alpha^k) = \lambda_k (\alpha^k \cdot \alpha^k).     (3.16)

For the purpose of principal component extraction, we need to compute projections onto the Eigenvectors V^k in F (k = p, ..., M). Let x be a test point, with an image \Phi(x) in F; then

    (V^k \cdot \Phi(x)) = \sum_{i=1}^M \alpha_i^k (\Phi(x_i) \cdot \Phi(x))  (3.17)

may be called its nonlinear principal components corresponding to \Phi.

In summary, the following steps were necessary to compute the principal components: first, compute the matrix K; second, compute its Eigenvectors and normalize them in F; third, compute projections of a test point onto the Eigenvectors.^2

For the sake of simplicity, we have above made the assumption that the observations are centered. This is easy to achieve in input space, but more difficult in F, as we cannot explicitly compute the mean of the mapped observations in F. There is, however, a way to do it, and this leads to slightly modified equations for kernel-based PCA (see Appendix D.2.2).

To conclude this section, note that \Phi can be an arbitrary nonlinear map into the possibly high-dimensional space F, e.g. the space of all d-th order monomials in the entries of an input vector. In that case, we need to compute dot products of input vectors mapped by \Phi, with a possibly prohibitive computational cost. The solution to this problem, however, is to use kernel functions (1.14): we exclusively need to compute dot products between mapped patterns (in (3.12) and (3.17)); we never need the mapped patterns explicitly.

^2 Note that in our derivation we could have used the known result (e.g. Kirby & Sirovich, 1990) that PCA can be carried out on the dot product matrix (x_i \cdot x_j)_{ij} instead of (3.1); however, for the sake of clarity and extendability (in Appendix D.2.2, we shall consider the question how to center the data in F), we gave a detailed derivation.
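The centering step just mentioned can be written directly in terms of the kernel matrix. The following is a small sketch using the standard kernel-centering identity; the thesis derives the corresponding modified equations in Appendix D.2.2, which is not reproduced here, so treat this as an assumed shortcut rather than the author's exact formulation.

```python
import numpy as np

def center_kernel_matrix(K):
    """Return the Gram matrix of the mapped patterns after subtracting
    their mean in feature space (standard centering identity)."""
    M = K.shape[0]
    one_M = np.full((M, M), 1.0 / M)
    return K - one_M @ K - K @ one_M + one_M @ K @ one_M
```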
[Figure 3.1 appears here: a two-panel schematic, one panel showing linear PCA in input space R^2 with the kernel k(x,y) = (x . y), the other showing kernel PCA with a nonlinear kernel, e.g. k(x,y) = (x . y)^d; see the caption below.]
FIGURE 3.1: The basic idea of kernel PCA. In some high-dimensional feature space F (bottom right), we are performing linear PCA, just as a PCA in input space (top). Since F is nonlinearly related to input space (via \Phi), the contour lines of constant projections onto the principal Eigenvector (drawn as an arrow) become nonlinear in input space. Note that we cannot draw a pre-image of the Eigenvector in input space, as it may not even exist. Crucial to kernel PCA is the fact that there is no need to perform the map into F: all necessary computations are carried out by the use of a kernel function k in input space (here: R^2).

Therefore, we can use the kernels described in Sec. 1.3. The particular kernel used then implicitly determines the space F of all possible features. The proposed algorithm, on the other hand, is a mechanism for selecting features in F.

3.3 Kernel Principal Component Analysis

The Algorithm. To perform kernel-based PCA (Fig. 3.1), from now on referred to as kernel PCA, the following steps have to be carried out: first, we compute the matrix K_{ij} = (k(x_i, x_j))_{ij}. Next, we solve (3.14) by diagonalizing K, and normalize the Eigenvector expansion coefficients \alpha^n by requiring \lambda_n (\alpha^n \cdot \alpha^n) = 1. To extract the principal components (corresponding to the kernel k) of a test point x, we then compute projections onto the Eigenvectors by (cf. Eq. (3.17), Fig. 3.2)

    (V^n \cdot \Phi(x)) = \sum_{i=1}^M \alpha_i^n k(x_i, x).                 (3.18)
[Figure 3.2 appears here: a network diagram in which the input vector x is compared to the sample x_1, x_2, x_3, ... via kernel evaluations k(x_i, x); these are combined with weights (Eigenvector coefficients) \alpha_1, \alpha_2, \alpha_3, \alpha_4 to give the feature value (V \cdot \Phi(x)) = \sum_i \alpha_i k(x_i, x). See the caption below.]
FIGURE 3.2: Feature extractor constructed by kernel PCA (cf. (3.18)). In the first layer, the input vector is compared to the sample via a kernel function, chosen a priori (e.g. polynomial, Gaussian, or sigmoid). The outputs are then linearly combined using weights which are found by solving an Eigenvector problem. As shown in the text, the depicted network's function can be thought of as the projection onto an Eigenvector of a covariance matrix in a high-dimensional feature space. As a function on input space, it is nonlinear.

If we use a kernel satisfying Mercer's conditions (Proposition 1.3.2), we know that this procedure exactly corresponds to standard PCA in some high-dimensional feature space, except that we do not need to perform expensive computations in that space.

Properties of Kernel PCA. For Mercer kernels, we know that we are in fact doing a standard PCA in F. Consequently, all mathematical and statistical properties of PCA (see for instance Jolliffe, 1986; Diamantaras & Kung, 1996) carry over to kernel-based PCA, with the modification that they become statements about a set of points \Phi(x_i), i = 1, ..., M, in F rather than in R^N. In F, we can thus assert that PCA is the orthogonal basis transformation with the following properties (assuming that the Eigenvectors are sorted in descending order of the Eigenvalue size):

- the first q (q \in \{1, ..., M\}) principal components, i.e. projections on Eigenvectors, carry more variance than any other q orthogonal directions
- the mean-squared approximation error in representing the observations by the first q principal components is minimal^3
- the principal components are uncorrelated
- the first q principal components have maximal mutual information with respect to the inputs (this holds under Gaussianity assumptions, and thus depends on the particular kernel chosen and on the data)
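As a concrete illustration of the algorithm described at the beginning of this section, here is a minimal sketch of (uncentered) kernel PCA. It assumes a user-supplied kernel function and positive leading Eigenvalues; for the centered variant, one would first apply the kernel-matrix centering shown in Sec. 3.2.

```python
import numpy as np

def kernel_pca(X, kernel, n_components):
    """Sketch of kernel PCA: build K, diagonalize it, normalize the expansion
    coefficients so that lambda_n (alpha^n . alpha^n) = 1, and return a
    projection function implementing (3.18).  Uncentered variant."""
    M = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    eigvals, eigvecs = np.linalg.eigh(K)                  # K alpha = (M lambda) alpha
    order = np.argsort(eigvals)[::-1][:n_components]      # leading Eigenvalues of K
    alphas = eigvecs[:, order] / np.sqrt(eigvals[order])  # normalization (3.16)

    def transform(x):
        k_x = np.array([kernel(xi, x) for xi in X])
        return alphas.T @ k_x                             # nonlinear components (3.18)
    return transform
```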
To translate these properties of PCA in F into statements about the data in input space, they need to be investigated for specific choices of kernel. We conclude this section by noting one general property of kernel PCA in input space: for kernels which depend only on dot products or distances in input space (as all the examples that we have given so far do), kernel PCA has the property of unitary invariance. This follows directly from the fact that both the Eigenvalue problem and the feature extraction only depend on kernel values. This ensures that the features extracted do not depend on which orthonormal coordinate system we use for representing our input data.

Computational Complexity. As mentioned in Sec. 1.3, a fifth order polynomial kernel on a 256-dimensional input space yields a 10^{10}-dimensional feature space. For two reasons, kernel PCA can deal with this huge dimensionality. First, as pointed out in Sec. 3.2, we do not need to look for Eigenvectors in the full space F, but just in the subspace spanned by the images of our observations x_k in F. Second, we do not need to compute dot products explicitly between vectors in F (which can be impossible in practice, even if the vectors live in a lower-dimensional subspace), as we are using kernel functions. Kernel PCA thus is computationally comparable to a linear PCA on \ell observations with an \ell x \ell dot product matrix. If k is easy to compute, as for polynomial kernels, the computational complexity is hardly changed by the fact that we need to evaluate kernel functions rather than just dot products. Furthermore, in the case where we need to use a large number \ell of observations, we may want to work with an algorithm for computing only the largest Eigenvalues, as for instance the power method with deflation (for a discussion, see Diamantaras & Kung, 1996). In addition, we can consider using an estimate of the matrix K, computed from a subset of M < \ell examples, while still extracting principal components from all \ell examples (this approach was chosen in some of our experiments described below).

The situation can be different for principal component extraction. There, we have to evaluate the kernel function M times for each extracted principal component (3.18), rather than just evaluating one dot product as for a linear PCA.

^3 To see this, in the simple case where the data z_1, ..., z_\ell are centered, we consider an orthogonal basis transformation W, and use the notation P_q for the projector on the first q canonical basis vectors \{e_1, ..., e_q\}. Then the mean squared reconstruction error using q vectors is

    \frac{1}{\ell}\sum_i \|z_i - W^\top P_q W z_i\|^2 = \frac{1}{\ell}\sum_i \|W z_i - P_q W z_i\|^2 = \frac{1}{\ell}\sum_i \sum_{j>q} (W z_i \cdot e_j)^2
    = \frac{1}{\ell}\sum_i \sum_{j>q} (z_i \cdot W^\top e_j)^2 = \frac{1}{\ell}\sum_i \sum_{j>q} (W^\top e_j \cdot z_i)(z_i \cdot W^\top e_j) = \sum_{j>q} (W^\top e_j \cdot C W^\top e_j).

It can easily be seen that the values of this quadratic form (which gives the variances in the directions W^\top e_j) are minimal if the W^\top e_j are chosen as its (orthogonal) Eigenvectors with smallest Eigenvalues.
Of course, if the dimensionality of F is 10^{10}, this is still vastly faster than linear principal component extraction in F. Still, in some cases, e.g. if we were to extract principal components as a preprocessing step for classification, we might want to speed things up. This can be carried out by the reduced set technique of Burges (1996) (cf. Appendix D.1.1), proposed in the context of Support Vector machines. In the present setting, we approximate each Eigenvector

    V = \sum_{i=1}^{\ell} \alpha_i \Phi(x_i)                                 (3.19)

(Eq. (3.10)) by another vector

    \tilde{V} = \sum_{j=1}^m \beta_j \Phi(z_j),                              (3.20)

where m < \ell is chosen a priori according to the desired speed-up, and z_j \in R^N, j = 1, ..., m. This is done by minimizing the squared difference

    \rho = \|V - \tilde{V}\|^2.                                              (3.21)

This can be carried out without explicitly dealing with the possibly high-dimensional space F. Since

    \rho = \|V\|^2 + \sum_{i,j=1}^m \beta_i\beta_j k(z_i, z_j) - 2\sum_{i=1}^{\ell}\sum_{j=1}^m \alpha_i\beta_j k(x_i, z_j),     (3.22)

the gradient of \rho with respect to the \beta_j and the z_j is readily expressed in terms of the kernel function; thus \rho can be minimized by standard gradient methods. For the task of handwritten character recognition, this technique led to a speed-up by a factor of 50 at almost no loss in accuracy (Burges & Schölkopf, 1996; cf. Sec. 4.4.1).

Finally, we add that although kernel principal component extraction is computationally more expensive than its linear counterpart, this additional investment can pay off afterwards. In experiments on classification based on the extracted principal components, we found that when we trained on nonlinear features, it was sufficient to use a linear Support Vector machine to construct the decision boundary. Linear Support Vector machines, however, are much faster in classification speed than nonlinear ones. This is due to the fact that for k(x,y) = (x \cdot y), the Support Vector decision function (2.25) can be expressed with a single weight vector w = \sum_{i=1}^{\ell} y_i\alpha_i x_i as f(x) = sgn((x \cdot w) + b). Thus the final stage of classification can be done extremely fast; the speed of the principal component extraction phase, on the other hand, and thus the accuracy-speed trade-off of the whole classifier, can be controlled by the number of components which we extract, or by the number m (cf. Eq. (3.20)).
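The reduced set approximation (3.19)-(3.22) can be prototyped with a generic optimizer. The sketch below is illustrative only: it assumes the Eigenvector expansion coefficients alpha are given, treats the kernel as a black box, and lets scipy approximate the gradients numerically rather than using the analytic kernel gradients mentioned in the text.

```python
import numpy as np
from scipy.optimize import minimize

def reduced_set(X, alpha, kernel, m, seed=0):
    """Approximate V = sum_i alpha_i Phi(x_i) by V~ = sum_j beta_j Phi(z_j)
    with m terms, by minimizing rho = ||V - V~||^2 as in (3.22)."""
    ell, N = X.shape
    Kxx = np.array([[kernel(a, b) for b in X] for a in X])
    const = alpha @ Kxx @ alpha                       # ||V||^2, independent of beta and z

    def rho(params):
        beta, Z = params[:m], params[m:].reshape(m, N)
        Kzz = np.array([[kernel(a, b) for b in Z] for a in Z])
        Kxz = np.array([[kernel(a, b) for b in Z] for a in X])
        return const + beta @ Kzz @ beta - 2 * alpha @ Kxz @ beta   # (3.22)

    rng = np.random.default_rng(seed)                 # initialize with a random data subset
    idx = rng.choice(ell, size=m, replace=False)
    x0 = np.concatenate([alpha[idx], X[idx].ravel()])
    res = minimize(rho, x0, method="L-BFGS-B")
    return res.x[:m], res.x[m:].reshape(m, N)         # beta_j and z_j of (3.20)
```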
Interpretability and Variable Selection. In PCA, it is sometimes desirable to be able to select specific axes which span the subspace into which one projects in doing principal component extraction. This way, it may for instance be possible to choose variables which are more accessible to interpretation. In the nonlinear case, there is an additional problem: some elements of F do not have pre-images in input space. To make this plausible, note that the linear span of the training examples mapped into feature space can have dimensionality up to M (the number of examples). If this exceeds the dimensionality of input space, it is rather unlikely that each vector of the form (3.10) has a pre-image (cf. Appendix D.1.2). To get interpretability, we thus need to find directions in input space (i.e. input variables) whose images under \Phi span the PCA subspace in F. This can be done with an approach akin to the one described above: we could parametrize our set of desired input variables and run the minimization of (3.22) only over those parameters. The parameters can be, e.g., group parameters which determine the amount of translation, say, starting from a set of images.

Dimensionality Reduction, Feature Extraction, and Reconstruction. Unlike linear PCA, the proposed method allows the extraction of a number of principal components which can exceed the input dimensionality. Suppose that the number of observations M exceeds the input dimensionality N. Linear PCA, even when it is based on the M x M dot product matrix, can find at most N nonzero Eigenvalues; they are identical to the nonzero Eigenvalues of the N x N covariance matrix. In contrast, kernel PCA can find up to M nonzero Eigenvalues, a fact that illustrates that it is impossible to perform kernel PCA directly on an N x N covariance matrix. Even more features could be extracted by using several kernels.

Being just a basis transformation, standard PCA allows the reconstruction of the original patterns x_i, i = 1, ..., \ell, from a complete set of extracted principal components (x_i \cdot v_j), j = 1, ..., \ell, by expansion in the Eigenvector basis. Even from an incomplete set of components, good reconstruction is often possible. In kernel PCA, this is more difficult: we can reconstruct the image of a pattern in F from its nonlinear components; however, if we only have an approximate reconstruction, there is no guarantee that we can find an exact pre-image of the reconstruction in input space. In that case, we would have to resort to an approximation method (cf. (3.22)). Alternatively, we could use a suitable regression method for estimating the reconstruction mapping from the kernel-based principal components to the inputs.

Comparison to Other Methods for Nonlinear PCA. Starting from some of the properties characterizing PCA (see above), it is possible to develop a number of possible generalizations of linear PCA to the nonlinear case. Alternatively, one may choose an iterative algorithm which adaptively estimates principal components, and make some of its parts nonlinear to extract nonlinear features. Rather than giving a full review of this field here, we briefly describe just three approaches, and refer the reader to Diamantaras & Kung (1996) for more details.
Hebbian Networks. Initiated by the pioneering work of Oja (1982), a number of unsupervised neural-network type algorithms computing principal components have been proposed (e.g. Sanger, 1989). Compared to the standard approach of diagonalizing the covariance matrix, they have advantages for instance in cases where the data are nonstationary. Nonlinear variants of these algorithms are obtained by adding nonlinear activation functions. The algorithms then extract features that the authors have referred to as nonlinear principal components. These approaches, however, do not have the geometrical interpretation of kernel PCA as a standard PCA in a feature space nonlinearly related to input space, and it is thus more difficult to understand what exactly they are extracting. For a discussion of some approaches, see (Karhunen and Joutsensalo, 1995).

Autoassociative Multi-Layer Perceptrons. Consider a linear perceptron with one hidden layer, which is smaller than the input. If we train it to reproduce the input values as outputs (i.e. use it in autoassociative mode), then the hidden unit activations form a lower-dimensional representation of the data, closely related to PCA (see for instance Diamantaras & Kung, 1996). To generalize to a nonlinear setting, one uses nonlinear activation functions and additional layers.^4 While this of course can be considered a form of nonlinear PCA, it should be stressed that the resulting network training consists in solving a hard nonlinear optimization problem, with the possibility of getting trapped in local minima, and thus with a dependence of the outcome on the starting point of the training. Moreover, in neural network implementations there is often a risk of overfitting. Another drawback of neural approaches to nonlinear PCA is that the number of components to be extracted has to be specified in advance. As an aside, note that hyperbolic tangent kernels can be used to extract neural network type nonlinear features using kernel PCA (Fig. 3.6). The principal components of a test point x in that case take the form (Fig. 3.2) \sum_i \alpha_i^n \tanh(\kappa(x_i \cdot x) + \Theta).

Principal Curves. An approach with a clear geometric interpretation in input space is the method of principal curves (Hastie & Stuetzle, 1989), which iteratively estimates a curve (or surface) capturing the structure of the data. The data are projected onto (i.e. mapped to the closest point on) a curve, and the algorithm tries to find a curve with the property that each point on the curve is the average of all data points projecting onto it. It can be shown that the only straight lines satisfying the latter are principal components, so principal curves are indeed a generalization of the latter. To compute principal curves, a nonlinear optimization problem has to be solved. The dimensionality of the surface, and thus the number of features to extract, is specified in advance. Some authors (e.g. Ritter, Martinetz, and Schulten, 1990) have discussed parallels between the Principal Curve algorithm and self-organizing feature maps (Kohonen, 1982) for dimensionality reduction.

^4 Simply using nonlinear activation functions in the hidden layer would not suffice: already the linear activation functions lead to the best approximation of the data (given the number of hidden nodes), so for the nonlinearities to have an effect on the components, the architecture needs to be changed (see e.g. Diamantaras & Kung, 1996).
Kernel PCA. Kernel PCA is a nonlinear generalization of PCA in the sense that (a) it is performing PCA in feature spaces of arbitrarily large (possibly infinite) dimensionality, and (b) if we use the kernel k(x,y) = (x \cdot y), we recover original PCA. Compared to the above approaches, kernel PCA has the main advantage that no nonlinear optimization is involved; it is essentially linear algebra, as simple as standard PCA. In addition, we need not specify the number of components that we want to extract in advance. Compared to neural approaches, kernel PCA could be disadvantageous if we need to process a very large number of observations, as this results in a large matrix K. Compared to principal curves, kernel PCA is so far harder to interpret in input space; however, at least for polynomial kernels, it has a very clear interpretation in terms of higher-order features.

3.4 Feature Extraction Experiments

In this section, we present a set of experiments where we used kernel PCA (in the form given in Appendix D.2.2) to extract principal components. First, we shall take a look at a simple toy example; following that, we describe real-world experiments where we assess the utility of the extracted principal components on classification tasks.

Toy Examples. To provide some intuition on how PCA in F behaves in input space, we show a set of experiments with an artificial 2-D data set, using polynomial kernels (cf. Eq. (2.26)) of degree 1 through 4 (see Fig. 3.3). Linear PCA (on the left) only leads to 2 nonzero Eigenvalues, as the input dimensionality is 2. In contrast, nonlinear PCA allows the extraction of further components. In the figure, note that nonlinear PCA produces contour lines of constant feature value which reflect the structure in the data better than in linear PCA. In all cases, the first principal component varies monotonically along the parabola which underlies the data. In the nonlinear cases, also the second and the third components show behaviour which is similar for different polynomial degrees. The third component, which comes with small Eigenvalues (rescaled to sum to 1), seems to pick up the variance caused by the noise, as can be nicely seen in the case of degree 2. Dropping this component would thus amount to noise reduction.

In Fig. 3.3, it can be observed that for larger polynomial degrees, the principal component extraction functions become increasingly flat around the origin. Thus, different data points not too far from the origin would only differ slightly in the value of their principal components. To understand this, consider the following example: suppose we have two data points

    x_1 = \begin{pmatrix}1\\0\end{pmatrix}, \quad x_2 = \begin{pmatrix}2\\0\end{pmatrix},     (3.23)

and a kernel k(x,y) := (x \cdot y)^2. Then the differences between the entries of x_1 and x_2 get scaled up by the kernel: namely, k(x_1, x_1) = 1, but k(x_2, x_2) = 16.
[Figure 3.3 appears here: a 3 x 4 array of contour plots (columns: polynomial degrees 1 to 4; rows: the first three principal components, top to bottom), each panel labelled with its Eigenvalue. Degree 1: 0.709, 0.291, 0.000; degree 2: 0.621, 0.345, 0.034; degree 3: 0.570, 0.395, 0.026; degree 4: 0.552, 0.418, 0.021. See the caption below.]
FIGURE 3.3: Two-dimensional toy example, with data generated in the following way: x-values have uniform distribution in [-1, 1], y-values are generated from y_i = x_i^2 + \xi, where \xi is normal noise with standard deviation 0.2. From left to right, the polynomial degree in the kernel (2.26) increases from 1 to 4; from top to bottom, the first 3 Eigenvectors are shown, in order of decreasing Eigenvalue size. The figures contain lines of constant principal component value (contour lines); in the linear case, these are orthogonal to the Eigenvectors. We did not draw the Eigenvectors, as in the general case, they live in a higher-dimensional feature space.

We can compensate for this by rescaling the individual entries of each vector x_i by

    (x_i)_k \mapsto sign((x_i)_k) \, |(x_i)_k|^{1/2}.                        (3.24)

Indeed, Fig. 3.4 shows that when the data are preprocessed according to (3.24) (where higher degrees are treated correspondingly), the first principal component extractors hardly depend on the degree anymore, as long as it is larger than 1. If necessary, we can thus use (3.24) to preprocess our data.
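For illustration, the toy data of Fig. 3.3 and the polynomial-kernel feature extraction can be reproduced along the following lines, reusing the kernel_pca sketch from Sec. 3.3; the random seed and the exact polynomial form of kernel (2.26) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = x**2 + rng.normal(0, 0.2, 100)           # parabola plus Gaussian noise, as in Fig. 3.3
X = np.column_stack([x, y])

degree = 2                                   # polynomial kernel, cf. (2.26)
transform = kernel_pca(X, kernel=lambda a, b: np.dot(a, b) ** degree, n_components=3)
features = np.array([transform(p) for p in X])   # first three nonlinear components
```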
FIGURE 3.4: PCA with kernel (2.26), degrees d = 1, ..., 5. 100 points ((x_i)_1, (x_i)_2) were generated from (x_i)_2 = (x_i)_1^2 + noise (Gaussian, with standard deviation 0.2); all (x_i)_j were rescaled according to (x_i)_j \mapsto sgn((x_i)_j) \, |(x_i)_j|^{1/d}. Displayed are contour lines of constant value of the first principal component. Nonlinear kernels (d > 1) extract features which nicely increase along the direction of main variance in the data; linear PCA (d = 1) does its best in that respect, too, but it is limited to straight directions.
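The rescaling used for Fig. 3.4 is a one-line preprocessing step; the sketch below simply applies (3.24), generalized to degree d.

```python
import numpy as np

def rescale_for_degree(X, d):
    """Apply (x)_k -> sign((x)_k) * |(x)_k|**(1/d) entrywise, cf. (3.24)."""
    return np.sign(X) * np.abs(X) ** (1.0 / d)
```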
FIGURE 3.5: Two-dimensional toy example with three data clusters (Gaussians with standard deviation 0.1; depicted region: [-1, 1] x [-0.5, 1]): first 8 nonlinear principal components extracted with k(x,y) = exp(-||x - y||^2 / 0.1). Note that the first 2 principal components (top left) nicely separate the three clusters. Components 3-5 split up the clusters into halves. Similarly, components 6-8 split them again, in a way orthogonal to the above splits. Thus, the first 8 components divide the data into 12 regions.

Note, however, that the above scaling problem is irrelevant for the character and object databases to be considered below: there, most entries of the patterns are +-1.

Further toy examples, using radial basis function kernels (1.28) and neural network type sigmoid kernels (1.29), are shown in Figures 3.5-3.8.

Object Recognition. In this set of experiments, we used the MPI chair database with 89 training views per object (Appendix A). We computed the matrix K from all 2225 training examples, and used polynomial kernel PCA to extract nonlinear principal components from the training and test set.
FIGURE 3.6: Two-dimensional toy example with three data clusters (Gaussians with standard deviation 0.1; depicted region: [-1, 1] x [-0.5, 1]): first 6 nonlinear principal components extracted with k(x,y) = tanh(2(x \cdot y) - 1) (the gain and threshold values were chosen according to the values used in SV machines, cf. Table 2.4). Note that the first 2 principal components are sufficient to separate the three clusters, and the third and fourth components simultaneously split all clusters into halves.

To assess the utility of the components, we trained a soft margin hyperplane classifier (Sec. 2.1.3) on the classification task. This is a special case of Support Vector machines, using the standard dot product as a kernel function. Table 3.1 summarizes our findings: in all cases, nonlinear components as extracted by polynomial kernels (Eq. (2.26) with d > 1) led to classification accuracies superior to standard PCA. Specifically, the nonlinear components afforded top test performances between 2% and 4% error, whereas in the linear case we obtained 17%.

Character Recognition. To validate the above results on a widely used pattern recognition benchmark database, we repeated the same experiments on the US postal service database of handwritten digits (Appendix C). This database contains 9298 examples of dimensionality 256; 2007 of them make up the test set. For computational reasons, we decided to use a subset of 3000 training examples for the matrix K. Table 3.2 illustrates two advantages of using nonlinear kernels: first, the performance of a linear classifier trained on nonlinear principal components is better than for the same number of linear components; second, the performance for nonlinear components can be further improved by using more components than is possible in the linear case. The latter is related to the fact that there are, of course, many more higher-order features than there are pixels in an image. Regarding the first point, note that extracting a certain number of features in a 10^{10}-dimensional space constitutes a much higher reduction of dimensionality than extracting the same number of features in 256-dimensional input space.
FIGURE 3.7: For different threshold values \Theta (from top to bottom: \Theta = -4, -2, -1, 0, 2), kernel PCA with hyperbolic tangent kernels k(x,y) = tanh(2(x \cdot y) + \Theta) exhibits qualitatively similar behaviour (data as in the previous figures). In all cases, the first two components capture the main structure of the data, whereas the third components split the clusters.
FIGURE 3.8: A smooth transition from linear PCA to nonlinear PCA is obtained by using hyperbolic tangent kernels k(x,y) = tanh(\kappa(x \cdot y) + 1) with varying gain \kappa: from top to bottom, \kappa = 0.1, 1, 5, 10 (data as in the previous figures). For \kappa = 0.1, the first two features look like linear PCA features. For large \kappa, the nonlinear region of the tanh function becomes effective. In that case, kernel PCA can exploit this nonlinearity to allocate the highest feature gradients to regions where there are data points, as can be seen nicely in the case \kappa = 10.
                        Test Error Rate for degree
    # of components     1      2      3      4      5      6      7
    64                 23.0   21.0   17.6   16.8   16.5   16.7   16.6
    128                17.6    9.9    7.9    7.1    6.2    6.0    5.8
    256                16.8    6.0    4.4    3.8    3.4    3.2    3.3
    512                n.a.    4.4    3.6    3.9    2.8    2.8    2.6
    1024               n.a.    4.1    3.0    2.8    2.6    2.6    2.4
    2048               n.a.    4.1    2.9    2.6    2.5    2.4    2.2

TABLE 3.1: Test error rates on the MPI chair database for linear Support Vector machines trained on nonlinear principal components extracted by PCA with polynomial kernel (2.26), for degrees 1 through 7. In the case of degree 1, we are doing standard PCA, with the number of nonzero Eigenvalues being at most the dimensionality of the space, 256; thus, we can extract at most 256 principal components. The performance for the nonlinear cases (degree > 1) is significantly better than for the linear case, illustrating the utility of the extracted nonlinear components for classification.

                        Test Error Rate for degree
    # of components     1      2      3      4      5      6      7
    32                  9.6    8.8    8.1    8.5    9.1    9.3   10.8
    64                  8.8    7.3    6.8    6.7    6.7    7.2    7.5
    128                 8.6    5.8    5.9    6.1    5.8    6.0    6.8
    256                 8.7    5.5    5.3    5.2    5.2    5.4    5.4
    512                n.a.    4.9    4.6    4.4    5.1    4.6    4.9
    1024               n.a.    4.9    4.3    4.4    4.6    4.8    4.6
    2048               n.a.    4.9    4.2    4.1    4.0    4.3    4.4

TABLE 3.2: Test error rates on the USPS handwritten digit database for linear Support Vector machines trained on nonlinear principal components extracted by PCA with kernel (2.26), for degrees 1 through 7. In the case of degree 1, we are doing standard PCA, with the number of nonzero Eigenvalues being at most the dimensionality of the space, 256. Clearly, nonlinear principal components afford test error rates which are superior to the linear case (degree 1).

For all numbers of features, the optimal degree of kernels to use is around 4, which is compatible with Support Vector machine results on the same data set (cf. Sec. 2.3 and Fig. 2.16). Moreover, with only one exception, the nonlinear features are superior to their linear counterparts. The resulting error rate for the best of our classifiers (4.0%) is competitive with convolutional 5-layer neural networks (5.0% were reported by LeCun et al., 1989) and nonlinear Support Vector classifiers (4.0%, Table 2.4); it
is much better than linear classifiers operating directly on the image data (a linear Support Vector machine achieves 8.9%; Table 2.4). Our results were obtained without using any prior knowledge about symmetries of the problem at hand, which explains why the performance is inferior to Virtual Support Vector classifiers (3.2%, Table 4.1) and Tangent Distance Nearest Neighbour classifiers (2.6%, Simard, LeCun, & Denker, 1993). We believe that adding e.g. local translation invariance, be it by generating "virtual" translated examples (cf. Sec. 4.2.1) or by choosing a suitable kernel (e.g. as the ones that we shall describe in Sec. 4.3), could further improve the results.

3.5 Discussion

Feature Extraction for Classification. This chapter was devoted to the presentation of a new technique for nonlinear PCA. To develop this technique, we made use of a kernel method so far only used in supervised learning (Vapnik, 1995; Sec. 1.3). Kernel PCA constitutes a mere first step towards exploiting this technique for a large class of algorithms.

In experiments comparing the utility of kernel PCA features for pattern recognition using a linear classifier, we found two advantages of nonlinear kernels: first, nonlinear principal components afforded better recognition rates than corresponding numbers of linear principal components; and second, the performance for nonlinear components can be further improved by using more components than is possible in the linear case. We have not yet compared kernel PCA to other techniques for nonlinear feature extraction and dimensionality reduction. We can, however, compare results to other feature extraction methods which have been used in the past by researchers working on the USPS classification problem (cf. Sec. 3.4). Our system of kernel PCA feature extraction plus linear Support Vector machine for instance performed better than LeNet1 (LeCun et al., 1989). Even though the latter result was obtained a number of years ago, it should be stressed that LeNet1 provides an architecture which contains a great deal of prior information about the handwritten character classification problem. It uses shared weights to improve transformation invariance, and a hierarchy of feature detectors resembling parts of the human visual system. These feature detectors were for instance used by Bottou and Vapnik (1992) as a preprocessing stage in their experiments in local learning. Note that, in addition, our features were extracted without taking into account that we want to do classification. Clearly, in supervised learning, where we are given a set of labelled observations (x_1, y_1), ..., (x_\ell, y_\ell), it would seem advisable to make use of the labels not only during the training of the final classifier, but already in the stage of feature extraction.

To conclude this paragraph on feature extraction for classification, we note that a similar approach could be taken in the case of regression estimation.

Feature Space and the Curse of Dimensionality. We are doing PCA in 10^{10}-dimensional feature spaces, yet getting results in finite time which are comparable to state-of-the-art techniques. In fact, of course, we are not working in the full feature
space, but just in a comparably small linear subspace of it, whose dimension equals at most the number of observations. The method automatically chooses this subspace and provides a means of taking advantage of the lower dimensionality. An approach which consisted in explicitly mapping into feature space and then performing PCA would have severe difficulties at this point: even if PCA were done based on an M x M dot product matrix (M being the sample size) whose diagonalization is tractable, it would still be necessary to evaluate dot products in a 10^{10}-dimensional feature space to compute the entries of the matrix in the first place. Kernel-based methods avoid this problem: they do not explicitly compute all dimensions of F (loosely speaking, all possible features), but only work in a relevant subspace of F.

Note, moreover, that we did not get overfitting problems when training the linear SV classifier on the extracted features. The basic idea behind this two-step approach is very similar in spirit to nonlinear SV machines: one maps into a very complex space to be able to approximate a large class of possible decision functions, and then uses a low VC-dimension classifier in that space to control generalization.

Conclusion. Compared to other techniques for nonlinear feature extraction, kernel PCA has the advantages that it does not require nonlinear optimization, but only the solution of an Eigenvalue problem, and that, by the possibility of using different kernels, it comprises a fairly general class of nonlinearities. Clearly, the last point has yet to be evaluated in practice; however, for the Support Vector machine, the utility of different kernels has already been established. Different kernels (polynomial, sigmoid, Gaussian) led to fine classification performances (Table 2.4). The general question of how to select the ideal kernel for a given task (i.e. the appropriate feature space), however, is an open problem.

We conclude this chapter with a twofold outlook. The scene has been set for using the kernel method to construct a wide variety of rather general and still feasible nonlinear variants of classical algorithms. It is beyond the scope of the present work to explore all the possibilities, including many distance-based algorithms, in detail. Some of them are currently being investigated, for instance nonlinear forms of k-means clustering and kernel-based independent component analysis (Schölkopf, Smola, & Müller, 1996). Other domains where researchers have recently started to investigate the use of Mercer kernels include Gaussian Processes (Williams, 1997).

Linear PCA is being used in numerous technical and scientific applications, including noise reduction, density estimation, image indexing and retrieval systems, and the analysis of natural image statistics. Kernel PCA can be applied to all domains where traditional PCA has so far been used for feature extraction, and where a nonlinear extension would make sense.
Chapter 4

Prior Knowledge in Support Vector Machines

In 1995, LeCun et al. published a pattern recognition performance comparison noting the following:

    "The optimal margin classifier [i.e. SV machine, the author] has excellent accuracy, which is most remarkable, because unlike the other high performance classifiers, it does not include a priori knowledge about the problem. In fact, this classifier would do just as well if the image pixels were permuted by a fixed mapping. [...] However, improvements are expected as the technique is relatively new."

One of the key points in developing SV technology is thus the incorporation of prior knowledge about given tasks. Moreover, it is also a key point if we want to learn anything general about the processing of visual information in animals from SV machines: having been exposed to the world for all their life, animals extensively exploit any available knowledge on regularities and invariances of the world.

Two years after the above statement was published, we are now in the position to devote the present chapter to three techniques for incorporating task-specific prior knowledge in SV machines (Schölkopf, Burges, and Vapnik, 1996a; Schölkopf, Simard, Smola, and Vapnik, 1997a).

4.1 Introduction

When we are trying to extract regularities from data, we often have additional knowledge about the functions that we estimate (for a review, see Abu-Mostafa, 1995). For instance, in image classification tasks, there exist transformations which leave class membership invariant (e.g. translations); moreover, it is usually the case that images have a local structure in the sense that not all correlations between image regions carry equal amounts of information. We presently investigate the question of how to make use of these two sources of knowledge.

We first present the Virtual SV method of incorporating prior knowledge about transformation invariances by applying transformations to Support Vectors, the training examples most critical for determining the classification boundary (Sec. 4.2.1).
In Sec. 4.2.2, we design kernel functions which lead to invariant classification hyperplanes. This method is applicable to invariances under the action of differentiable local 1-parameter groups of local transformations, e.g. translational invariance in pattern recognition; the Virtual SV method is applicable to any type of invariance. In the third method proposed in this chapter, we also modify the kernel functions; however, this time not to incorporate transformation invariance, but to take into account image locality by using localized receptive fields (Sec. 4.3). Following that, Sec. 4.4 and Sec. 4.5 give experimental results and a discussion, respectively.

4.2 Incorporating Transformation Invariances

In many applications of learning procedures, certain transformations of the input are known to leave function values unchanged. At least three different ways of exploiting this knowledge have been used (illustrated in Fig. 4.1):

In the first case, the knowledge is used to generate artificial training examples ("virtual examples", Poggio and Vetter, 1992; Baird, 1990) by transforming the training examples accordingly. It is then hoped that, given sufficient time, the learning machine will extract the invariances from the artificially enlarged training data.

In the second case, the learning algorithm itself is modified. This is typically done by using a modified error function which forces a learning machine to construct a function with the desired invariances (Simard et al., 1992).

Finally, in the third case, the invariance is achieved by changing the representation of the data by first mapping them into a more suitable space; an approach pursued for instance by Segman, Rubinstein, and Zeevi (1992), or Vetter and Poggio (1997). The data representation can also be changed by using a modified distance metric, rather than actually changing the patterns (e.g. Simard, LeCun, and Denker, 1993).

Simard et al. (1992) compare the first two techniques and find that for the considered problem (learning a function with three plateaus where function values are locally invariant), training on the artificially enlarged data set is significantly slower, due to both correlations in the artificial data and the increase in training set size. Moving to real-world applications, the latter factor becomes even more important. If the size of a training set is multiplied by a number of desired invariances (by generating a corresponding number of artificial examples for each training pattern), the resulting training sets can get rather large (as the ones used by Drucker, Schapire, and Simard, 1993). However, the method of generating virtual examples has the advantage of being readily implemented for all kinds of learning machines and symmetries. If instead of Lie groups of symmetry transformations one is dealing with discrete symmetries, such as the bilateral symmetries of Vetter, Poggio, and Bülthoff (1994) and Vetter and Poggio (1994), derivative-based methods such as the ones of Simard et al. (1992) are not applicable. It would thus be desirable to have an intermediate method which has the advantages of the virtual examples approach without its computational cost.

The two methods described in the following try to combine the merits of all the approaches mentioned above. The Virtual SV method (Sec. 4.2.1) retains the flexibility and simplicity of virtual examples approaches, while cutting down on their computational cost significantly.


[Figure 4.1: three panels, labelled (left to right) "virtual examples", "tangents", and "representation".]
FIGURE 4.1: Different ways of incorporating invariances in a decision function. The dashed line marks the "true" boundary, disks and circles are the training examples. We assume that prior information tells us that the classification function only depends on the norm of the input vector (the origin being in the center of each picture). Lines crossing an example indicate the type of information conveyed by the different methods of incorporating prior information. Left: generating virtual examples in a localized region around each training example; middle: incorporating a regularizer to learn tangent values (cf. Simard, Victorri, LeCun, and Denker, 1992); right: changing the representation of the data by first mapping each example to its norm. If feasible, the latter method yields the most information. However, if the necessary nonlinear transformation cannot be found, or if the desired invariances are of a localized nature, one has to resort to one of the former techniques. Finally, the reader may note that examples close to the boundary allow us to exploit prior knowledge very effectively: given a method to get a first approximation of the true boundary, the examples closest to it would allow good estimation of the true boundary. A similar two-step approach is pursued in Sec. 4.2.1. (From Schölkopf, Burges, and Vapnik (1996a).)

The two methods described in the following try to combine merits of all the approaches mentioned above. The Virtual SV method (Sec. 4.2.1) retains the flexibility and simplicity of virtual examples approaches, while cutting down on their computational cost significantly. The Invariant Hyperplane method (Sec. 4.2.2), on the other hand, is comparable to the method of Simard et al. (1992) in that it is applicable to all differentiable local 1-parameter groups of local symmetry transformations, comprising a fairly general class of invariances. In addition, it has an equivalent interpretation as a preprocessing operation applied to the data before learning. In this sense, it can also be viewed as changing the representation of the data to a more invariant one, in a task-dependent way.

4.2.1 The Virtual SV Method

In Sec. 2.4, it has been argued that the SV set contains all the information necessary to solve a given classification task. In particular, it was possible to train any one of three different types of SV machines solely on the Support Vector set extracted


by another machine, with a test performance not worse than after training on the full database. Using this finding as a starting point, we now investigate the question whether it might be sufficient to generate virtual examples from the Support Vectors only. After all, one might hope that it does not add much information to generate virtual examples of patterns which are not close to the boundary. In high-dimensional cases, however, care has to be exercised regarding the validity of this intuitive picture. Thus, an experimental test on a high-dimensional real-world problem is imperative.

In our experiments, we proceeded as follows (cf. Fig. 4.2):

1. Train a Support Vector machine to extract the Support Vector set.
2. Generate artificial examples by applying the desired invariance transformations to the Support Vectors. In the following, we will refer to these examples as Virtual Support Vectors (VSVs).
3. Train another Support Vector machine on the generated examples. [Footnote 1: Clearly, the scheme can be iterated; however, care has to be exercised, since the iteration of local invariances would lead to global ones, which are not always desirable; cf. the example of a '6' rotating into a '9' (Simard, LeCun, and Denker, 1993).]

If the desired invariances are incorporated, the curves obtained by applying Lie symmetry transformations to points on the decision surface should have tangents parallel to the latter (cf. Simard et al., 1992). If we use small Lie group transformations to generate the virtual examples, this implies that the Virtual Support Vectors should be approximately as close to the decision surface as the original Support Vectors. Hence, they are fairly likely to become Support Vectors after the second training run. Vice versa, if a substantial fraction of the Virtual Support Vectors turn out to become Support Vectors in the second run, we have reason to expect that the decision surface does have the desired shape.
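A minimal sketch of this two-stage procedure, assuming scikit-learn's SVC and the virtual_examples helper sketched earlier (the function name is hypothetical; the kernel degree and regularization value merely resemble settings used later in this chapter):

    from sklearn.svm import SVC

    def virtual_sv_machine(X, y, kernel="poly", degree=5, C=10.0):
        """Two-stage Virtual SV training: train, transform the SVs, retrain."""
        first = SVC(kernel=kernel, degree=degree, C=C).fit(X, y)

        # Step 1: extract the Support Vector set of the first machine.
        sv_idx = first.support_
        X_sv, y_sv = X[sv_idx], y[sv_idx]

        # Step 2: generate Virtual Support Vectors, here by 1-pixel translations;
        # virtual_examples also returns the original SVs, so the second training
        # set consists of the SVs plus their translates.
        X_vsv, y_vsv = virtual_examples(X_sv, y_sv)

        # Step 3: train a second machine on the original SVs and the Virtual SVs.
        return SVC(kernel=kernel, degree=degree, C=C).fit(X_vsv, y_vsv)

For a multi-class task one would either apply this to the union of the binary classifiers' SV sets or, as described in Sec. 4.4.1, separately to each binary classifier.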


[Figure 4.2: four panels labelled "problem", "separating hyperplanes", "SV hyperplane", and "VSV hyperplane"; an ambiguous test point is marked by a question mark.]
FIGURE 4.2: Suppose we have prior knowledge indicating that the decision function should be invariant with respect to horizontal translations. The true decision boundary is drawn as a dotted line (top left); however, as we are just given a limited training sample, different separating hyperplanes are conceivable (top right). The SV algorithm finds the unique separating hyperplane with maximal margin (bottom left), which in this case is quite different from the true boundary. For instance, it would lead to wrong classification of the ambiguous point indicated by the question mark. Making use of the prior knowledge by generating Virtual Support Vectors from the Support Vectors found in a first training run, and retraining on these, yields a more accurate decision boundary (bottom right). Note, moreover, that for the considered example, it is sufficient to train the SV machine only on virtual examples generated from the Support Vectors.

4.2.2 Constructing Invariance Kernels

Invariance by a Self-Consistency Argument. We face the following problem: to express the condition of invariance of the decision function, we already need to know its coefficients, which are found only during the optimization, which in turn should already take into account the desired invariances. As a way out of this circle, we use the following ansatz: consider decision functions f = sgn ∘ g, where g is defined as

    g(x_j) := \sum_{i=1}^{\ell} \alpha_i y_i \,(B x_j \cdot B x_i) + b,                                        (4.1)

with a matrix B to be determined below. This follows Vapnik (1995b), who suggested incorporating invariances by modifying the dot product used: any nonsingular B defines a dot product, which can equivalently be written in the form (x_j · A x_i), with a positive definite matrix A = B^T B.

Clearly, invariance of g under local transformations of all x_j is a sufficient condition for the same invariance to hold for f = sgn ∘ g, which is what we are aiming for. Strictly speaking, however, invariance of g is not necessary at points which are not Support Vectors, since these lie in a region where (sgn ∘ g) is constant.

The above notion of invariance refers to invariance when evaluating the decision function. A different notion could ask the question whether the separating hyperplane,


including its margin, would change if the training examples were transformed. It turns out that when discussing the invariance of g rather than f, these two concepts are closely related. In the following argument, we restrict ourselves to the optimal margin case (ξ_i = 0 for all i = 1, ..., ℓ), where the margin is well-defined. As the separating hyperplane and its margin are expressed in terms of Support Vectors, locally transforming a Support Vector x_i will change the hyperplane or the margin if g(x_i) changes: if |g| gets smaller than 1, the transformed pattern will lie in the margin, and the recomputed margin will be smaller; if |g| gets larger than 1, the margin might become bigger, depending on whether the pattern can be expressed in terms of the other SVs (cf. the remark in point 2 of the enumeration preceding Proposition 2.1.2). In terms of the mechanical analogy of Sec. 2.1.2: moving Support Vectors changes the mechanical equilibrium for the sheet separating the classes. Conversely, a local transformation of a non-Support Vector will never change f, even if the value of g changes, as the solution of the programming problem is expressed in terms of the Support Vectors only.

In this sense, invariance of f under local transformations of the given data corresponds to invariance of (4.1) for the Support Vectors. Note, however, that this criterion is not readily applicable: before finding the Support Vectors in the optimization, we already need to know how to enforce invariance. Thus the above observation cannot be used directly; however, it could serve as a starting point for constructing heuristics or iterative solutions. In the Virtual SV method (Sec. 4.2.1), a first run of the standard SV algorithm is carried out to obtain an initial SV set; similar heuristics could be applied in the present case.

Local invariance of g for each pattern x_j under transformations of a differentiable local 1-parameter group of local transformations L_t,

    \frac{\partial}{\partial t}\Big|_{t=0} g(L_t x_j) = 0,                                        (4.2)

can be approximately enforced by minimizing the regularizer

    \frac{1}{\ell} \sum_{j=1}^{\ell} \left( \frac{\partial}{\partial t}\Big|_{t=0} g(L_t x_j) \right)^2.                                        (4.3)

Note that the sum may run over labelled as well as unlabelled data, so in principle one could also require the decision function to be invariant with respect to transformations of elements of a test set. Moreover, we could use different transformations for different patterns.
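For image translations, the tangent ∂/∂t|_{t=0} L_t x_j can be approximated by a finite difference between a slightly shifted pattern and the original one, much as is done for the nonlinear case in (4.15) below. A small numpy sketch (the helper name and the one-pixel step are assumptions for illustration):

    import numpy as np
    from scipy.ndimage import shift

    def translation_tangent(x, image_shape=(16, 16), axis=1, step=1.0):
        """Finite-difference approximation of d/dt|_{t=0} L_t x for a
        horizontal (axis=1) or vertical (axis=0) translation of a flattened image."""
        img = x.reshape(image_shape)
        delta = [0, 0]
        delta[axis] = 1
        # (L_t x - x) / t with t = one pixel; order=1 uses bilinear interpolation.
        moved = shift(img, delta, order=1, mode="nearest")
        return ((moved - img) / step).ravel()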


For (4.1), the local invariance term (4.2) becomes

    \frac{\partial}{\partial t}\Big|_{t=0} g(L_t x_j)
      = \frac{\partial}{\partial t}\Big|_{t=0} \left( \sum_{i=1}^{\ell} \alpha_i y_i \,(B L_t x_j \cdot B x_i) + b \right)
      = \sum_{i=1}^{\ell} \alpha_i y_i \,\frac{\partial}{\partial t}\Big|_{t=0} (B L_t x_j \cdot B x_i)
      = \sum_{i=1}^{\ell} \alpha_i y_i \,\partial_1 (B L_0 x_j \cdot B x_i) \cdot B \frac{\partial}{\partial t}\Big|_{t=0} L_t x_j,                                        (4.4)

using the chain rule. Here, ∂_1(B L_0 x_j · B x_i) denotes the gradient of (x · y) with respect to x, evaluated at the point (x · y) = (B L_0 x_j · B x_i).

As a side remark, note that a sufficient, albeit rather strict, condition for invariance is thus that ∂/∂t|_{t=0} (B L_t x_j · B x_i) vanish for all i, j; however, we will proceed with our derivation, with the goal of imposing weaker conditions, which apply for one specific decision function rather than simultaneously for all decision functions expressible by different choices of the coefficients α_i y_i.

Substituting (4.4) into (4.3), and using the facts that L_0 = I and ∂_1(x · y) = y^T, yields the regularizer

    \frac{1}{\ell} \sum_{j=1}^{\ell} \left( \sum_{i=1}^{\ell} \alpha_i y_i (B x_i)^\top B \frac{\partial}{\partial t}\Big|_{t=0} L_t x_j \right)^2
      = \frac{1}{\ell} \sum_{j=1}^{\ell} \left( \sum_{i=1}^{\ell} \alpha_i y_i (B x_i)^\top B \frac{\partial}{\partial t}\Big|_{t=0} L_t x_j \right)
        \left( \sum_{k=1}^{\ell} \alpha_k y_k \left( B \frac{\partial}{\partial t}\Big|_{t=0} L_t x_j \right)^\top (B x_k) \right)
      = \sum_{i,k=1}^{\ell} \alpha_i y_i \alpha_k y_k \,(B x_i \cdot B C B^\top B x_k),                                        (4.5)

where

    C := \frac{1}{\ell} \sum_{j=1}^{\ell} \left( \frac{\partial}{\partial t}\Big|_{t=0} L_t x_j \right) \left( \frac{\partial}{\partial t}\Big|_{t=0} L_t x_j \right)^\top.                                        (4.6)

We now choose B such that (4.5) reduces to the standard SV target function (2.7) in the form obtained by substituting (2.11) into it (cf. the quadratic term of (2.13)), utilizing the dot product chosen in (4.1), i.e. such that

    (B x_i \cdot B C B^\top B x_k) = (B x_i \cdot B x_k).                                        (4.7)

Assuming that the x_i span the whole space, this condition becomes

    B^\top B C B^\top B = B^\top B,                                        (4.8)

or, by requiring B to be nonsingular, i.e. that no information get lost during the preprocessing, B C B^T = I. This can be satisfied by a preprocessing matrix

    B = C^{-\frac{1}{2}},                                        (4.9)

the nonnegative square root of the inverse of the nonnegative matrix C defined in (4.6). In practice, we use a matrix

    C_\lambda := (1 - \lambda)\, C + \lambda I,                                        (4.10)


with 0 < λ ≤ 1, instead of C. As C is nonnegative, C_λ is invertible. For λ = 1, we recover the standard SV optimal hyperplane algorithm; other values of λ determine the trade-off between invariance and model complexity control. It can be shown that using C_λ corresponds to using an objective function of the form (1 − λ) Σ_i (w · ∂/∂t|_{t=0} L_t x_i)² + λ‖w‖² (see Appendix D.3).

By choosing the preprocessing matrix B according to (4.9), we have obtained a formulation of the problem where the standard SV quadratic optimization technique does in effect minimize the tangent regularizer (4.3): the maximum of (2.13) subject to (2.14) and (2.15), using the modified dot product as in (4.1), coincides with the minimum of (4.3) subject to the separation conditions y_i · g(x_i) ≥ 1, where g is defined as in (4.1).

Note that preprocessing with B does not affect classification speed: since (B x_j · B x_i) = (x_j · B^T B x_i), we can precompute B^T B x_i for all SVs x_i and thus obtain a machine (with modified SVs) which is as fast as a standard SV machine (cf. (4.1)).

In the nonlinear case, where kernel functions k(x, y) are substituted for every occurrence of a dot product, the above analysis of transformation invariance leads to the regularizer

    \frac{1}{\ell} \sum_{j=1}^{\ell} \left( \sum_{i=1}^{\ell} \alpha_i y_i \,\partial_1 k(B x_j, B x_i) \cdot B \frac{\partial}{\partial t}\Big|_{t=0} L_t x_j \right)^2.                                        (4.11)

The derivative of k must be evaluated for specific kernels, e.g. for k(x, y) = (x · y)^d, ∂_1 k(x, y) = d · (x · y)^{d−1} · y^T. To obtain a kernel-specific constraint on the matrix B, one would need to equate the result with the quadratic term in the nonlinear objective function,

    \sum_{i,k} \alpha_i y_i \alpha_k y_k \, k(B x_i, B x_k).                                        (4.12)
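Returning to the linear case, here is a compact numpy sketch of the preprocessing defined by (4.6), (4.9) and (4.10), assuming tangent vectors computed, for instance, with the translation_tangent sketch above (all names and the default λ are illustrative choices):

    import numpy as np

    def invariance_preprocessing(X, tangents, lam=0.4):
        """Build B = C_lambda^{-1/2} from tangent vectors (one row per pattern)
        and return the preprocessed patterns B x_i; cf. Eqs. (4.6), (4.9), (4.10)."""
        T = np.asarray(tangents)                                  # shape (l, N)
        C = T.T @ T / len(T)                                      # tangent covariance matrix, Eq. (4.6)
        C_lam = (1.0 - lam) * C + lam * np.eye(C.shape[0])        # Eq. (4.10)

        # B = C_lambda^{-1/2} via the eigendecomposition C_lambda = S D S^T, cf. Eq. (4.13).
        eigvals, S = np.linalg.eigh(C_lam)
        B = S @ np.diag(eigvals ** -0.5) @ S.T
        return X @ B.T, B                                         # rows are B x_i

As noted above, B^T B x_i can be precomputed for the Support Vectors, so this preprocessing need not slow down classification.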


Relationship to Principal Component Analysis. Let us now provide some interpretation of (4.9) and (4.6). The tangent vectors ±∂/∂t|_{t=0} L_t x_j have zero mean; thus C is a sample estimate of the covariance matrix of the random vector s · ∂/∂t|_{t=0} L_t x, with s ∈ {±1} a random sign. Based on this observation, we call C (4.6) the Tangent Covariance Matrix of the data set {x_i : i = 1, ..., ℓ} with respect to the transformations L_t. Being positive definite [Footnote 2: it is understood that we use C_λ if C is not definite (cf. (4.10))], C can be diagonalized, C = S D S^T, with an orthogonal matrix S consisting of C's eigenvectors and a diagonal matrix D containing the corresponding positive eigenvalues. Then we can compute

    B = C^{-\frac{1}{2}} = S D^{-\frac{1}{2}} S^\top,                                        (4.13)

where D^{-1/2} is the diagonal matrix obtained from D by taking the inverse square roots of the diagonal elements. Since the dot product is invariant under orthogonal transformations, we may drop the leading S and (4.1) becomes

    g(x_j) = \sum_{i=1}^{\ell} \alpha_i y_i \,\big( D^{-\frac{1}{2}} S^\top x_j \cdot D^{-\frac{1}{2}} S^\top x_i \big) + b.                                        (4.14)

A given pattern x is thus first transformed by projecting it onto the eigenvectors of the tangent covariance matrix C, which are the rows of S^T. The resulting feature vector is then rescaled by dividing by the square roots of C's eigenvalues. [Footnote 3: As an aside, note that our goal of building invariant SV machines has thus serendipitously provided us with an approach to an open problem in SV learning, namely the problem of scaling: in SV machines, there has so far been no way of automatically assigning different weights to different directions in input space; in a trained SV machine, the weights of the first layer (the SVs) form a subset of the training set. Choosing these Support Vectors from the training set only gives rather limited possibilities for appropriately dealing with different scales in different directions of input space.] In other words, the directions of main variance of the random vector ∂/∂t|_{t=0} L_t x are scaled back, thus more emphasis is put on features which are less variant under L_t. For example, in image analysis, if the L_t represent translations, more emphasis is put on the relative proportions of ink in the image rather than the positions of lines. The PCA interpretation of our preprocessing matrix suggests the possibility to regularize and reduce dimensionality by discarding part of the features, as is common usage when doing PCA. As an aside, note that the resulting matrix will still satisfy (4.8). [Footnote 4: To see this, first note that if B solves B^T B C B^T B = B^T B, and B's polar decomposition is B = U B_s, with U U^T = I and B_s = B_s^T, then B_s also solves it. Thus, we may restrict ourselves to symmetrical solutions. For our choice B = C^{-1/2}, B commutes with C, hence they can be diagonalized simultaneously. In this case, B² C B² = B² can clearly also be satisfied by any matrix which is obtained from B by setting an arbitrary selection of eigenvalues to 0 (in the diagonal representation).]

Combining the PCA interpretation with the considerations following (4.1) leads to an interesting observation: by computing the tangent covariance matrix from the SVs only, rather than from the full data set, it can be rendered a task-dependent covariance matrix. Although the summation in (4.6) does not take into account class labels y_i, it then implicitly depends on the task to be solved via the SV set, which is computed for the given task. Thus, it allows the extraction of features which are invariant in a task-dependent way: it does not matter whether features for "easy" patterns change with transformations; it is more important that the "hard" patterns, close to the decision boundary, lead to invariant features.

The Nonlinear Tangent Covariance Matrix. We are now in a position to describe a feasible way to generalize to the nonlinear case. To this end, we use kernel principal component analysis (Chapter 3). This technique allows us to compute principal components in a space F nonlinearly related to input space. The kernel function k plays the role of the dot product in F, i.e. k(x, y) = (Φ(x) · Φ(y)). To generalize (4.14) to the nonlinear case, we compute the tangent covariance matrix C (Eq. 4.6) in feature space F, and its projection onto the subspace of F which is given by the linear span of the tangent vectors in F. There, the considerations of the linear case


apply. The whole procedure reduces to computing dot products in F, which can be done using k, without explicitly mapping into F.

In rewriting (4.6) for the nonlinear case, we substitute finite differences, with t > 0, for derivatives:

    C := \frac{1}{\ell t^2} \sum_{j=1}^{\ell} \big( \Phi(L_t x_j) - \Phi(x_j) \big) \big( \Phi(L_t x_j) - \Phi(x_j) \big)^\top.                                        (4.15)

For the sake of brevity, we have omitted the summands corresponding to derivatives in the opposite direction, which ensure that the data set is centered. For the final tangent covariance matrix C, they do not make a difference, as the two negative signs cancel out.

In high-dimensional feature spaces, C cannot be computed explicitly. In complete analogy to Chapter 3, we compute another matrix whose eigenvalues and eigenvectors will allow us to extract features corresponding to eigenvectors and eigenvalues of C. This is done by taking dot products from both sides with Φ(L_t x_i) − Φ(x_i) (the eigenvectors in F can be expanded in terms of the latter, by the same argument as the one leading to (3.10)). Defining

    K_{ij} = k(x_i, x_j),                                        (4.16)
    K^t_{ij} = k(x_i, L_t x_j) + k(L_t x_i, x_j),                                        (4.17)
    K^{tt}_{ij} = k(L_t x_i, L_t x_j),                                        (4.18)

we get

    \big( \Phi(L_t x_i) - \Phi(x_i) \big)^\top C \big( \Phi(L_t x_k) - \Phi(x_k) \big)
      = \frac{1}{\ell t^2} \sum_{j=1}^{\ell} (K^{tt}_{ij} - K^t_{ij} + K_{ij})(K^{tt}_{jk} - K^t_{jk} + K_{jk})
      = \left[ \frac{1}{\ell t^2} (K^{tt} - K^t + K)^2 \right]_{ik}.                                        (4.19)
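The matrices (4.16)-(4.18) and the combination K^{tt} − K^t + K appearing in (4.19) can be assembled directly from any kernel function and any transformation. A sketch, with an RBF kernel used purely as an example and the eigendecomposition at the end anticipating the eigenvalue problem (4.23) derived below (names, defaults and the normalization of the eigenvectors are left as assumptions):

    import numpy as np

    def tangent_kernel_matrices(X, Xt, kernel, t=1.0):
        """Given patterns X and their transformed versions Xt = L_t X, return
        K, K^t, K^tt of Eqs. (4.16)-(4.18) and M = (K^tt - K^t + K) / (l * t**2),
        whose eigenvectors give the expansion coefficients alpha (cf. (4.20)-(4.23);
        the rescaling (4.24) is omitted here)."""
        K = kernel(X, X)                      # Eq. (4.16)
        Kt = kernel(X, Xt) + kernel(Xt, X)    # Eq. (4.17)
        Ktt = kernel(Xt, Xt)                  # Eq. (4.18)
        M = (Ktt - Kt + K) / (len(X) * t ** 2)
        eigvals, alphas = np.linalg.eigh(M)   # columns of `alphas` are candidate alpha vectors
        return K, Kt, Ktt, M, eigvals, alphas

    def rbf_kernel(A, B, gamma=0.5):
        """k(x, y) = exp(-gamma * ||x - y||^2); an example choice only."""
        d2 = (np.square(A).sum(1)[:, None] + np.square(B).sum(1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-gamma * d2)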


Using (4.19) and eigenvector expansions

    V = \sum_{k=1}^{\ell} \alpha_k \big( \Phi(L_t x_k) - \Phi(x_k) \big),                                        (4.20)

the eigenvalue problem that we need to solve (cf. (3.9)),

    \lambda \big( \Phi(L_t x_i) - \Phi(x_i) \big)^\top \sum_{k=1}^{\ell} \alpha_k \big( \Phi(L_t x_k) - \Phi(x_k) \big)
      = \big( \Phi(L_t x_i) - \Phi(x_i) \big)^\top C \sum_{k=1}^{\ell} \alpha_k \big( \Phi(L_t x_k) - \Phi(x_k) \big),                                        (4.21)

then takes the form

    \lambda \,(K^{tt} - K^t + K)\,\alpha = \frac{1}{\ell t^2} (K^{tt} - K^t + K)^2 \alpha.                                        (4.22)

To find solutions of (4.22), we solve the eigenvalue problem (cf. (3.14)) [Footnote 5: If we expand V in a different set of vectors, we instead arrive at a problem of simultaneous diagonalization of two matrices.]

    \lambda \,\alpha = \frac{1}{\ell t^2} (K^{tt} - K^t + K)\,\alpha.                                        (4.23)

Normalization of each eigenvector (4.20) is carried out by requiring (V · V) = 1, which, as in (3.16), translates into

    \lambda \,(\alpha \cdot \alpha) = 1,                                        (4.24)

using the corresponding eigenvalue λ.

Feature extraction for a test point x is done by computing the projection of Φ(x) onto the eigenvectors V,

    V^\top \Phi(x) = \sum_{k=1}^{\ell} \alpha_k \big( \Phi(L_t x_k) - \Phi(x_k) \big)^\top \Phi(x)
                   = \sum_{k=1}^{\ell} \alpha_k \big( k(L_t x_k, x) - k(x_k, x) \big).                                        (4.25)

In Appendix D.3, we give an alternative justification of this procedure, which naturally arises from requiring invariance in feature space, without the need for a PCA interpretation.

4.3 Image Locality and Local Feature Extractors

By using a kernel k(x, y) = (x · y)^d, one implicitly constructs a decision boundary in the space of all possible products of d pixels. This may not be desirable, since in natural images, correlations over short distances are much more reliable as features than long-range correlations. To take this into account, we define a kernel k_p^{d1,d2} as follows (cf. Fig. 4.3):

1. compute a third image z, defined as the pixel-wise product of x and y;
2. sample z with pyramidal receptive fields of diameter p, centered at all locations (i, j), to obtain the values z_ij;
3. raise each z_ij to the power d1, to take into account local correlations within the range of the pyramid;


4. sum the z_ij^{d1} over the whole image, and raise the result to the power d2 to allow for long-range correlations of order d2.

The resulting kernel will be of order d1 · d2; however, it will not contain all possible correlations of d1 · d2 pixels. (A small sketch of this kernel is given after Fig. 4.3.)

[Figure 4.3: schematic of the kernel; two images x and y, localized dot products raised to the power d1, summed, and the result raised to the power d2.]
FIGURE 4.3: Kernel utilizing local correlations in images. To compute k(x, y) for two images x and y, we sum over products between corresponding pixels of the two images in localized regions (indicated in the figure by dot products), weighed by pyramidal receptive fields. To the outputs, a first nonlinearity in the form of an exponent d1 is applied. The resulting numbers for all patches (only two are displayed) are summed, and the d2-th power of the result is taken as the value k(x, y). This kernel corresponds to a dot product in a polynomial space which is spanned mainly by localized correlations between pixels (see Sec. 4.3).
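A possible implementation of k_p^{d1,d2}, assuming square flattened images, a simple pyramid-shaped receptive field, and scipy for the 2-D convolution; the receptive-field shape and all defaults are illustrative choices, not the ones used in the experiments below:

    import numpy as np
    from scipy.signal import convolve2d

    def pyramid_weights(p):
        """Pyramidal receptive field of diameter p (linear ramp from the center)."""
        r = (p - 1) / 2.0
        ax = np.abs(np.arange(p) - r)
        w = np.maximum(0.0, 1.0 - np.maximum.outer(ax, ax) / (r + 1.0))
        return w / w.sum()

    def local_kernel(x, y, p=9, d1=2, d2=2, image_shape=(16, 16)):
        """k_p^{d1,d2}(x, y): pixel-wise product, pyramidal pooling of diameter p,
        local nonlinearity d1, global sum, global nonlinearity d2 (cf. Sec. 4.3)."""
        z = x.reshape(image_shape) * y.reshape(image_shape)       # step 1
        zij = convolve2d(z, pyramid_weights(p), mode="same")      # step 2
        return float((zij ** d1).sum() ** d2)                     # steps 3 and 4

Such a function can be plugged into any SV implementation that accepts user-defined kernels, e.g. via a precomputed Gram matrix.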


4.4 Experimental Results

4.4.1 Virtual Support Vectors

USPS Digit Recognition. The first set of experiments was conducted on the USPS database of handwritten digits (Appendix C). This database has been used extensively in the literature, with a LeNet1 Convolutional Network achieving a test error rate of 5.0% (LeCun et al., 1989). As in Sec. 2.3, we set the regularization constant to 10.

Virtual Support Vectors were generated for the set of all different Support Vectors of the ten classifiers. Alternatively, one can carry out the procedure separately for the ten binary classifiers, thus dealing with smaller training sets during the training of the second machine. Table 4.1 shows that incorporating only translational invariance already improves performance significantly, from 4.0% to 3.2% error rate. For other types of invariances (Fig. 4.4), we also found improvements, albeit smaller ones: generating Virtual Support Vectors by rotation or by the line thickness transformation of Drucker, Schapire, and Simard (1993), we constructed polynomial classifiers with 3.7% error rate (in both cases).

Note, moreover, that generating virtual examples from the full database rather than just from the SV sets did not improve the accuracy, nor did it enlarge the SV set of the final classifier substantially. This finding was reproduced for the Virtual SV system mentioned in Sec. 2.5.3: in that case, similar to Table 4.1, generating virtual examples from the full database led to identical performance, and only slightly increased SV set size (861 instead of 806). From this, we conclude that for the considered recognition task, it is sufficient to generate virtual examples only from the SVs; virtual examples generated from the other patterns do not add much useful information.

MNIST Digit Recognition. The larger a database, the more information about invariances of the decision function is already contained in the differences between patterns of the same class.

TABLE 4.2: Application of the Virtual SV method to the MNIST database. Virtual SVs were generated by translating the original SVs in all four principal directions (by 1 pixel). Results are given for the original SV machine and two VSV systems utilizing different kernel degrees; in all cases, the regularization constant was set to 10 (cf. (2.19)). SV: degree 5 polynomial SV classifier; VSV1: VSV machine with degree 5 polynomial kernel; VSV2: same with degree 9 kernel. The first table gives the performance: for the ten binary recognizers, as numbers of errors; for multi-class classification (T1), in terms of error rates (in %), both on the 60000 element test set. The second multi-class error rate (T2) was computed by testing only on a 10000 element subset of the full 60000 element test set. These results are given for reference purposes; they are the ones usually reported in MNIST performance studies. The second table gives numbers of SVs for all ten binary digit recognizers.

Errors (binary recognizers and 10-class system):
system    0     1     2     3     4     5     6     7     8     9     T1    T2
SV       131    97   243   240   212   241   195   259   343   409   1.6   1.4
VSV1      95    84   186   176   173   171   127   217   233   289   1.1   1.0
VSV2      81    66   164   146   141   147   119   179   196   254   1.0   0.8

Support Vectors:
system    0      1      2      3      4      5      6      7      8      9
SV       1206    757   2183   2506   1784   2255   1347   1712   3053   2720
VSV1     2938   1887   5015   4764   3983   5235   3328   3968   6978   6348
VSV2     3941   2136   6598   7380   5127   6466   4128   5014   8701   7720


To show that it is nevertheless possible to improve classification accuracies with our technique, we applied the method to the MNIST database (Appendix C) of 60000 handwritten digits. This database has become the standard for performance comparisons at AT&T Bell Labs; the error rate record of 0.7% is held by a boosted LeNet4 (Bottou et al., 1994; LeCun et al., 1995), i.e. by an ensemble of learning machines. The best single machine in the performance comparisons so far was a LeNet5 convolutional neural network (0.9%); other high performance systems include Tangent Distance nearest neighbour classifiers (1.1%) and LeNet4 with a last layer using methods of local learning (1.1%, cf. Bottou and Vapnik, 1992).

Using Virtual Support Vectors generated by 1-pixel translations, we improved a degree 5 polynomial SV classifier from 1.4% to 1.0% error rate on the 10000 element test set (Table 4.2). In this case, we applied our technique separately for all ten Support Vector sets of the binary classifiers (rather than for their union) in order to avoid having to deal with large training sets in the retraining stage. Note, moreover, that for the MNIST database, we did not compare results of the VSV technique to those for generating virtual examples from the whole database: the latter is computationally exceedingly expensive, as it entails training on a very large training set. We did, however, make a comparison for the small MNIST database (Appendix C). There, a degree 5 polynomial classifier was improved from 3.8% to 2.5% error by the Virtual SV method, with an increase of the average SV set sizes from 324 to 823. By generating virtual examples from the full training set, and retraining on these, we obtained a system which had slightly more SVs (939), but an unchanged error rate.

After retraining, the number of SVs more than doubled (Table 4.2). Thus, although the training sets for the second set of binary classifiers were substantially smaller than the original database (for four Virtual SVs per SV, four times the size of the original SV sets, in our case amounting to around 10^4), we concluded that the amount of data in the region of interest, close to the decision boundary, had more than doubled. Therefore, we reasoned that it should be possible to use a more complex decision function in the second stage (note that the risk bound (1.5) depends on the ratio of VC-dimension and training set size). Indeed, using a degree 9 polynomial led to an error rate of 0.8%, very close to the record performance of 0.7%.

Another interesting performance measure is the rejection error rate, defined as the percentage of patterns that would have to be rejected to attain a specified error rate (in the benchmark studies of Bottou et al. (1994) and LeCun et al. (1995), 0.5%). Note that this percentage is computed on the test set. In our case, using the confidence measure of Sec. 2.1.6, it was measured to be 0.9%, a large improvement compared to the original SV system (2.4%). In the above benchmark studies, only the boosted LeNet4 ensemble performed better (0.5%).

Further improvements can possibly be achieved by combining different types of invariances. Another intriguing extension of the scheme would be to use techniques based on image correspondence (e.g. Vetter and Poggio, 1997) to extract invariance transformations from the training set. Those transformations can then be used to generate Virtual Support Vectors. [Footnote 6: Together with Thomas Vetter, we have recently started working on this approach.]


[Figure 4.5: three head images, panels A, B, C.]
FIGURE 4.5: Virtual SVs in gender classification. A: 2-D image of a 3-D head model (from the MPI head database (Troje and Bülthoff, 1996; Vetter and Troje, 1997)); B: 2-D image of the rotated 3-D head; C: artificial image, generated from A using the assumption that it belongs to a cylinder-shaped 3-D object (rotation by the same angle as B).

TABLE 4.3: Numbers of test errors for gender classification in novel pose, using Virtual SVs (qualitatively similar to Fig. 4.5). The training set contained 100 views of male and female heads (divided 49:51), taken at an azimuth of 24°, downsampled to 32 × 32. The test set contained 100 frontal views of the same heads. We used polynomial SV classifiers of different degrees, generating one Virtual SV per original SV. Clearly, training and test views are differently distributed; however, the amount of rotation (24°) was known to the classifier in the sense that it was used for generating the Virtual SVs (Fig. 4.5): first, a simplified head model was inferred by averaging over in-depth revolutions of all the 2-D views. VSVs were generated by projecting the original SVs onto the head model, then rotating the head to the frontal view, and computing the new 2-D view.

                                     degree
prior knowledge                  1    2    3    4    5
no virtual SVs                  25   24   23   21   19
virtual SVs from 3D model       11   10   10    9   10

Face Classification. Certain types of transformations, such as the translations and rotations used above, apply equally well to object recognition as they do to character recognition. There are, however, types of transformations which are specific to the class of images considered (cf. Sec. 1.1). For instance, line thickness transformations (Fig. 4.4) are specific to character recognition. To provide an example of Virtual SVs which are specific to object recognition, we generated Virtual SVs corresponding to object rotations in depth, by making assumptions about the 3-D shape of objects. Clearly, such an approach would have a hard time if applied to complex objects such as chairs (Appendix A). For human heads, however, it is possible to formulate 2-D image transformations which can be applied to generate approximate novel views of heads (Fig. 4.5). Using these views improved accuracies in a small gender classification experiment. Table 4.3 gives details and results of the experiment.


TABLE 4.4: Test error rates for two object recognition databases, for views of resolution 16 × 16, using different types of approximate invariance transformations to generate Virtual SVs, and polynomial kernels of degree 20 (cf. Table 2.1). The second training run in the Virtual SV systems was done on the original SVs and the generated Virtual SVs. The training sets with 25 and 89 views per object are regularly spaced; for them, mirroring does not provide additional information. The interesting case is the one where we trained on the 100-view-per-object sets. Here, a combination of Virtual SVs from mirroring and rotation substantially improves accuracies on both databases.

database:                        animal               entry level
training set: views per object
Virtual SVs                   25    89   100       25    89   100
none (orig. system)          13.0   1.7   4.8     13.0   1.8   2.4
mirroring                    13.6   1.8   4.8     14.2   2.8   3.2
translations                 16.4   1.6   4.3     17.1  11.1   4.8
rotations                     9.0   0.7   3.0     10.3   1.8   2.5
rotations & mirroring         9.0   0.7   1.7      9.6   0.9   1.7

Discrete Symmetries in Object Recognition. As mentioned above, rigid transformations of 3-D objects do not in general correspond to simple transformations of the produced 2-D images (cf. Sec. 4.4.1). For the MPI object databases (Appendix A), however, there exists a type of invariance transformation which can easily be computed from the images: as all the objects used are (approximately) bilaterally symmetric, we can easily produce another valid view of the same object, with a different viewing angle, by performing a mirror operation with respect to a vertical axis in the center of the images, say (Vetter, Poggio, and Bülthoff, 1994). If the objects were exactly symmetric, we would not expect any additional information to be gained in the case of the regularly spaced object sets (25 and 89 views per object), as in these the snapshots are already sampled symmetrically around the zero view direction, which in most cases coincided with the symmetry plane. The slight decrease in performance in that case (Table 4.4) indicates that for some objects, the symmetry only holds approximately (for snapshots, see Appendix A).

To get more robust results, we tried combining this type of invariance transformation with other types. As in the case of character recognition, we simply used translations (by 1 pixel in all four directions) and image-plane rotations (by 10 degrees in both directions). Even though these transformations are but very crude approximations of transformations which occur when a 3-D object is rotated in space, they did in some cases yield significant performance improvements. [Footnote 7: The following may serve as a partial explanation of why rotations were more useful than translations. First, different snapshots at large elevations can be transformed into each other by an approximate image-plane rotation, and second, image-plane rotations retain the centering which was applied to the original images. Both points suggest that virtual examples generated by rotations should be more "realistic" than those generated by translations.]


To examine the effect of the mirror symmetry Virtual SVs, we need to focus on the non-regularly spaced training set with 100 views per object. There, by far the best performance for both the entry level and the animal database was obtained by using both mirroring and rotations (Table 4.4).

TABLE 4.5: Speed improvement using the Reduced Set method. The second through fourth columns give numbers of errors on the 10000 element MNIST test set for the original system, the Virtual Support Vector system, and the reduced set system (for the 10-class classifiers, the error is given in %). The last three columns give, for each system, the number of vectors whose dot product must be computed in the test phase.

Digit      SV err   VSV1 err   RS err    SV #    VSV1 #   RS #
0            17        15        18      1206     2938     59
1            15        13        12       757     1887     38
2            34        23        30      2183     5015    100
3            32        21        27      2506     4764     95
4            30        30        35      1784     3983     80
5            29        23        27      2255     5235    105
6            30        18        24      1347     3328     67
7            43        39        57      1712     3968     79
8            47        35        40      3053     6978    140
9            56        40        40      2720     6348    127
10-class    1.4%      1.0%      1.1%

Virtual SV Combined with Reduced Set. Apart from the increase in overall training time (by a factor of two, in our experiments), the VSV technique has the computational disadvantage that many of the Virtual Support Vectors become Support Vectors for the second machine, increasing the cost of evaluating the decision function (2.25). However, the latter problem can be solved with the Reduced Set (RS) method (Burges, 1996, see Appendix D.1.1), which reduces the complexity of the decision function representation by approximating it in terms of fewer vectors. In a study combining the VSV and RS methods, we achieved a factor of fifty speedup in the test phase over the Virtual Support Vector machine, with only a small decrease in performance (Burges and Schölkopf, 1997). We next briefly report the results of this study. The RS results reported were obtained by Chris Burges.

As a starting point for the RS computation, we used the VSV1 machine (Table 4.2), which achieved a 1.0% error rate on the 10000 element MNIST test set. [Footnote 8: At the time when the described study was carried out, VSV1 was our best system; VSV2 was not yet available.]


The improvement in accuracy compared to the SV machine (Table 4.2) comes at a cost in classification speed of approximately a factor of 2. Furthermore, the speed of SV was comparatively slow to start with (cf. LeCun et al., 1995), requiring approximately 14 million multiply-adds for one classification. In order to become competitive with systems of comparable accuracy (LeCun et al., 1995), we need approximately a factor of fifty improvement in speed. We therefore approximated VSV1 with a reduced set system RS with a factor of fifty fewer vectors than the number of Support Vectors in VSV1.

Table 4.5 compares results on the 10000 element test set for the three systems. Overall, the SV machine performance of 1.4% error is improved to 1.1%, with a machine requiring a factor of 22 fewer multiply-adds (RS). For details on the computation of the RS solution, see (Burges and Schölkopf, 1997).

4.4.2 Invariant Hyperplane Method

In the experiments exploring the invariant hyperplane method (Sec. 4.2.2), we used the small MNIST database (Appendix C). We start by giving some baseline classification results.

Using a standard linear SV machine (i.e. a separating hyperplane, Sec. 2.1.3), we obtain a test error rate of 9.8%; by using a polynomial kernel of degree 4, this drops to 4.0%. In all of the following experiments, we use degree 4 kernels of various types. The number 4 was chosen as it can be written as a product of two integers, so that we could compare results to a kernel k_p^{d1,d2} with d1 = d2 = 2 (cf. Sections 4.3 and 4.4.3). For the considered classification task, results for higher polynomial degrees are very similar.

In a series of experiments with a homogeneous polynomial kernel k(x, y) = (x · y)^4, using preprocessing with Gaussian smoothing kernels of standard deviation in the range 0.1, 0.2, ..., 1.0, we obtained error rates which gradually increased from 4.0% to 4.3%. We concluded that no improvement of the original 4.0% performance was possible by a simple smoothing operation.

Invariant Hyperplanes Results. Table 4.6 reports results obtained by preprocessing all patterns with B (cf. (4.9)), choosing different values of λ (cf. Eq. (4.10)).

TABLE 4.6: Classification error rates for modifying the kernel k(x, y) = (x · y)^4 with the invariant hyperplane preprocessing matrix B_λ = C_λ^{-1/2}; cf. Eqs. (4.9)-(4.10). Enforcing invariance with λ = 0.2, 0.3, ..., 0.9 leads to improvements over the original performance (λ = 1).

λ                  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
error rate in %    4.2   3.8   3.6   3.6   3.7   3.8   3.8   3.9   3.9   4.0



FIGURE 4.6: The first pattern in the small MNIST database, preprocessed with B_λ = C_λ^{-1/2} (cf. Eqs. (4.9)-(4.10)), enforcing various amounts of invariance. Top row: λ = 0.1, 0.2, 0.3, 0.4; bottom row: λ = 0.5, 0.6, 0.7, 0.8. For some values of λ, the preprocessing resembles a smoothing operation; however, it leads to higher classification accuracies (see Sec. 4.4.2) than the latter.

In the experiments, the patterns were first rescaled to have entries in [0, 1], then B was computed, using horizontal and vertical translations, and preprocessing was carried out; finally, the resulting patterns were scaled back again (for snapshots of the resulting patterns, see Fig. 4.6). The scaling was done to ensure that patterns and derivatives lie in comparable regions of R^N (note that if the pattern background level is a constant −1, then its derivative is 0). The results show that even though (4.6) was derived for the linear case, it leads to improvements in the nonlinear case (here, for a degree 4 polynomial).

Dimensionality Reduction. The above [0, 1] scaling operation is affine rather than linear; hence the argument leading to (4.14) does not hold for this case. We thus only report results on dimensionality reduction for the case where the data is kept in [0, 1] scaling during the whole procedure. Dropping principal components which are less important leads to substantial improvements (Table 4.7); cf. the explanation following (4.14).

TABLE 4.7: Dropping directions corresponding to small eigenvalues of C, i.e. dropping less important principal components (cf. (4.14)), leads to substantial improvements. All results given are for the case λ = 0.4 (cf. Table 4.6); degree 4 homogeneous polynomial kernel.

PCs discarded       0    50   100   150   200   250   300   350
error rate in %    8.7   5.4   4.9   4.4   4.2   3.9   3.7   3.9
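A variant of the earlier preprocessing sketch implementing this kind of dimensionality reduction: each pattern is mapped to the rescaled projections of (4.14), keeping only the directions of C_λ with the largest eigenvalues, i.e. discarding the directions with the smallest eigenvalues. The defaults are illustrative only:

    import numpy as np

    def invariant_features(X, tangents, lam=0.4, n_drop=300):
        """Map each pattern to D^{-1/2} S^T x (Eq. 4.14), discarding the n_drop
        directions of C_lambda with the smallest eigenvalues (cf. Table 4.7)."""
        T = np.asarray(tangents)
        C = T.T @ T / len(T)
        C_lam = (1.0 - lam) * C + lam * np.eye(C.shape[0])
        eigvals, S = np.linalg.eigh(C_lam)   # eigh returns eigenvalues in ascending order
        keep = slice(n_drop, None)           # drop the n_drop smallest-eigenvalue directions
        return (X @ S[:, keep]) / np.sqrt(eigvals[keep])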


The results in Table 4.7 are somewhat distorted by the fact that the polynomial kernel is not translation invariant, and performs poorly when none of the principal components are discarded. Hence this result should not be compared to the performance of the polynomial kernel on the data in [−1, 1] scaling. (Recall that we obtained 3.6% in that case, for λ = 0.4.) In practice, of course, we may choose the scaling of the data as we like, in which case it would seem pointless to use a method which is only applicable for a rather disadvantageous representation of the data. However, nothing prevents us from using a translation invariant kernel. We opted for a radial basis function kernel (2.27) with c = 0.5. On the [−1, 1] data, for λ = 0.4, this leads to the same performance as the degree 4 polynomial, 3.6% (without invariance preprocessing, i.e. for λ = 1, the performance is 3.9%). To get the identical system on [0, 1] data, the RBF width was rescaled accordingly, to c = 0.125. Table 4.8 shows that discarding principal components can further improve performance, up to 3.3%.

TABLE 4.8: Dropping directions corresponding to small eigenvalues of C, i.e. dropping less important principal components (cf. (4.14)), for the translation invariant RBF kernel (see text). All results given are for the case λ = 0.4 (cf. Table 4.6).

PCs discarded       0    50   100   150   200   250   300   350
error rate in %    3.6   3.6   3.6   3.5   3.5   3.4   3.3   3.6

4.4.3 Kernels Using Local Correlations

Character Recognition. As in Sec. 4.4.2, the present results were obtained on the small MNIST database (Appendix C). As a reference result, we use the degree 4 polynomial SV machine, performing at 4.0% error (Sec. 4.4.2). To exploit locality in images, we used the pyramidal receptive field kernel k_p^{d1,d2} with diameter p = 9 (cf. Sec. 4.3) and d1 · d2 = 4, i.e. degree 4 polynomial kernels which do not use all products of 4 pixels (Sec. 4.3). For d1 = d2 = 2, we obtained an improved error rate of 3.1%; another degree 4 kernel with only local correlations (d1 = 4, d2 = 1) led to 3.4% (Table 4.9).

Albeit better than the 4.0% for the degree 4 homogeneous polynomial, this is still worse than the Virtual SV result: generating Virtual SVs by image translations, the latter led to 2.8%. As the two methods, however, exploit different types of prior knowledge, it could be expected that combining them leads to still better performance; and indeed, this yielded the best performance of all (2.0%), halving the error rate of the original system.

For the purpose of benchmarking, we also ran our system on the USPS database. In that case, we obtained the following test error rates: SV with degree 4 polynomial kernel 4.2% (Table 2.4), Virtual SV (same kernel) 3.5%, SV with k_7^{2,2} 3.6% (for the smaller USPS images, we used a k_7 kernel rather than k_9), Virtual SV with k_7^{2,2} 3.0%. The latter compares favourably to almost all known results on that database, and is second only to a memory-based tangent-distance nearest neighbour classifier at 2.6% (Simard, LeCun, and Denker, 1993).


TABLE 4.9: Summary: error rates for various methods of incorporating prior knowledge, on the small MNIST database (Appendix C). In all cases, degree 4 polynomial kernels were used, either of the local type (Sec. 4.3) or (by default) of the complete polynomial type (2.26). In all cases, the regularization constant was set to 10 (cf. (2.19)).

Classifier                                                     Test Error / %
SV                                                                  4.0
Virtual SV (Sec. 4.2.1), with translations                          2.8
Invariant hyperplane (Sec. 4.2.2), λ = 0.4                          3.6
same, on first 100 principal components (Table 4.7)                 3.7
semi-local kernel k_9^{2,2} (Sec. 4.4.3)                            3.1
purely local kernel k_9^{4,1} (Sec. 4.4.3)                          3.4
Virtual SV with k_9^{2,2}                                           2.0

Object Recognition. The above results have been confirmed on the two object recognition databases used in Sec. 2.2.1 (cf. Appendix A). As in the case of the small MNIST database, we used k_9^{d1,d2}. In the present case, we chose d1 = d2 = 3, which yields a degree 9 (= 3 · 3) polynomial classifier which differs from a standard polynomial (2.26) in that it does not utilize all products of 9 pixels, but mainly local ones. Comparing the results to those obtained with standard polynomials of equal degree shows that this pre-selection of useful features significantly improves recognition results (Table 4.10).

As in the case of digit recognition, we combined this method with the Virtual SV method (Sec. 4.2.1). Based on the fact that prior knowledge about image locality is different from prior knowledge about invariances, we expected the possibility of further improvements. We used the same types of Virtual SVs as in Sec. 4.4.1. The results (Table 4.11) further improve upon Table 4.10, confirming the digit recognition findings reported above. In 4 of 6 cases, the resulting classifiers are better than those of Table 4.4. [Footnote 9: Note that in Table 4.4, the VSV method was used for degree 20 kernels, which on the object recognition tasks do far better than degree 9, cf. Table 2.1.]

4.5 Discussion

For Support Vector learning machines, invariances can readily be incorporated by generating virtual examples from the Support Vectors, rather than from the whole training set. The method yields a significant gain in classification accuracy at a moderate cost in time: it requires two training runs (rather than one), and it constructs classification rules utilizing more Support Vectors, thus slowing down classification speed (cf. (2.25)); in our case, both points amounted to a factor of about 2.


TABLE 4.10: Test error rates for two object recognition tasks, comparing kernels local in the image to complete polynomial kernels. Local kernels of degree 9 outperform complete polynomial kernels of corresponding degree. Moreover, they performed at least as well as the best polynomial classifier out of all degrees in {1, 3, 6, 9, 12, 15, 20, 25} (cf. Table 2.1).

kernel:               degree 9 polyn.   best polynomial   k_9^{3,3} (cf. Sec. 4.3)
entry level:
25 grey scale              13.9              13.0              12.0
89 grey scale               2.0               1.8               1.8
100 grey scale              3.5               2.4               2.0
25 silhouettes             16.7              15.4              15.0
89 silhouettes              2.7               2.2               2.1
100 silhouettes             4.8               4.0               3.9
animals:
25 grey scale              14.8              13.0              12.0
89 grey scale               2.5               1.7               1.6
100 grey scale              5.2               4.4               4.0
25 silhouettes             17.0              15.6              15.2
89 silhouettes              2.8               2.2               2.0
100 silhouettes             6.3               5.2               4.9

TABLE 4.11: Test error rates for two object recognition databases, using different types of approximate invariance transformations to generate Virtual SVs (as in Table 4.4), and local polynomial kernels k_9^{3,3} of degree 9 (cf. Sec. 4.3, Table 4.10, Table 4.4, and Table 2.1). The second training run in the Virtual SV systems was done on the original SVs and the generated Virtual SVs. The training sets with 25 and 89 views per object are regularly spaced; for them, mirroring does not provide additional information. For the non-regularly spaced 100-view-per-object sets, a combination of Virtual SVs from mirroring and rotation substantially improves accuracies on both databases.

database:                        animal               entry level
training set: views per object
Virtual SVs                   25    89   100       25    89   100
none (orig. system)          12.0   1.6   4.0     12.0   1.8   2.0
mirroring                    12.5   1.7   4.6     13.1   2.9   3.3
rotations & mirroring         8.8   1.0   1.4      8.5   1.2   1.6


Given that Support Vector machines are known to allow for short training times (Bottou et al., 1994), the first point is usually not critical. Certainly, training on virtual examples generated from the whole database would be significantly slower. To compensate for the second point, we used the reduced set method of Burges (1996) for increasing speed. This way, we obtained a system which was both fast and accurate.

As an alternative approach, we have built known invariances directly into the SVM objective function via the choice of a kernel. With its rather general class of admissible kernel functions, the SV algorithm provides ample possibilities for constructing task-specific kernels. We have considered two forms of domain knowledge: first, pattern classes were required to be locally translationally invariant, and second, local correlations in the images were assumed to be more reliable than long-range correlations. The second requirement can be seen as a more general form of prior knowledge; it can be thought of as arising partially from the fact that patterns possess a whole variety of transformations: in object recognition, for instance, we have object rotations and deformations. Mostly, these transformations are continuous, which implies that local relationships in an image are fairly stable, whereas global relationships are less reliable.

Both types of domain knowledge led to improvements on the considered pattern recognition tasks.

The method for constructing kernels for transformation invariant SV machines (invariant hyperplanes), put forward to deal with the first type of domain knowledge, has so far only been applied in the linear case, which probably explains why it only led to moderate improvements, especially when compared with the large gains achieved by the Virtual SV method. It is applicable to differentiable transformations; other types, e.g. mirror symmetry, have to be dealt with using other techniques, such as the Virtual Support Vector method. Its main advantages compared to the latter technique are that it does not slow down testing speed, and that using more invariances leaves training time almost unchanged. In addition, it is more attractive from a theoretical point of view, establishing a surprising connection to invariant feature extraction, preprocessing, and principal component analysis.

The proposed kernels respecting locality in images, on the other hand, led to large improvements; they are applicable not only in image classification but in all cases where the relative importance of subsets of product features can be specified appropriately. They do, however, slow down both training and testing by a constant factor which depends on the cost of evaluating the specific kernel used.

Clearly, SV machines have not yet been developed to their full potential, which could explain the fact that our highest accuracies are still slightly worse than the record on the MNIST database. However, SVMs present clear opportunities for further improvement. More invariances (for example, for the pattern recognition case, small rotations or varying ink thickness) could be incorporated, possibly combined with techniques for dealing with optimization problems involving very large numbers of SVs (Osuna, Freund, and Girosi, 1997). Further, one might use only those Virtual Support Vectors which provide new information about the decision boundary, or use a


measure of such information to keep only the most important vectors. Finally, if local kernels (Sec. 4.3) prove to be as useful on the full MNIST database as they were on the small version of it, accuracies could be substantially increased, albeit at a cost in classification speed.

We conclude this chapter by noting that all three described techniques should be directly applicable to other kernel-based methods such as SV regression (Vapnik, 1995b) and kernel PCA (Chapter 3). Future work will include the nonlinear Tangent Covariance Matrix (cf. our considerations in Sec. 4.2.2), the incorporation of invariances other than translation, and the construction of kernels incorporating local feature extractors (e.g. edge detectors) different from the pyramids described in Sec. 4.3.




Chapter 5
Conclusion

We believe that Support Vector machines and Kernel Principal Component Analysis are only the first examples of a series of potential applications of Mercer-kernel-based methods in learning theory. Any algorithm which can be formulated solely in terms of dot products can be made nonlinear by carrying it out in feature spaces induced by Mercer kernels. However, already the above two fields are large enough to render an exhaustive discussion in this thesis infeasible. Thus, we have tried to focus on some aspects of SV learning and Kernel PCA, hoping that we have succeeded in illustrating how nonlinear feature spaces can beneficially be used in complex learning tasks.

On the Support Vector side, we presented two chapters. Apart from a tutorial introduction to the theory of SV learning, the first one focused on empirical results related to the accuracy and the Support Vector sets of different SV classifiers. Considering three well-known classifier types which are included in the SV approach as special cases, we showed that they lead to similarly high accuracies and construct their decision surface from almost the same Support Vectors. Our first question raised in the Preface was which of the observations should be used to construct the decision boundary? Against the backdrop of our empirical findings, we can now take the position that the Support Vectors, if constructed in an appropriate nonlinear feature space, constitute such a subset of observations. The second SV chapter focused on algorithms and empirical results for the incorporation of prior knowledge in SV machines. We showed that this can be done both by modifying kernels and by generating virtual examples from the set of Support Vectors. In view of the high performances obtained, we can reinforce and generalize the above answer, to include also Virtual Support Vectors, and specialize it, saying that the appropriate feature space should be constructed using prior knowledge of the task at hand. Our best performing systems used both methods simultaneously: Virtual Support Vectors and kernels incorporating prior knowledge about the local structure of images.

On Kernel Principal Component Analysis, we presented one chapter, describing the algorithm and giving first experimental results on feature extraction for pattern recognition. We saw that features extracted in nonlinear feature spaces led to recognition performances much higher than those extracted in input space (i.e. with traditional PCA). This lends itself to an answer of the second question raised in the


Preface, which features should be extracted from each observation? From our present point of view, these should be nonlinear Kernel PCA features. As Kernel PCA operates in the same types of feature spaces as Support Vector machines, the choice of the kernel, and the design of kernels to incorporate prior knowledge, should also be of importance here. As the Kernel PCA method is very recent, however, these questions have not been thoroughly investigated yet. We hope that, given a few years' time, we will be in a position to specialize our answer to the second question exactly as was done for the first one.

We conclude with an outlook, revisiting the question of visual processing in biological systems. If the Support Vector set should prove to be a characteristic of the data largely independent of the type of learning machine used (which we have shown for three types of learning machines), one would hope that it could also be of relevance in biological learning. If a subset of observations characterizes a task rather than a particular algorithm's favourite examples, there is reason to hope that every system trying to solve this task, in particular animals, should make use of this subset in one way or another. Regarding Kernel PCA, it would be interesting to study the types of feature extractors that Kernel PCA constructs when performed on collections of images resembling those that animals are naturally exposed to. Comparing those with the ones found in neurophysiological studies could potentially assist us in trying to understand natural visual systems. If applied to the same data and similar tasks, optimal machine learning algorithms could be as fruitful to biological thinking as biological solutions can be to engineering.


Appendix A: Object Databases

In this section, we briefly describe three object recognition databases (chairs, entry level objects, and animals) generated at the Max-Planck-Institut für biologische Kybernetik (Liter et al., 1997). We start by describing the procedure for creating the databases, and then show some images of the resulting patterns. The training and test data was generated according to the following procedure (Blanz et al., 1996; Liter et al., 1997):

Database Generation

Snapshot Sampling. 25 different object models with uniform grey surface were rendered in perspective projection in front of a white background on a Silicon Graphics workstation using Inventor software. The initial images had a resolution of 256 x 256 pixels. In all viewing directions, the image plane orientation was such that the vertical axis of the object was projected in an upright orientation. Thus, each view of an object is fully characterised by two camera position angles, the elevation (0 degrees at the horizon, 90 degrees from the top) and the azimuth (in [0, 360) degrees, increasing clockwise when viewed from the top). Only views on the upper half of the viewing sphere were used, i.e. elevations in [0, 90] degrees. The directions of lighting and camera were chosen to coincide. For each database, we generated different training sets: two of them consisted of 25 and 89 equally spaced views of each object, respectively; the other one contained 100 random views per object (cf. Fig. A.1).[1] Thus, we obtained training sets of sizes 625, 2225 and 2500, respectively. The test set of size 2500 comprised 100 random views of each object, independent from the above sets.

Centering. The resulting grey level pictures were centered with respect to the center of mass of the binarized image. As the objects were shown on a white background, the binarized image separates figure from ground.

Edge Detection. Four one-dimensional differential operators (vertical, horizontal, and two diagonal ones) were applied to the images, followed by taking the modulus.

[1] In one case, we also generated a set with 400 random views per object.


Downsampling. In all five resulting images, the resolution was reduced to 16 x 16, leading to five images r0 ... r4. In this representation, each view requires 5 x 16 x 16 = 1280 pixels. Containing edge detection data, the parts r1 ... r4 already provide useful features for recognition algorithms. To study the ability of an algorithm to extract features by itself, one can alternatively use only the actual image part r0 of the data, and thus train on the 256-dimensional downsampled images rather than on the 1280-dimensional inputs. In our experiments, we used both variants of the databases.

Standardization. On the chair database, the standard deviation of the 16 x 16 images with pixel values in [0, 1] was around 30 (measured on the training sets). We rescaled all databases, separately for each part r0 ... r4, such that each part separately gives rise to training sets with standard deviation 30. This hardly affects the r0 part; however, it does change the edge detection parts r1, ..., r4. In the resulting 5 x 256-dimensional representation, the different parts arising from edge detection, or just downsampling, have comparable scaling.

Pixel Rescaling. Before we ran the algorithms on the databases, each pixel value x was rescaled according to x -> 2x - 1. Thus, the background level was -1, and maximal intensities were about 1.

Databases

Using the above procedure, three object recognition databases were generated.

MPI Chair Database. The first object recognition database contains 25 different chairs (figures A.2, A.3, A.4). For benchmarking purposes, the downsampled views are available via ftp://ftp.mpik-tueb.mpg.de/pub/chair dataset/. As all 25 objects belong to the same object category, recognition of chairs in the database is a subordinate level task.

MPI Entry Level Database. The entry level database contains 25 objects (figures A.5, A.6, A.7), for which psychophysical evidence suggests that they belong to different entry levels in object recognition (cf. Sec. 2.2.1).

MPI Animal Database. The animal database contains 25 different animals (figures A.8, A.9, A.10). Note that some of these animals are also contained in the entry level database (Fig. A.5).
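The following is a rough sketch of the per-view preprocessing steps described above (downsampling to 16 x 16, per-part standardization to a fixed standard deviation, and the pixel rescaling x -> 2x - 1). The block-averaging downsampler and the target_std value follow the text literally but are assumptions about the implementation, not the original MPI code.

```python
import numpy as np

def downsample(img, out_size=16):
    """Block-average a square image down to out_size x out_size."""
    f = img.shape[0] // out_size
    img = img[: out_size * f, : out_size * f]
    return img.reshape(out_size, f, out_size, f).mean(axis=(1, 3))

def standardize(part, target_std=30.0):
    """Rescale one part (r0 or an edge-detection channel) to a given standard deviation."""
    return part * (target_std / part.std())

def rescale_pixels(x):
    """Map pixel values from [0, 1] to [-1, 1], as in the Pixel Rescaling step."""
    return 2.0 * x - 1.0

view = np.random.rand(256, 256)        # stand-in for one rendered 256x256 grey-level view
r0 = rescale_pixels(downsample(view))  # the "image part" r0 of the representation
print(r0.shape)                        # (16, 16)
```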


Appendix C: Handwritten Character Databases

US Postal Service Database. The US Postal Service (USPS) database (see Fig. C.1) contains 9298 handwritten digits (7291 for training, 2007 for testing), collected from mail envelopes in Buffalo (cf. LeCun et al., 1989). Each digit is a 16 x 16 image, represented as a 256-dimensional vector with entries between -1 and 1. Preprocessing consisted of smoothing with a Gaussian kernel of width sigma = 0.75.

It is known that the USPS test set is rather difficult: the human error rate is 2.5% (Bromley and Säckinger, 1991). For a discussion, see Simard, LeCun, and Denker (1993). Note, moreover, that some of the results reported in the literature for the USPS set have been obtained with an enhanced training set: for instance, Drucker, Schapire, and Simard (1993) used an enlarged training set of size 9709, containing some additional machine-printed digits, and note that this improves accuracies. In our experiments, only 7291 training examples were used.

MNIST Database. The MNIST database (Fig. C.2) contains 120000 handwritten digits, equally divided into training and test set. The database is a modified version of NIST Special Database 3 and NIST Test Data 1. Training and test set consist of patterns generated by different writers. The images were first size normalized to fit into a 20 x 20 pixel box, and then centered in a 28 x 28 image (Bottou et al., 1994).

Test results on the MNIST database which are given in the literature (e.g. Bottou et al., 1994; LeCun et al., 1995) for some reason do not use the full MNIST test set of 60000 characters. Instead, a subset of 10000 characters is used, consisting of the test set patterns from 24476 to 34475. To obtain results which can be compared to the literature, we also use this test set, although the larger one is preferable from the point of view of obtaining more reliable test error estimates.

Small MNIST Database. The USPS database has been criticised (Burges, LeCun, private communication; Bottou et al. (1994)) as not providing the most adequate classifier benchmark. First, it only comes with a small test set, and second, the test set contains a number of corrupted patterns which not even humans can classify correctly. The MNIST database, which is the currently used classifier benchmark in the AT&T and Bell Labs learning research groups, does not have these drawbacks; moreover, its


Appendix D: Technical Addenda

D.1 Feature Space and Kernels

In this section, we collect some material related to Mercer kernels and the corresponding feature spaces. If not stated otherwise, we assume that $k$ is a Mercer kernel (cf. Proposition 1.3.2), and $\Phi$ is the corresponding map into a feature space $F$ such that $k(x,y) = (\Phi(x)\cdot\Phi(y))$.

D.1.1 The Reduced Set Method

Given a vector $\Psi \in F$, written in terms of images of input patterns,
$\Psi = \sum_{i=1}^{\ell} \alpha_i \Phi(x_i)$,   (D.1)
with $\alpha_i \in \mathbb{R}$, one can try to approximate it by
$\Psi' = \sum_{i=1}^{N_z} \beta_i \Phi(z_i)$,   (D.2)
with $N_z \ll \ell$, $\beta_i \in \mathbb{R}$. To this end, we have to minimize
$\rho = \|\Psi - \Psi'\|^2$.   (D.3)
The crucial point is that even if $\Phi$ is not given explicitly, $\rho$ can be computed (and minimized) in terms of kernels, using $(\Phi(x)\cdot\Phi(y)) = k(x,y)$ (Burges, 1996). In Sec. 4.4.1, this method is used to approximate Support Vector decision boundaries in order to speed up classification.

D.1.2 Inverting the Map $\Phi$

If $\Phi$ is nonlinear, the dimension of the linear span of the $\Phi$-images of a set of input vectors $\{x_1,\ldots,x_\ell\}$ can exceed the dimension of their span in input space. Thus,


we need not expect that there is a pre-image under $\Phi$ for each vector that can be expressed as a linear combination of the vectors $\Phi(x_1),\ldots,\Phi(x_\ell)$. Nevertheless, it might be desirable to have a means of constructing the pre-image in the case where it does exist.

To this end, suppose we have a vector in $F$ given in terms of an expansion of images of input data, with an unknown pre-image $x_0$ under $\Phi$ in input space $\mathbb{R}^N$, i.e.
$\Phi(x_0) = \sum_{j=1}^{\ell} \alpha_j \Phi(x_j)$.   (D.4)
Then, for any $x \in \mathbb{R}^N$,
$k(x_0, x) = \sum_{j=1}^{\ell} \alpha_j k(x_j, x)$.   (D.5)
Assume moreover that the kernel $k(x,y)$ is an invertible function $f_k$ of $(x\cdot y)$,
$k(x,y) = f_k((x\cdot y))$,   (D.6)
e.g. $k(x,y) = (x\cdot y)^d$ with odd $d$, or $k(x,y) = \sigma((x\cdot y) + \Theta)$ with a strictly monotonic sigmoid function $\sigma$ and a threshold $\Theta$. Given any a priori chosen basis of input space $\{e_1,\ldots,e_N\}$, we can then expand $x_0$ as
$x_0 = \sum_{i=1}^{N} (x_0 \cdot e_i)\, e_i = \sum_{i=1}^{N} f_k^{-1}(k(x_0, e_i))\, e_i = \sum_{i=1}^{N} f_k^{-1}\!\Big(\sum_{j=1}^{\ell} \alpha_j k(x_j, e_i)\Big)\, e_i$.   (D.7)
By using (D.5), we thus reconstructed $x_0$ from the values of dot products between images (in $F$) of training examples and basis elements.

Clearly, a crucial assumption in this construction was the existence of the pre-image $x_0$. If this does not hold, then the discrepancy
$\sum_{j=1}^{\ell} \alpha_j \Phi(x_j) - \Phi(x_0)$   (D.8)
will be nonzero. There is a number of things that we could do to make the discrepancy small:

(a) We can try to find a suitable basis in which we expand the pre-images.

(b) We can repeat the scheme, by trying to find a pre-image for the discrepancy vector. This problem has precisely the same structure as the original one (D.4), with


one more term in the summation on the right hand side. Iterating this method gives an expansion of the vector in $F$ in terms of reconstructed approximate pre-images.

(c) We have the freedom to choose the scaling of the vector in $F$. To see this, note that for any nonzero $\gamma$, we have, similar to (D.7),
$x_0 = \sum_{i=1}^{N} \gamma\, (x_0 \cdot e_i/\gamma)\, e_i = \sum_{i=1}^{N} \gamma\, f_k^{-1}\!\Big(\sum_{j=1}^{\ell} \alpha_j k(x_j, e_i/\gamma)\Big)\, e_i$.   (D.9)

(d) Related to this scaling issue, we could also have started with
$\lambda\,\Phi(x_0) = \sum_{j=1}^{\ell} \alpha_j \Phi(x_j)$,   (D.10)
obtaining a reconstruction (cf. (D.7))
$x_0 = \sum_{i=1}^{N} f_k^{-1}\!\Big(\sum_{j=1}^{\ell} \frac{\alpha_j}{\lambda} k(x_j, e_i)\Big)\, e_i$   (D.11)
with the property that (D.10) holds if such an $x_0$ exists.

The success of using different values of $\gamma$ or $\lambda$ could be monitored by computing the squared norm of the discrepancy,
$\Big\|\sum_{j=1}^{\ell} \alpha_j \Phi(x_j) - \lambda\,\Phi(x_0)\Big\|^2$,   (D.12)
which can be evaluated in terms of the kernel function.

Finally, we note that the same approach can also be applied for more general kernel functions which cannot be written as an invertible function of $(x\cdot y)$. All we need is a kernel which allows the reconstruction of $(x\cdot y)$, and nothing prevents us from requiring the evaluation of the kernel on several pairs of points for this purpose. Consider the following example: assume that
$k(x,y) = f_k\big(\|x-y\|^2\big)$   (D.13)
with an invertible $f_k$ (e.g., if $k$ is a Gaussian RBF function, cf. (1.28)). Then, by the polarization identity, we have
$(x_0 \cdot e_i) = \tfrac{1}{4}\big(\|x_0+e_i\|^2 - \|x_0-e_i\|^2\big) = \tfrac{1}{4}\big(f_k^{-1}(k(x_0,-e_i)) - f_k^{-1}(k(x_0,e_i))\big)$.   (D.14)
The same also works if $k(x,y) = f_k(\|x-y\|)$, e.g.: we just have to raise the results of $f_k^{-1}$ to the power of 2. Similar methods can be applied to deal with other kernels.
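As a concrete illustration of the point made just above, here is a hedged sketch showing that the squared discrepancy of (D.12) can be evaluated purely through kernel calls, without ever forming the feature map explicitly. The Gaussian RBF kernel and the toy data are illustrative choices, not part of the thesis.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def discrepancy_sq(alphas, X, lam, x0, k=rbf_kernel):
    """|| sum_j alpha_j Phi(x_j) - lam * Phi(x0) ||^2 expanded into kernel terms."""
    first = sum(a_i * a_j * k(xi, xj)
                for a_i, xi in zip(alphas, X)
                for a_j, xj in zip(alphas, X))
    cross = sum(a_j * k(xj, x0) for a_j, xj in zip(alphas, X))
    return first - 2.0 * lam * cross + lam ** 2 * k(x0, x0)

X = [np.array([0.0, 1.0]), np.array([1.0, 0.5])]
alphas = [0.7, 0.3]
print(discrepancy_sq(alphas, X, lam=1.0, x0=np.array([0.5, 0.8])))
```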


D.1.3 Mercer Kernels

In this section, we give some further material related to Sec. 1.3. First, we mention that if a finite number of Eigenvalues is negative, the expansion (1.25) is still valid. In that case, $k$ corresponds to a Lorentzian symmetric bilinear form in a space with indefinite signature. For the SV algorithm, this would entail problems, as the optimization problem would become indefinite. The diagonalization required for kernel PCA, however, can still be performed, and (3.16) can be modified such that it allows for negative Eigenvalues. The main difference is that we can no longer interpret the method as PCA in some feature space. Nevertheless, it could still be viewed as a type of nonlinear factor analysis.

Next, we note that the polynomial kernels given in (1.17) satisfy Mercer's conditions of Proposition 1.3.2. As compositions of continuous functions, they are continuous, thus we only need to show positivity, which follows immediately if we consider their dot product representation
$(x\cdot y)^d = \sum_{i=1}^{N_F} (\Phi_d(x))_i\, (\Phi_d(y))_i$.   (D.15)
Namely, more generally, if an integral operator kernel $k$ admits a uniformly convergent dot product representation on some compact set $C \times C$,
$\sum_{i=1}^{\infty} \phi_i(x)\,\phi_i(y)$,   (D.16)
it is necessarily positive: for $f \in L^2(C)$, we have
$\int_{C\times C} \Big(\sum_{i=1}^{\infty} \phi_i(x)\phi_i(y)\Big) f(x) f(y)\, dx\, dy = \sum_{i=1}^{\infty} \int_{C\times C} \phi_i(x) f(x)\, \phi_i(y) f(y)\, dx\, dy$   (D.17)
$= \sum_{i=1}^{\infty} \Big(\int_C \phi_i(x) f(x)\, dx\Big)^2 \ge 0$,   (D.18)
establishing the converse of Proposition 1.3.2.

We conclude this section with some considerations on Proposition 1.3.3. Is it possible to give a more general class of kernels, such that the expansion (1.25) is no longer valid, but the mapping of Proposition 1.3.3 can still be constructed? One would expect that if $k$ does not correspond to a compact operator (as it did in the case of Mercer kernels, cf. Dunford and Schwartz (1963); in fact, in the Mercer case, we even have trace class operators, cf. Nashed and Wahba (1974)), with a discrete spectrum, then the mapping (1.26) should no longer map into an $l_2$ space, but into some separable Hilbert space of functions on a non-discrete measure space.
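A quick numerical illustration of the positivity argument above: for the polynomial kernel $k(x,y) = (x\cdot y)^d$, the Gram matrix on any finite sample is positive semi-definite, so all its eigenvalues are non-negative up to round-off. The data and the degree below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))            # 20 arbitrary points in R^5
d = 3
K = (X @ X.T) ** d                      # Gram matrix of the degree-d polynomial kernel
eigvals = np.linalg.eigvalsh(K)         # eigenvalues of the symmetric matrix K
print(eigvals.min() >= -1e-8)           # True: no (significantly) negative eigenvalues
```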


To this end, let $x \mapsto f_x$ be a map from input space into some Hilbert space $H$,   (D.19)
and $T \ge 0$ be a positive bounded operator on $H$. Moreover, define a kernel
$k_T(x,y) := (f_x \cdot T f_y)$.   (D.20)
Then
$\Phi: x \mapsto \sqrt{T}\, f_x$   (D.21)
clearly is a map such that
$k_T(x,y) = (\Phi(x)\cdot\Phi(y))$.   (D.22)
As an aside, to see the connection to Mercer's theorem, we may formally set $f_x$ to be $\delta_x$, and assume that $T$ is an integral operator with kernel $k$. In this case, the right hand side of (D.20) would equal $k(x,y)$.

The connection to (1.26) becomes clearer if we use the spectral representation of $T$, and construct a $\Phi$ different from the one in (D.21): $T$ can be written as
$T = U^* M_v U$,   (D.23)
where $v$ is a continuous function with corresponding multiplication operator $M_v$, $U$ is a unitary operator
$U: H \to L^2(\mathbb{R}, \mu)$,   (D.24)
and $\mu$ is a probability measure (the spectral measure of $T$) (e.g. Reed and Simon, 1980). Since $T \ge 0$, we have $M_v \ge 0$ and $v \ge 0$. Then, for all $x$ and $y$,
$k_T(x,y) = (f_x \cdot U^* M_v U f_y)$   (D.25)
$= (U f_x \cdot M_v U f_y)$   (D.26)
$= (\sqrt{M_v}\, U f_x \cdot \sqrt{M_v}\, U f_y)$   (D.27)
$= (M_{\sqrt{v}}\, U f_x \cdot M_{\sqrt{v}}\, U f_y)$   (D.28)
$= (\Phi(x)\cdot\Phi(y))$,   (D.29)
defining
$\Phi: \mathbb{R}^N \to L^2(\mathbb{R}, \mu)$,   (D.30)
$\Phi(x) = M_{\sqrt{v}}\, U f_x$.   (D.31)


To see the relationship to (1.26), it should be noted that the spectrum of $T$ coincides with the essential range of $v$. For simplicity, we have above made the assumption that $T$ is bounded. The same argument, however, also works in the case of unbounded $T$ (e.g. Reed and Simon, 1980). For the purpose of practical applications, we are interested in maps $\Phi$ and operators $T \ge 0$ such that the kernel $k$ defined by (D.20) can be computed analytically. Without going into detail, we briefly mention an example of a map $\Phi$. Define
$\Phi: x \mapsto k(x,\cdot)$,   (D.32)
where $k$ is some a priori specified kernel, and $T = P^*P$, with a regularization operator $P$ (Tikhonov and Arsenin, 1977). Then
$k_T(x,y) = ((Pk)(x,\cdot)\cdot(Pk)(y,\cdot))$   (D.33)
coincides with a dot product matrix arising in a kernel-based regularization framework for learning problems (Smola and Schölkopf, 1997b). If $k$ is chosen as Green's function of $P^*P$, then $k_T$ and $k$ can be shown to coincide, and the regularization approach is equivalent to the SV approach (Smola, Schölkopf, and Müller, 1997).

D.1.4 Polynomial Kernels and Higher Order Correlations

Consider the mappings corresponding to kernels of the form (1.20): suppose the monomials $x_{i_1} x_{i_2} \cdots x_{i_d}$ are written such that $i_1 \le i_2 \le \cdots \le i_d$. Then the coefficients (as the $\sqrt{2}$ in (1.21)), arising from the fact that different combinations of indices occur with different frequencies, are largest for $i_1 < i_2 < \cdots < i_d$ (let us assume here that the input dimensionality is not smaller than the polynomial degree $d$): in that case, we have a coefficient of $\sqrt{d!}$. If $i_1 = i_2$, say, the coefficient will be $\sqrt{(d-1)!}$. In general, if $n$ of the $x_i$ are equal, and the remaining ones are different, then the coefficient in the corresponding component of $\Phi$ is $\sqrt{(d-n+1)!}$. Thus, the terms belonging to the $d$-th order correlations will be weighted with an extra factor $\sqrt{d!}$ compared to the terms $x_i^d$, and compared to the terms where only $d-1$ different components occur, they are still weighted stronger by $\sqrt{d}$. Consequently, kernel PCA with polynomial kernels will tend to pick up variance in the $d$-th order correlations mainly.

D.2 Kernel Principal Component Analysis

D.2.1 The Eigenvalue Problem in the Space of Expansion Coefficients

We presently give a justification for solving (3.14) rather than (3.13) in computing the Eigensystem of the covariance matrix in $F$ (cf. Sec. 3.2).


Being symmetric, $K$ has an orthonormal basis of Eigenvectors $(\beta_i)_i$ with corresponding Eigenvalues $\mu_i$, thus for all $i$, we have $K\beta_i = \mu_i\beta_i$ ($i = 1,\ldots,M$). To understand the relation between (3.13) and (3.14), we proceed as follows: first suppose $\lambda, \alpha$ satisfy (3.13). We may expand $\alpha$ in $K$'s Eigenvector basis as
$\alpha = \sum_{i=1}^{M} a_i \beta_i$.   (D.34)
Equation (3.13) then reads
$M\lambda \sum_i a_i \mu_i \beta_i = \sum_i a_i \mu_i^2 \beta_i$,   (D.35)
or, equivalently, for all $i = 1,\ldots,M$,
$M\lambda\, a_i \mu_i = a_i \mu_i^2$.   (D.36)
This in turn means that for all $i = 1,\ldots,M$,
$M\lambda = \mu_i$ or $a_i = 0$ or $\mu_i = 0$.   (D.37)
Note that the above are not exclusive or-s. We next assume that $\lambda, \alpha$ satisfy (3.14). In that case, we find that (3.14) is equivalent to
$M\lambda \sum_i a_i \beta_i = \sum_i a_i \mu_i \beta_i$,   (D.38)
i.e. for all $i = 1,\ldots,M$,
$M\lambda = \mu_i$ or $a_i = 0$.   (D.39)
Comparing (D.37) and (D.39), we see that all solutions of the latter satisfy the former. However, they do not give its full set of solutions: given a solution of (3.14), we may always add multiples of Eigenvectors of $K$ with Eigenvalue $\mu_i = 0$ and still satisfy (3.13), with the same Eigenvalue.[1] Note that this means that there exist solutions of (3.13) which belong to different Eigenvalues yet are not orthogonal in the space of the $\alpha_k$ (for instance, take any two Eigenvectors with different Eigenvalues, and add a multiple of the same Eigenvector with Eigenvalue 0 to both of them). This, however, does not mean that the Eigenvectors of $\bar C$ in $F$ are not orthogonal. Indeed, note that if $\alpha$ is an Eigenvector of $K$ with Eigenvalue 0, then the corresponding vector $\sum_i \alpha_i \Phi(x_i)$ is orthogonal to all vectors in the span of the $\Phi(x_j)$ in $F$, since
$\big(\Phi(x_j) \cdot \sum_i \alpha_i \Phi(x_i)\big) = (K\alpha)_j = 0$ for all $j$,   (D.40)
which means that $\sum_i \alpha_i \Phi(x_i) = 0$. Thus, the above difference between the solutions of (3.13) and (3.14) is not relevant, since we are interested in vectors in $F$ rather than vectors in the space of the expansion coefficients of (3.10). We therefore only need to diagonalize $K$ in order to find all relevant solutions of (3.13).

Note, finally, that the rank of $K$ determines the dimensionality of the span of the $\Phi(x_j)$ in $F$, i.e. of the subspace that we are working in.

[1] This observation could be used to change the vectors $\alpha$ of the solution, e.g. to make them maximally sparse, without changing the solution.


D.2.2 Centering in Feature Space

In Sec. 3.2, we made the assumption that our mapped data is centered in $F$, i.e.
$\sum_{n=1}^{M} \Phi(x_n) = 0$.   (D.41)
We shall now drop this assumption. First note that given any $\Phi$ and any set of observations $x_1,\ldots,x_M$, the points
$\tilde\Phi(x_i) := \Phi(x_i) - \frac{1}{M}\sum_{i=1}^{M}\Phi(x_i)$   (D.42)
are centered. Thus, the assumptions of Sec. 3.2 now hold, and we go on to define the covariance matrix and $\tilde K_{ij} = (\tilde\Phi(x_i)\cdot\tilde\Phi(x_j))$ in $F$. We arrive at our already familiar Eigenvalue problem
$\tilde\lambda\tilde\alpha = \tilde K\tilde\alpha$,   (D.43)
with $\tilde\alpha$ being the expansion coefficients of an Eigenvector (in $F$) in terms of the points (D.42),
$\tilde V = \sum_{i=1}^{M} \tilde\alpha_i\,\tilde\Phi(x_i)$.   (D.44)
We cannot compute $\tilde K$ directly; however, we can express it in terms of its non-centered counterpart $K$. In the following, we shall use $K_{ij} = (\Phi(x_i)\cdot\Phi(x_j))$; in addition, we shall make use of the notation $1_{ij} = 1$ for all $i, j$.
$\tilde K_{ij} = (\tilde\Phi(x_i)\cdot\tilde\Phi(x_j))$   (D.45)
$= \Big(\big(\Phi(x_i) - \tfrac{1}{M}\sum_{m=1}^{M}\Phi(x_m)\big) \cdot \big(\Phi(x_j) - \tfrac{1}{M}\sum_{n=1}^{M}\Phi(x_n)\big)\Big)$
$= (\Phi(x_i)\cdot\Phi(x_j)) - \tfrac{1}{M}\sum_{m=1}^{M}(\Phi(x_m)\cdot\Phi(x_j)) - \tfrac{1}{M}\sum_{n=1}^{M}(\Phi(x_i)\cdot\Phi(x_n)) + \tfrac{1}{M^2}\sum_{m,n=1}^{M}(\Phi(x_m)\cdot\Phi(x_n))$
$= K_{ij} - \tfrac{1}{M}\sum_{m=1}^{M} 1_{im}K_{mj} - \tfrac{1}{M}\sum_{n=1}^{M} K_{in}1_{nj} + \tfrac{1}{M^2}\sum_{m,n=1}^{M} 1_{im}K_{mn}1_{nj}$
Using the matrix $(1_M)_{ij} := 1/M$, we get the more compact expression
$\tilde K = K - 1_M K - K 1_M + 1_M K 1_M$.   (D.46)
We thus can compute $\tilde K$ from $K$, and then solve the Eigenvalue problem (D.43). As in (3.16), the solutions $\tilde\alpha^k$ are normalized by normalizing the corresponding vectors $\tilde V^k$ in $F$, which translates into
$\tilde\lambda_k\,(\tilde\alpha^k\cdot\tilde\alpha^k) = 1$.   (D.47)
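The following minimal kernel PCA sketch follows (D.43)-(D.47): it builds an uncentered kernel matrix, centers it via (D.46), diagonalizes the result, and normalizes the expansion coefficients so that the condition (D.47) holds. The RBF kernel and the toy data are illustrative assumptions, not the thesis code.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
gamma = 0.2

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)                               # uncentered kernel matrix

M = K.shape[0]
one_M = np.full((M, M), 1.0 / M)
K_tilde = K - one_M @ K - K @ one_M + one_M @ K @ one_M     # centering, cf. (D.46)

lambdas, alphas = np.linalg.eigh(K_tilde)                   # ascending eigenvalues, columns = eigenvectors
lambdas, alphas = lambdas[::-1], alphas[:, ::-1]            # sort by decreasing eigenvalue
keep = lambdas > 1e-10
alphas = alphas[:, keep] / np.sqrt(lambdas[keep])           # normalization, cf. (D.47)

features = K_tilde @ alphas                                 # projections of the centered training points
print(features.shape)
```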


For feature extraction, we compute projections of centered $\Phi$-images of test patterns $t$ onto the Eigenvectors of the covariance matrix of the centered points,
$(\tilde V^k \cdot \Phi(t)) = \sum_{i=1}^{M} \tilde\alpha_i^k\,(\tilde\Phi(x_i)\cdot\tilde\Phi(t))$.   (D.48)
Consider a set of test points $t_1,\ldots,t_L$, and define two $L\times M$ matrices by
$K^{test}_{ij} = (\Phi(t_i)\cdot\Phi(x_j))$   (D.49)
and
$\tilde K^{test}_{ij} = \Big(\big(\Phi(t_i) - \tfrac{1}{M}\sum_{m=1}^{M}\Phi(x_m)\big) \cdot \big(\Phi(x_j) - \tfrac{1}{M}\sum_{n=1}^{M}\Phi(x_n)\big)\Big)$.   (D.50)
Similar to (D.45), we can express $\tilde K^{test}$ in terms of $K^{test}$, and arrive at
$\tilde K^{test} = K^{test} - 1'_M K - K^{test} 1_M + 1'_M K 1_M$,   (D.51)
where $1'_M$ is the $L\times M$ matrix with all entries equal to $1/M$. As the test points can be chosen arbitrarily, we have thus in effect computed a centered version not only of the dot product matrix, but also of the kernel itself.

D.3 On the Tangent Covariance Matrix

In this section, we give an alternative derivation of (4.10), obtained by modifying the analysis of Sec. 2.1.2 (Vapnik, 1998). There, we had to maximize (2.7) subject to (2.6). When we want to construct invariant hyperplanes, the situation is slightly different. We do not only want to separate the training data, but we want to separate it in a way such that submitting a pattern to a transformation of an a priori specified Lie group will not alter its class assignment. This can be achieved by enforcing that the classification boundary be such that group actions move patterns parallel to the decision boundary, rather than across it. A local statement of this property is the requirement that the Lie derivatives should be orthogonal to the normal $w$ which determines the separating hyperplane. Thus we modify (2.7) by adding a second term enforcing invariance:
$\tau(w) = \frac{1}{2}\Big((1-\lambda)\,\frac{1}{\ell}\sum_{i=1}^{\ell}\Big(w\cdot\frac{\partial}{\partial t}\Big|_{t=0}\mathcal{L}_t z_i\Big)^2 + \lambda\|w\|^2\Big)$   (D.52)
For $\lambda = 1$, we recover the original objective function; for values $1 > \lambda \ge 0$, different amounts of importance are assigned to invariance with respect to the Lie group of transformations $\mathcal{L}_t$. The above sum can be rewritten as
$\frac{1}{\ell}\sum_{i=1}^{\ell}\Big(w\cdot\frac{\partial}{\partial t}\Big|_{t=0}\mathcal{L}_t z_i\Big)^2 = \frac{1}{\ell}\sum_{i=1}^{\ell}\Big(w\cdot\frac{\partial}{\partial t}\Big|_{t=0}\mathcal{L}_t z_i\Big)\Big(\frac{\partial}{\partial t}\Big|_{t=0}\mathcal{L}_t z_i\cdot w\Big) = (w\cdot Cw)$,   (D.53)


where the matrix $C$ is defined as in (4.6),
$C := \frac{1}{\ell}\sum_{i=1}^{\ell}\Big(\frac{\partial}{\partial t}\Big|_{t=0}\mathcal{L}_t z_i\Big)\Big(\frac{\partial}{\partial t}\Big|_{t=0}\mathcal{L}_t z_i\Big)^{\!\top}$   (D.54)
(if we want to use more than one derivative operator, we also sum over these; in that case, we may want to orthonormalize the derivatives for each observation $z_i$). To solve the optimization problem, one introduces a Lagrangian
$L(w, b, \alpha) = \frac{1}{2}\big((1-\lambda)(w\cdot Cw) + \lambda\|w\|^2\big) - \sum_{i=1}^{\ell}\alpha_i\big(y_i((z_i\cdot w) + b) - 1\big)$   (D.55)
with Lagrange multipliers $\alpha_i$. At the point of the solution, the gradient of $L$ with respect to $w$ must vanish:
$(1-\lambda)Cw + \lambda w - \sum_{i=1}^{\ell}\alpha_i y_i z_i = 0$   (D.56)
As the left hand side of (D.53) is non-negative for any $w$, $C$ is a positive (not necessarily definite) matrix. It follows that for
$C_\lambda := (1-\lambda)C + \lambda I$   (D.57)
to be invertible ($I$ denoting the identity), $\lambda > 0$ is a sufficient condition. In that case, we get the following expansion for the solution vector:
$w = \sum_{i=1}^{\ell}\alpha_i y_i C_\lambda^{-1} z_i$   (D.58)
Together with (2.3), (D.58) yields the decision function
$f(z) = \mathrm{sgn}\Big(\sum_{i=1}^{\ell}\alpha_i y_i (z\cdot C_\lambda^{-1} z_i) + b\Big)$.   (D.59)
Substituting (D.58), and the fact that at the point of the solution, the partial derivative of $L$ with respect to $b$ must vanish ($\sum_{i=1}^{\ell}\alpha_i y_i = 0$), into the Lagrangian (D.55), we get
$W(\alpha) = \frac{1}{2}\sum_{i=1}^{\ell}\alpha_i y_i z_i^{\top}\big(C_\lambda^{-1}\big)^{\top}\big(C_\lambda C_\lambda^{-1}\big)\sum_{j=1}^{\ell}\alpha_j y_j z_j - \sum_{i=1}^{\ell}\alpha_i y_i z_i C_\lambda^{-1}\sum_{j=1}^{\ell}\alpha_j y_j z_j + \sum_{i=1}^{\ell}\alpha_i$.   (D.60)
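Below is a hedged illustration of the tangent covariance matrix of (D.54) and its regularized version $C_\lambda = (1-\lambda)C + \lambda I$ from (D.57). The Lie derivatives are approximated here by finite differences under a one-pixel horizontal shift of toy images; the transformation, the data, and the value of lambda are assumptions for illustration only, not the construction used in the thesis experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.random((100, 16, 16))                        # 100 toy 16x16 patterns z_i

def tangent(z):
    """Finite-difference approximation of the derivative of a horizontal shift at t = 0."""
    return (np.roll(z, 1, axis=1) - z).ravel()

T = np.stack([tangent(z) for z in Z])                # one tangent vector per pattern
C = (T.T @ T) / len(Z)                               # C = (1/l) * sum_i t_i t_i^T, cf. (D.54)

lam = 0.1
C_lam = (1.0 - lam) * C + lam * np.eye(C.shape[0])   # cf. (D.57): invertible for lambda > 0
print(np.linalg.eigvalsh(C_lam).min() > 0)           # True: all eigenvalues strictly positive
```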


By virtue of the fact that $C_\lambda$, and thus also $C_\lambda^{-1}$, is symmetric, the dual form of the optimization problem takes the following form: maximize
$W(\alpha) = \sum_{i=1}^{\ell}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell}\alpha_i\alpha_j y_i y_j (z_i\cdot C_\lambda^{-1} z_j)$   (D.61)
subject to (2.14) and (2.15). The same derivation can be carried out for the nonseparable case, leading to the corresponding result with modified constraints (2.22) and (2.23) (cf. Sec. 2.1.3).

We conclude by generalizing to the nonlinear case. As in Sec. 2.1.4, we now think of the patterns $z_i$ no longer as living in input space, but as patterns in some feature space $F$ related to input space by a nonlinear map
$\Phi: \mathbb{R}^N \to F$   (D.62)
$x_i \mapsto z_i = \Phi(x_i)$.   (D.63)
Unfortunately, (D.59) and (D.61) are not simply written in terms of dot products between images of input patterns under $\Phi$. Hence, substituting kernel functions for dot products will not do. Note, moreover, that $C_\lambda$ now is an operator in a possibly infinite-dimensional space, with $C$ being defined as in (4.15). We cannot compute it explicitly, but we can nevertheless compute (D.59) and (D.61), which is all we need. First note that for all $x, y \in \mathbb{R}^N$,
$(\Phi(x)\cdot C_\lambda^{-1}\Phi(y)) = (C_\lambda^{-\frac{1}{2}}\Phi(x)\cdot C_\lambda^{-\frac{1}{2}}\Phi(y))$,   (D.64)
with $C_\lambda^{-\frac{1}{2}}$ being the positive square root of $C_\lambda^{-1}$. At this point, methods similar to kernel PCA come to our rescue. As $C_\lambda$ is symmetric, we may diagonalize it as
$C_\lambda = S D S^{\top}$,   (D.65)
hence
$C_\lambda^{-\frac{1}{2}} = S D^{-\frac{1}{2}} S^{\top}$.   (D.66)
Substituting (D.66) into (D.64), and using the fact that $S$ is unitary, we obtain
$(\Phi(x)\cdot C_\lambda^{-1}\Phi(y)) = (S D^{-\frac{1}{2}} S^{\top}\Phi(x)\cdot S D^{-\frac{1}{2}} S^{\top}\Phi(y))$   (D.67)
$= (D^{-\frac{1}{2}} S^{\top}\Phi(x)\cdot D^{-\frac{1}{2}} S^{\top}\Phi(y))$.   (D.68)
This, however, is simply a dot product between kernel PCA feature vectors: $S^{\top}\Phi(x)$ computes projections onto Eigenvectors of $C_\lambda$ (i.e. features), and $D^{-\frac{1}{2}}$ rescales them. Note that we have thus again arrived at the nonlinear tangent covariance matrix of Sec. 4.2.2; this time, however, the approach was motivated solely by constructing


invariant hyperplanes in feature space, and the nonlinear feature extraction by the tangent covariance matrix is a mere by-product.

To carry out kernel PCA on $C_\lambda$, we essentially have to go through the analysis of kernel PCA using $C_\lambda$ instead of the covariance matrix of the mapped data in $F$. The modifications arising from the fact that we are dealing with tangent vectors were already described in Sec. 4.2.2; hence, we shall presently only sketch the additional modifications for $\lambda > 0$: here, we are looking for solutions of the Eigenvalue equation $\bar\lambda V = C_\lambda V$ with $\bar\lambda > \lambda$ (let us assume that $\lambda < 1$, otherwise all Eigenvalues are identical to $\lambda$, the minimal Eigenvalue).[2] These lie in the span of the tangent vectors. In complete analogy to (3.14), we then arrive at
$\ell\bar\lambda\alpha = ((1-\lambda)K + \lambda I)\alpha$,   (D.69)
and the normalization condition for the coefficients $\alpha^k$ of the $k$-th Eigenvector reads
$1 = \frac{\alpha^k\cdot\alpha^k}{1-\lambda}\,(\bar\lambda_k - \lambda)$,   (D.70)
where the $\bar\lambda_k > \lambda$ are the Eigenvalues of $(1-\lambda)K + \lambda I$. Feature extraction is carried out as in (4.25).

We conclude by noting an essential difference to the approach of (4.11), which we believe is an advantage of the present method: in (4.11), the pattern preprocessing was assumed to be linear. In the present method, the goal to get invariant hyperplanes in feature space naturally led to a nonlinear preprocessing operation.

[2] If we want $\lambda I$ also to have an effect outside of the span of the tangent vectors, we have to modify the set in which we expand our solutions.


Bibliography

Y. S. Abu-Mostafa. Hints. Neural Computation, 7(4):639-671, 1995.
M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
S. Amari, N. Murata, K.-R. Müller, M. Finke, and H. Yang. Asymptotic statistical theory of overtraining and cross-validation. IEEE Trans. on Neural Networks, 8(5), 1997.
J. K. Anlauf and M. Biehl. The adatron: an adaptive perceptron algorithm. Europhys. Letters, 10:687-692, 1989.
H. Baird. Document image defect models. In Proceedings, IAPR Workshop on Syntactic and Structural Pattern Recognition, pages 38-46, Murray Hill, NJ, 1990.
H. Barlow. The neuron doctrine in perception. In M. Gazzaniga, editor, The Cognitive Neurosciences, pages 415-435. MIT Press, Cambridge, MA, 1995.
A. Barron. Predicted squared error: a criterion for automatic model selection. In S. Farlow, editor, Self-organizing Methods in Modeling. Marcel Dekker, New York, 1984.
Peter L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 134, Cambridge, MA, 1997. MIT Press.
D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
D. Beymer and T. Poggio. Image representations for visual learning. Science, 272(5270):1905-1909, 1996.
R. Bhatia. Matrix Analysis. Springer Verlag, New York, 1997.
C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.


V. Blanz. Bildbasierte Objekterkennung und die Bestimmung optimaler Ansichten. Diplomarbeit in Physik, Universität Tübingen, 1995.
V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks, ICANN'96, pages 251-256, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, 1992. ACM Press.
L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12th International Conference on Pattern Recognition and Neural Networks, Jerusalem, pages 77-87. IEEE Computer Society Press, 1994.
L. Bottou and V. N. Vapnik. Local learning algorithms. Neural Computation, 4(6):888-900, 1992.
J. Bromley and E. Säckinger. Neural-network and k-nearest-neighbor classifiers. Technical Report 11359-910819-16TM, AT&T, 1991.
H. H. Bülthoff and S. Edelman. Psychophysical support for a 2-D view interpolation theory of object recognition. Proceedings of the National Academy of Science, 89:60-64, 1992.
C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proceedings, 13th Intl. Conf. on Machine Learning, pages 71-77, San Mateo, CA, 1996. Morgan Kaufmann.
C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.
E. I. Chang and R. L. Lippmann. A boundary hunting radial basis function classifier which allocates centers constructively. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, San Mateo, CA, 1993. Morgan Kaufmann.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.


R. Courant and D. Hilbert. Methods of Mathematical Physics, volume 1. Interscience Publishers, Inc, New York, 1953.
K. I. Diamantaras and S. Y. Kung. Principal Component Neural Networks. Wiley, New York, 1996.
H. Drucker, R. Schapire, and P. Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7:705-719, 1993.
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
C. J. Duffy and R. H. Wurtz. Sensitivity of MST neurons to optic flow field stimuli. I. A continuum of response selectivity to large-field stimuli. Journal of Neurophysiology, 65:1329-1345, 1991.
N. Dunford and J. T. Schwartz. Linear Operators Part II: Spectral Theory, Self Adjoint Operators in Hilbert Space. Number VII in Pure and Applied Mathematics. John Wiley & Sons, New York, 1963.
S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58, 1992.
F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo No. 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1993.
I. Guyon, B. Boser, and V. Vapnik. Automatic capacity tuning of very large VC-dimension classifiers. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems, volume 5, pages 147-155. Morgan Kaufmann, San Mateo, CA, 1993.
I. Guyon, N. Matić, and V. Vapnik. Discovering informative patterns and data cleaning. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 181-203. MIT Press, Cambridge, MA, 1996.
J. B. Hampshire and A. Waibel. A novel objective function for improved phoneme recognition using time-delay neural networks. IEEE Trans. Neural Networks, 1:216-228, 1990.
W. Härdle. Applied Nonparametric Regression. Cambridge University Press, Cambridge, 1990.
T. Hastie and W. Stuetzle. Principal curves. JASA, 84:502-516, 1989.


S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, New York, 1994.
T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1-25, 1997.
H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441 and 498-520, 1933.
P. Jolicoeur, M. A. Gluck, and S. M. Kosslyn. From pictures to words: Making the connection. Cognitive Psychology, 16:243-275, 1984.
I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
M. Kac and S. M. Ulam. Mathematics and Logic. Praeger, Britannica perspective, New York, 1968.
J. Karhunen and J. Joutsensalo. Generalizations of principal component analysis, optimization problems, and neural networks. Neural Networks, 8(4):549-562, 1995.
K. Karhunen. Zur Spektraltheorie stochastischer Prozesse. Ann. Acad. Sci. Fenn., 34, 1946.
M. Kirby and L. Sirovich. Application of the Karhunen-Loève procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):103-108, 1990.
F. Klein. Vergleichende Betrachtungen über neuere geometrische Forschungen. Verlag von Andreas Deichert, Erlangen, 1872.
T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69, 1982.
A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1:1-7, 1965.
H. G. Krapp and R. Hengstenberg. Estimation of self-motion by optic flow processing in single visual interneurons. Nature, 384:463-466, 1996.
U. Kressel. Private communication. The quoted results are summarized on ftp://ftp.mpik-tueb.mpg.de/pub/chair dataset/README, 1996.
Y. LeCun. Une procédure d'apprentissage pour Réseau à seuil assymmétrique. In Cognitiva: A la Frontière de l'Intelligence Artificielle des Sciences de la Connaissance des Neurosciences, pages 599-604, Paris, France, 1985. CESTA.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. J. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551, 1989.


Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition. In F. Fogelman-Soulié and P. Gallinari, editors, Proceedings ICANN'95, International Conference on Artificial Neural Networks, volume II, pages 53-60, Nanterre, France, 1995. EC2.
J. Liter et al. Psychophysical and computational experiments on the MPI object databases. Technical report, Max-Planck-Institut für biologische Kybernetik, 1997.
N. K. Logothetis, J. Pauls, and T. Poggio. Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5:552-563, 1995.
D. G. Luenberger. Introduction to Linear and Nonlinear Programming. Addison-Wesley, Reading, MA, 1973.
M. Mishkin, L. G. Ungerleider, and K. A. Macko. Object vision and spatial vision: two cortical pathways. Trends in Neurosciences, 6:414-417, 1983.
J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.
K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In ICANN'97, page 999. Springer Lecture Notes in Computer Science, 1997.
M. Z. Nashed and G. Wahba. Generalized inverses in reproducing kernel spaces: An approach to regularization of linear operator equations. SIAM J. Math. Anal., 5:974-987, 1974.
E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biology, 15:267-273, 1982.
E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In NNSP'97, 1997. In press.
K. Pearson. On lines and planes of closest fit to points in space. Philosophical Magazine, 2 (sixth series):559-572, 1901.
T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201-209, 1975.
T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1990.
T. Poggio and T. Vetter. Recognition and structure from one 2D model view: observations on prototypes, object classes, and symmetries. A.I. Memo No. 1347, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1992.


H. Primas. Chemistry, Quantum Mechanics and Reductionism. Springer-Verlag, Berlin, 1983. Second edition.
R. P. N. Rao and D. H. Ballard. Localized receptive fields may mediate transformation-invariant recognition in the visual cortex. Technical Report 97.2, National Resource Laboratory for the Study of Brain and Behavior, Computer Science Department, University of Rochester, 1997.
M. Reed and B. Simon. Methods of Modern Mathematical Physics. Vol. 1: Functional Analysis. Academic Press, San Diego, 1980.
D. Reilly, L. N. Cooper, and C. Elbaum. A neural model for category learning. Biol. Cybern., 45:35-41, 1982.
B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996.
J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.
H. J. Ritter, T. M. Martinetz, and K. J. Schulten. Neuronale Netze: Eine Einführung in die Neuroinformatik selbstorganisierender Abbildungen. Addison-Wesley, Munich, Germany, 1990.
E. Rosch, C. B. Mervis, W. Gray, D. Johnson, and P. Boyes-Braem. Basic objects in natural categories. Cognitive Psychology, 8:382-439, 1976.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(9):533-536, 1986.
T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward network. Neural Networks, 2:459-473, 1989.
R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, 1997.
B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining. AAAI Press, Menlo Park, CA, 1995.
B. Schölkopf. Künstliches Lernen. In S. Bornholdt and P. H. Feindt, editors, Komplexe adaptive Systeme (Forum für Interdisziplinäre Forschung Bd. 15), pages 93-117. Röll, Dettelbach, 1996.
B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks, ICANN'96, pages 47-52, Berlin, 1996a. Springer Lecture Notes in Computer Science, Vol. 1112.


B. Schölkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. Accepted for NIPS'97 (Proceedings to be published by MIT Press), 1997a.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Technical Report 44, Max-Planck-Institut für biologische Kybernetik, 1996b. In press (Neural Computation).
B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In ICANN'97, page 583. Springer Lecture Notes in Computer Science, 1997b.
B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with gaussian kernels to radial basis function classifiers. A.I. Memo No. 1599, Massachusetts Institute of Technology, 1996c.
J. Schürmann. Pattern Classification: a unified view of statistical and neural approaches. Wiley, New York, 1996.
J. Segman, J. Rubinstein, and Y. Y. Zeevi. The canonical coordinates method for pattern deformation: theoretical and computational considerations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:1171-1183, 1992.
J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. A framework for structural risk minimization. In COLT, 1996.
P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 50-58, San Mateo, CA, 1993. Morgan Kaufmann.
P. Simard, B. Victorri, Y. LeCun, and J. Denker. Tangent prop, a formalism for specifying selected invariances in an adaptive network. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 895-903, San Mateo, CA, 1992. Morgan Kaufmann.
A. Smola and B. Schölkopf. From regularization operators to support vector kernels. Accepted for NIPS'97, 1997a.
A. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 1997b. In press.
A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Submitted to Neural Networks, 1997.
A. J. Smola. Regression estimation with support vector learning machines. Diplomarbeit, Technische Universität München, 1996.
K. Sung. Learning and Example Selection for Object and Pattern Detection. PhD thesis, Massachusetts Institute of Technology, 1996.


A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. W. H. Winston, Washington, D.C., 1977.
N. Troje and H. Bülthoff. How is bilateral symmetry of human faces used for recognition of novel views? Technical Report 38, Max-Planck-Institut für biologische Kybernetik, 1996. To appear in Vision Research.
S. Ullman. Aligning pictorial descriptions: An approach to object recognition. Cognition, 32:193-254, 1989.
S. Ullman. High-Level Vision. MIT Press, Cambridge, MA, 1996.
V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982).
V. Vapnik. Inductive principles of statistics and learning theory. In P. Smolensky, M. C. Mozer, and D. E. Rumelhart, editors, Mathematical Perspectives on Neural Networks. Lawrence Erlbaum, Mahwah, NJ, 1995a.
V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995b.
V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. Forthcoming.
V. Vapnik and A. Chervonenkis. Uniform convergence of frequencies of occurence of events to their probabilities. Dokl. Akad. Nauk SSSR, 181:915-918, 1968.
V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 281-287, Cambridge, MA, 1997. MIT Press.
T. Vetter. An early vision model for 3D object recognition. In Fachtagung der Gesellschaft für Kognitionswissenschaften, 1994.
T. Vetter and T. Poggio. Symmetric 3D objects are an easy case for 2D object recognition. Spatial Vision, 8(4):443-453, 1994.
T. Vetter and T. Poggio. Linear object classes and image synthesis from a single example image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997. In press.
T. Vetter, T. Poggio, and H. Bülthoff. The importance of symmetry and virtual views in three-dimensional object recognition. Current Biology, 4:18-23, 1994.


T. Vetter and N. Troje. Separation of texture and shape in images of faces for image coding and synthesis. Journal of the Optical Society of America, in press, 1997.
G. Wahba. Convergence rates of certain approximate solutions to Fredholm integral equations of the first kind. Journal of Approximation Theory, 7:167-185, 1973.
C. K. I. Williams. Prediction with Gaussian processes. Preprint, 1997.
A. Yuille and N. Grzywacz. The motion coherence theory. In Proceedings of the International Conference on Computer Vision, pages 344-354, Washington, D.C., 1988. IEEE Computer Society Press.


Journal of Machine Learning Research 12 (2011) 2825-2830 Submitted 3/11; Revised 8/11; Published 10/11

Scikit-learn: Machine Learning in Python

Fabian Pedregosa ([email protected])
Gael Varoquaux ([email protected])
Alexandre Gramfort ([email protected])
Vincent Michel ([email protected])
Bertrand Thirion ([email protected])
Parietal, INRIA Saclay; Neurospin, Bat 145, CEA Saclay, 91191 Gif sur Yvette – France

Olivier Grisel ([email protected])
Nuxeo, 20 rue Soleillet, 75 020 Paris – France

Mathieu Blondel ([email protected])
Kobe University, 1-1 Rokkodai, Nada, Kobe 657-8501 – Japan

Peter Prettenhofer ([email protected])
Bauhaus-Universitat Weimar, Bauhausstr. 11, 99421 Weimar – Germany

Ron Weiss ([email protected])
Google Inc, 76 Ninth Avenue, New York, NY 10011 – USA

Vincent Dubourg ([email protected])
Clermont Universite, IFMA, EA 3867, LaMI, BP 10448, 63000 Clermont-Ferrand – France

Jake Vanderplas ([email protected])
Astronomy Department, University of Washington, Box 351580, Seattle, WA 98195 – USA

Alexandre Passos ([email protected])
IESL Lab, UMass Amherst, Amherst MA 01002 – USA

David Cournapeau ([email protected])
Enthought, 21 J.J. Thompson Avenue, Cambridge, CB3 0FA – UK

©2011 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot and Edouard Duchesnay

REF [5]


Matthieu Brucher ([email protected])
Total SA, CSTJF, avenue Larribau, 64000 Pau – France

Matthieu Perrot ([email protected])
Edouard Duchesnay ([email protected])
LNAO, Neurospin, Bat 145, CEA Saclay, 91191 Gif sur Yvette – France

Editor: Mikio Braun

Abstract

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

Keywords: Python, supervised learning, unsupervised learning, model selection

1. Introduction

The Python programming language is establishing itself as one of the most popular languages for scientific computing. Thanks to its high-level interactive nature and its maturing ecosystem of scientific libraries, it is an appealing choice for algorithmic development and exploratory data analysis (Dubois, 2007; Milmann and Avaizis, 2011). Yet, as a general-purpose language, it is increasingly used not only in academic settings but also in industry.

Scikit-learn harnesses this rich environment to provide state-of-the-art implementations of many well known machine learning algorithms, while maintaining an easy-to-use interface tightly integrated with the Python language. This answers the growing need for statistical data analysis by non-specialists in the software and web industries, as well as in fields outside of computer science, such as biology or physics. Scikit-learn differs from other machine learning toolboxes in Python for various reasons: i) it is distributed under the BSD license; ii) it incorporates compiled code for efficiency, unlike MDP (Zito et al., 2008) and pybrain (Schaul et al., 2010); iii) it depends only on numpy and scipy to facilitate easy distribution, unlike pymvpa (Hanke et al., 2009) that has optional dependencies such as R and shogun; and iv) it focuses on imperative programming, unlike pybrain which uses a data-flow framework. While the package is mostly written in Python, it incorporates the C++ libraries LibSVM (Chang and Lin, 2001) and LibLinear (Fan et al., 2008) that provide reference implementations of SVMs and generalized linear models with compatible licenses. Binary packages are available on a rich set of platforms including Windows and any POSIX platforms.


Furthermore, thanks to its liberal license, it has been widely distributed as part of major free software distributions such as Ubuntu, Debian, Mandriva, NetBSD and Macports and in commercial distributions such as the "Enthought Python Distribution".

2. Project Vision

Code quality. Rather than providing as many features as possible, the project's goal has been to provide solid implementations. Code quality is ensured with unit tests (as of release 0.8, test coverage is 81%) and the use of static analysis tools such as pyflakes and pep8. Finally, we strive to use consistent naming for the functions and parameters used throughout a strict adherence to the Python coding guidelines and numpy style documentation.

BSD licensing. Most of the Python ecosystem is licensed with non-copyleft licenses. While such policy is beneficial for adoption of these tools by commercial projects, it does impose some restrictions: we are unable to use some existing scientific code, such as the GSL.

Bare-bone design and API. To lower the barrier of entry, we avoid framework code and keep the number of different objects to a minimum, relying on numpy arrays for data containers.

Community-driven development. We base our development on collaborative tools such as git, github and public mailing lists. External contributions are welcome and encouraged.

Documentation. Scikit-learn provides a ~300 page user guide including narrative documentation, class references, a tutorial, installation instructions, as well as more than 60 examples, some featuring real-world applications. We try to minimize the use of machine-learning jargon, while maintaining precision with regards to the algorithms employed.

3. Underlying Technologies

Numpy: the base data structure used for data and model parameters. Input data is presented as numpy arrays, thus integrating seamlessly with other scientific Python libraries. Numpy's view-based memory model limits copies, even when binding with compiled code (Van der Walt et al., 2011). It also provides basic arithmetic operations.

Scipy: efficient algorithms for linear algebra, sparse matrix representation, special functions and basic statistical functions. Scipy has bindings for many Fortran-based standard numerical packages, such as LAPACK. This is important for ease of installation and portability, as providing libraries around Fortran code can prove challenging on various platforms.

Cython: a language for combining C in Python. Cython makes it easy to reach the performance of compiled languages with Python-like syntax and high-level operations. It is also used to bind compiled libraries, eliminating the boilerplate code of Python/C extensions.
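A tiny illustration of the data conventions described above: estimators work directly on plain numpy arrays (samples by features), and scipy provides the sparse representations and linear-algebra routines they build on. The values below are arbitrary.

```python
import numpy as np
from scipy import sparse

X = np.array([[0.0, 1.0], [2.0, 3.0], [0.0, 0.0]])  # 3 samples, 2 features
X_sparse = sparse.csr_matrix(X)                      # the same data in sparse form
print(X.shape, X_sparse.nnz)                         # (3, 2) 3
```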

4. Code Design

Objects specified by interface, not by inheritance. To facilitate the use of external objects with scikit-learn, inheritance is not enforced; instead, code conventions provide a consistent interface. The central object is an estimator, that implements a fit method, accepting as arguments an input data array and, optionally, an array of labels for supervised problems. Supervised estimators, such as SVM classifiers, can implement a predict method. Some estimators, that we call transformers, for example, PCA, implement a transform method, returning modified input data.


                               scikit-learn   mlpy    pybrain   pymvpa   mdp     shogun
Support Vector Classification  5.2            9.47    17.5      11.52    40.48   5.63
Lasso (LARS)                   1.17           105.3   -         37.35    -       -
Elastic Net                    0.52           73.7    -         1.44     -       -
k-Nearest Neighbors            0.57           1.41    -         0.56     0.58    1.36
PCA (9 components)             0.18           -       -         8.93     0.47    0.33
k-Means (9 clusters)           1.34           0.79    *         -        35.75   0.68
License                        BSD            GPL     BSD       BSD      BSD     GPL

-: Not implemented.  *: Does not converge within 1 hour.

Table 1: Time in seconds on the Madelon data set for various machine learning libraries exposed in Python: MLPy (Albanese et al., 2008), PyBrain (Schaul et al., 2010), pymvpa (Hanke et al., 2009), MDP (Zito et al., 2008) and Shogun (Sonnenburg et al., 2010). For more benchmarks see http://github.com/scikit-learn.

Estimators may also provide a score method, which is an increasing evaluation of goodness of fit: a log-likelihood, or a negated loss function. The other important object is the cross-validation iterator, which provides pairs of train and test indices to split input data, for example K-fold, leave one out, or stratified cross-validation.
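A short usage sketch of the estimator interface described above: fit/predict/score on a supervised estimator (an SVM classifier) and fit/transform on a transformer (PCA). The data is synthetic, and the class and method names are the standard scikit-learn ones, though import paths in the very early releases discussed in this paper differed slightly.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 10)
y = (X[:, 0] > 0.5).astype(int)

clf = SVC(kernel="rbf").fit(X, y)          # supervised estimator: fit, then predict/score
print(clf.predict(X[:5]), clf.score(X, y))

pca = PCA(n_components=3).fit(X)           # transformer: fit, then transform
print(pca.transform(X).shape)              # (100, 3)
```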

Model selection. Scikit-learn can evaluate an estimator's performance or select parameters using cross-validation, optionally distributing the computation to several cores. This is accomplished by wrapping an estimator in a GridSearchCV object, where the "CV" stands for "cross-validated". During the call to fit, it selects the parameters on a specified parameter grid, maximizing a score (the score method of the underlying estimator). predict, score, or transform are then delegated to the tuned estimator. This object can therefore be used transparently as any other estimator. Cross-validation can be made more efficient for certain estimators by exploiting specific properties, such as warm restarts or regularization paths (Friedman et al., 2010). This is supported through special objects, such as the LassoCV. Finally, a Pipeline object can combine several transformers and an estimator to create a combined estimator to, for example, apply dimension reduction before fitting. It behaves as a standard estimator, and GridSearchCV therefore tunes the parameters of all steps.
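The following is a hedged sketch of the model-selection machinery described above: a Pipeline chaining a PCA transformer and an SVM classifier, wrapped in GridSearchCV so that the parameters of all steps are tuned by cross-validation. The parameter grid is illustrative, and the import path sklearn.model_selection applies to current releases (older versions exposed grid search under a different module).

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(1)
X = rng.rand(200, 20)
y = (X[:, :2].sum(axis=1) > 1.0).astype(int)

pipe = Pipeline([("reduce", PCA()), ("clf", SVC())])
grid = {"reduce__n_components": [5, 10], "clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, grid, cv=5).fit(X, y)   # behaves like any other estimator
print(search.best_params_, search.score(X, y))
```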

5. High-level yet Efficient: Some Trade Offs

While scikit-learn focuses on ease of use, and is mostly written in a high level language, care has been taken to maximize computational efficiency. In Table 1, we compare computation time for a few algorithms implemented in the major machine learning toolkits accessible in Python. We use the Madelon data set (Guyon et al., 2004), with 4400 instances and 500 attributes. The data set is quite large, but small enough for most algorithms to run.

SVM. While all of the packages compared call libsvm in the background, the performance of scikit-learn can be explained by two factors. First, our bindings avoid memory copies and have up to 40% less overhead than the original libsvm Python bindings. Second, we patch libsvm to improve efficiency on dense data, use a smaller memory footprint, and better use memory alignment and pipelining capabilities of modern processors. This patched version also provides unique features, such as setting weights for individual samples.


LARS. Iteratively refining the residuals instead of recomputing them gives performance gains of 2–10 times over the reference R implementation (Hastie and Efron, 2004). Pymvpa uses this implementation via the Rpy R bindings and pays a heavy price to memory copies.

Elastic Net. We benchmarked the scikit-learn coordinate descent implementations of Elastic Net. It achieves the same order of performance as the highly optimized Fortran version glmnet (Friedman et al., 2010) on medium-scale problems, but performance on very large problems is limited since we do not use the KKT conditions to define an active set.

kNN. The k-nearest neighbors classifier implementation constructs a ball tree (Omohundro, 1989) of the samples, but uses a more efficient brute force search in large dimensions.

PCA. For medium to large data sets, scikit-learn provides an implementation of a truncated PCA based on random projections (Rokhlin et al., 2009).

k-means. scikit-learn's k-means algorithm is implemented in pure Python. Its performance is limited by the fact that numpy's array operations take multiple passes over data.

6. Conclusion

Scikit-learn exposes a wide variety of machine learning algorithms, both supervised and unsupervised, using a consistent, task-oriented interface, thus enabling easy comparison of methods for a given application. Since it relies on the scientific Python ecosystem, it can easily be integrated into applications outside the traditional range of statistical data analysis. Importantly, the algorithms, implemented in a high-level language, can be used as building blocks for approaches specific to a use case, for example, in medical imaging (Michel et al., 2011). Future work includes online learning, to scale to large data sets.

References

D. Albanese, S. Merler, G. Jurman, and R. Visintainer. MLPy: high-performance Python package for predictive modeling. In NIPS, MLOSS Workshop, 2008.

C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/cjlin/libsvm, 2001.

P.F. Dubois, editor. Python: Batteries Included, volume 9 of Computing in Science & Engineering. IEEE/AIP, May 2007.

R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: a library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

I. Guyon, S.R. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge, 2004.

M. Hanke, Y.O. Halchenko, P.B. Sederberg, S.J. Hanson, J.V. Haxby, and S. Pollmann. PyMVPA: A Python toolbox for multivariate pattern analysis of fMRI data. Neuroinformatics, 7(1):37–53, 2009.


T. Hastie and B. Efron. Least Angle Regression, Lasso and Forward Stagewise. http://cran.r-project.org/web/packages/lars/lars.pdf, 2004.

V. Michel, A. Gramfort, G. Varoquaux, E. Eger, C. Keribin, and B. Thirion. A supervised clustering approach for fMRI-based inference of brain states. Pattern Recognition, epub ahead of print, April 2011. doi: 10.1016/j.patcog.2011.04.006.

K.J. Milmann and M. Avaizis, editors. Scientific Python, volume 11 of Computing in Science & Engineering. IEEE/AIP, March 2011.

S.M. Omohundro. Five balltree construction algorithms. ICSI Technical Report TR-89-063, 1989.

V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2009.

T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. The Journal of Machine Learning Research, 11:743–746, 2010.

S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, and V. Franc. The SHOGUN machine learning toolbox. Journal of Machine Learning Research, 11:1799–1802, 2010.

S. Van der Walt, S.C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science and Engineering, 11, 2011.

T. Zito, N. Wilbert, L. Wiskott, and P. Berkes. Modular toolkit for data processing (MDP): A Python data processing framework. Frontiers in Neuroinformatics, 2, 2008.


International Journal of Computer Vision 38(1), 9–13, 2000. © 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

Statistical Learning Theory: A Primer

THEODOROS EVGENIOU, MASSIMILIANO PONTIL AND TOMASO POGGIO
Center for Biological and Computational Learning, Artificial Intelligence Laboratory, MIT, Cambridge, MA
[email protected]
[email protected]
[email protected]

Abstract. In this paper we first overview the main concepts of Statistical Learning Theory, a framework in which learning from examples can be studied in a principled way. We then briefly discuss well known as well as emerging learning techniques such as Regularization Networks and Support Vector Machines which can be justified in terms of the same induction principle.

Keywords: VC-dimension, structural risk minimization, regularization networks, support vector machines

1. Introduction

The goal of this paper is to provide a short introduction to Statistical Learning Theory (SLT), which studies problems and techniques of supervised learning. For a more detailed review of SLT see Evgeniou et al. (1999). In supervised learning, or learning-from-examples, a machine is trained, instead of programmed, to perform a given task on a number of input-output pairs. According to this paradigm, training means choosing a function which best describes the relation between the inputs and the outputs. The central question of SLT is how well the chosen function generalizes, or how well it estimates the output for previously unseen inputs.

We will consider techniques which lead to solutions of the form

f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i),     (1)

where the x_i, i = 1, ..., ℓ are the input examples, K a certain symmetric positive definite function named kernel, and c_i a set of parameters to be determined from the examples. This function is found by minimizing functionals of the type

H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \| f \|_K^2,

where V is a loss function which measures the goodness of the predicted output f(x_i) with respect to the given output y_i, ‖f‖_K^2 a smoothness term which can be thought of as a norm in the Reproducing Kernel Hilbert Space defined by the kernel K, and λ a positive parameter which controls the relative weight between the data and the smoothness term. The choice of the loss function determines different learning techniques, each leading to a different learning algorithm for computing the coefficients c_i.
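As an illustration only (not part of the paper), the sketch below evaluates a kernel expansion of the form of Eq. (1) with a Gaussian kernel and the corresponding regularized functional H[f] for the squared loss, using the standard identity ‖f‖_K^2 = c^T G c for functions in the span of the kernel sections; all data and parameter values are made up.

import numpy as np

def gaussian_kernel(x, z, beta=1.0):
    # K(x, z) = exp(-beta * ||x - z||^2), a symmetric positive definite kernel
    return np.exp(-beta * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def f(x, X_train, c, beta=1.0):
    # kernel expansion of Eq. (1): f(x) = sum_i c_i K(x, x_i)
    return sum(ci * gaussian_kernel(x, xi, beta) for ci, xi in zip(c, X_train))

def regularized_functional(c, G, y, lam):
    # H[f] = (1/l) sum_i V(y_i, f(x_i)) + lambda * ||f||_K^2 with the squared loss,
    # where G is the Gram matrix G_ij = K(x_i, x_j) and ||f||_K^2 = c^T G c
    preds = G @ c
    return np.mean((y - preds) ** 2) + lam * (c @ G @ c)

rng = np.random.RandomState(0)
X_train = rng.randn(20, 2)
y = np.sign(X_train[:, 0])
G = np.array([[gaussian_kernel(a, b) for b in X_train] for a in X_train])
c = 0.1 * rng.randn(20)
print(regularized_functional(c, G, y, lam=0.1))
print(f(np.array([0.5, -0.2]), X_train, c))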

The rest of the paper is organized as follows. Section 2 presents the main idea and concepts in the theory. Section 3 discusses Regularization Networks and Support Vector Machines, two important techniques which produce outputs of the form of Eq. (1).

2. Statistical Learning Theory

We consider two sets of random variables x ∈ X ⊆ R^d and y ∈ Y ⊆ R related by a probabilistic relationship.



The relationship is probabilistic because generally an element of X does not determine uniquely an element of Y, but rather a probability distribution on Y. This can be formalized assuming that an unknown probability distribution P(x, y) is defined over the set X × Y. We are provided with examples of this probabilistic relationship, that is with a data set D_ℓ ≡ {(x_i, y_i) ∈ X × Y}_{i=1}^{ℓ} called training set, obtained by sampling ℓ times the set X × Y according to P(x, y). The "problem of learning" consists in, given the data set D_ℓ, providing an estimator, that is a function f : X → Y, that can be used, given any value of x ∈ X, to predict a value y. For example X could be the set of all possible images, Y the set {−1, 1}, and f(x) an indicator function which specifies whether image x contains a certain object (y = 1) or not (y = −1) (see for example Papageorgiou et al. (1998)). Another example is the case where x is a set of parameters, such as pose or facial expressions, y is a motion field relative to a particular reference image of a face, and f(x) is a regression function which maps parameters to motion (see for example Ezzat and Poggio (1996)).

In SLT, the standard way to solve the learning problem consists in defining a risk functional, which measures the average amount of error or risk associated with an estimator, and then looking for the estimator with the lowest risk. If V(y, f(x)) is the loss function measuring the error we make when we predict y by f(x), then the average error, the so called expected risk, is:

I[f] \equiv \int_{X, Y} V(y, f(x)) \, P(x, y) \, dx \, dy.

We assume that the expected risk is defined on a "large" class of functions F and we will denote by f_0 the function which minimizes the expected risk in F. The function f_0 is our ideal estimator, and it is often called the target function. This function cannot be found in practice, because the probability distribution P(x, y) that defines the expected risk is unknown, and only a sample of it, the data set D_ℓ, is available. To overcome this shortcoming we need an induction principle that we can use to "learn" from the limited number of training data we have. SLT, as developed by Vapnik (Vapnik, 1998), builds on the so-called empirical risk minimization (ERM) induction principle. The ERM method consists in using the data set D_ℓ to build a stochastic approximation of the expected risk, which is usually called the empirical risk, defined as

I_{emp}[f; \ell] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)).

Straight minimization of the empirical risk in F can be problematic. First, it is usually an ill-posed problem (Tikhonov and Arsenin, 1977), in the sense that there might be many, possibly infinitely many, functions minimizing the empirical risk. Second, it can lead to overfitting, meaning that although the minimum of the empirical risk can be very close to zero, the expected risk, which is what we are really interested in, can be very large.

SLT provides probabilistic bounds on the distance between the empirical and expected risk of any function (therefore including the minimizer of the empirical risk in a function space that can be used to control overfitting). The bounds involve the number of examples ℓ and the capacity h of the function space, a quantity measuring the "complexity" of the space. Appropriate capacity quantities are defined in the theory, the most popular one being the VC-dimension (Vapnik and Chervonenkis, 1971) or scale sensitive versions of it (Kearns and Shapire, 1994; Alon et al., 1993). The bounds have the following general form: with probability at least η,

I[f] < I_{emp}[f] + \Phi\!\left(\sqrt{\tfrac{h}{\ell}}, \eta\right),     (2)

where h is the capacity and Φ an increasing function of h/ℓ and η. For more information and for exact forms of the function Φ we refer the reader to (Vapnik and Chervonenkis, 1971; Vapnik, 1998; Alon et al., 1993). Intuitively, if the capacity of the function space in which we perform empirical risk minimization is very large and the number of examples is small, then the distance between the empirical and expected risk can be large and overfitting is very likely to occur.

Since the space F is usually very large (e.g. F could be the space of square integrable functions), one typically considers smaller hypothesis spaces H. Moreover, inequality (2) suggests an alternative method for achieving good generalization: instead of minimizing the empirical risk, find the best trade off between the empirical risk and the complexity of the hypothesis space measured by the second term in the r.h.s. of inequality (2). This observation leads to the method of Structural Risk Minimization (SRM).


The idea of SRM is to define a nested sequence of hypothesis spaces H_1 ⊂ H_2 ⊂ · · · ⊂ H_M, where each hypothesis space H_m has finite capacity h_m, larger than that of all previous sets, that is: h_1 ≤ h_2 ≤ · · · ≤ h_M. For example H_m could be the set of polynomials of degree m, or a set of splines with m nodes, or some more complicated nonlinear parameterization. Using such a nested sequence of more and more complex hypothesis spaces, SRM consists of choosing the minimizer of the empirical risk in the space H_{m*} for which the bound on the structural risk, as measured by the right hand side of inequality (2), is minimized. Further information about the statistical properties of SRM can be found in Devroye et al. (1996), Vapnik (1998).

To summarize, in SLT the problem of learning from examples is solved in three steps: (a) we define a loss function V(y, f(x)) measuring the error of predicting the output of input x with f(x) when the actual output is y; (b) we define a nested sequence of hypothesis spaces H_m, m = 1, ..., M, whose capacity is an increasing function of m; (c) we minimize the empirical risk in each of H_m and choose, among the solutions found, the one with the best trade off between the empirical risk and the capacity as given by the right hand side of inequality (2).
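A toy sketch of these three steps (not from the paper): the nested hypothesis spaces H_m are polynomials of increasing degree, the empirical risk is minimized in each H_m by least squares, and, since the capacity term of inequality (2) is not computed here, an independent validation estimate of the risk stands in for the right hand side of the bound.

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.2 * rng.randn(40)          # noisy samples; V(y, f(x)) = (y - f(x))^2
x_val = np.linspace(-1, 1, 200)
y_val = np.sin(3 * x_val)                        # held-out proxy for the expected risk

best = None
for m in range(1, 11):                           # nested spaces H_m: polynomials of degree m
    coeffs = np.polyfit(x, y, deg=m)             # empirical risk minimizer in H_m
    emp_risk = np.mean((y - np.polyval(coeffs, x)) ** 2)
    val_risk = np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)  # stand-in for the bound
    if best is None or val_risk < best[1]:
        best = (m, val_risk, emp_risk)
print("selected degree:", best[0])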

3. Learning Machines

3.1. Learning as Functional Minimization

We now consider hypothesis spaces which are subsets of a Reproducing Kernel Hilbert Space (RKHS) (Wahba, 1990). A RKHS is a Hilbert space of functions f of the form f(x) = \sum_{n=1}^{N} a_n \phi_n(x), where {φ_n(x)}_{n=1}^{N} is a set of given, linearly independent basis functions and N can be possibly infinite. A RKHS is equipped with a norm which is defined as:

\| f \|_K^2 = \sum_{n=1}^{N} \frac{a_n^2}{\lambda_n},

where {λ_n}_{n=1}^{N} is a decreasing, positive sequence of real values whose sum is finite. The constants λ_n and the basis functions {φ_n}_{n=1}^{N} define the symmetric positive definite kernel function:

K(x, y) = \sum_{n=1}^{N} \lambda_n \phi_n(x) \phi_n(y).

A nested sequence of spaces of functions in the RKHS can be constructed by bounding the RKHS norm of functions in the space. This can be done by defining a set of constants A_1 < A_2 < · · · < A_M and considering spaces of the form:

H_m = \{ f \in \mathrm{RKHS} : \| f \|_K \le A_m \}.

It can be shown that the capacity of the hypothesis spaces H_m is an increasing function of A_m (see for example Evgeniou et al. (1999)). According to the scheme given at the end of Section 2, the solution of the learning problem is found by solving, for each A_m, the following optimization problem:

\min_f \sum_{i=1}^{\ell} V(y_i, f(x_i)) \quad \text{subject to} \quad \| f \|_K \le A_m,

and choosing, among the solutions found for each A_m, the one with the best trade off between empirical risk and capacity, i.e. the one which minimizes the bound on the structural risk as given by inequality (2).

The implementation of the SRM method described above is not practical because it requires to look for the solution of a large number of constrained optimization problems. This difficulty is overcome by searching for the minimum of:

H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \| f \|_K^2.     (3)

The functional H[f] contains both the empirical risk and the norm (complexity or smoothness) of f in the RKHS, similarly to functionals considered in regularization theory (Tikhonov and Arsenin, 1977). The regularization parameter λ penalizes functions with high capacity: the larger λ, the smaller the RKHS norm of the solution will be.

When implementing SRM, the key issue is the choice of the hypothesis space, i.e. the parameter H_m where the structural risk is minimized. In the case of the functional of Eq. (3), the key issue becomes the choice of the regularization parameter λ. These two problems, as discussed in Evgeniou et al. (1999), are related, and the SRM method can in principle be used to choose λ (Vapnik, 1998). In practice, instead of using SRM other methods are used such as cross-validation (Wahba, 1990), Generalized Cross Validation, Finite Prediction Error and the MDL criteria (see Vapnik (1998) for a review and comparison).

An important feature of the minimizer of H[f] is that, independently of the loss function V, the minimizer has the same general form (Wahba, 1990):

f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i).     (4)

Notice that Eq. (4) establishes a representation of the function f as a linear combination of kernels centered in each data point. Using different kernels we get functions such as Gaussian radial basis functions (K(x, y) = exp(−β‖x − y‖^2)), or polynomials of degree d (K(x, y) = (1 + x · y)^d) (Girosi et al., 1995; Vapnik, 1998).
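For illustration (not part of the paper), the following sketch builds the Gram matrices of the two kernels just mentioned, the Gaussian radial basis function and the polynomial kernel, and checks numerically that they are symmetric and positive semidefinite; β, d and the data are arbitrary.

import numpy as np

def rbf_gram(X, beta=1.0):
    # G_ij = exp(-beta * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-beta * d2)

def poly_gram(X, d=3):
    # G_ij = (1 + x_i . x_j)^d
    return (1.0 + X @ X.T) ** d

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
for G in (rbf_gram(X), poly_gram(X)):
    assert np.allclose(G, G.T)            # symmetric
    eigvals = np.linalg.eigvalsh(G)
    print(eigvals.min() >= -1e-8)         # numerically positive semidefinite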

We now turn to discuss a few learning techniques based on the minimization of functionals of the form (3) by specifying the loss function V. In particular, we will consider Regularization Networks and Support Vector Machines (SVM), a learning technique which has recently been proposed for both classification and regression problems (see Vapnik (1998) and references therein):

– Regularization Networks:

V(y_i, f(x_i)) = (y_i − f(x_i))^2,     (5)

– SVM Classification:

V(y_i, f(x_i)) = |1 − y_i f(x_i)|_+,     (6)

where |x|_+ = x if x > 0 and zero otherwise.

– SVM Regression:

V(y_i, f(x_i)) = |y_i − f(x_i)|_ε,     (7)

where the function |·|_ε, called ε-insensitive loss, is defined as:

|x|_\varepsilon \equiv \begin{cases} 0 & \text{if } |x| < \varepsilon \\ |x| - \varepsilon & \text{otherwise}. \end{cases}     (8)

We now briefly discuss each of these three techniques.

3.2. Regularization Networks

The approximation scheme that arises from the minimization of the quadratic functional

\frac{1}{\ell} \sum_{i=1}^{\ell} (y_i − f(x_i))^2 + \lambda \| f \|_K^2     (9)

for a fixed λ is a special form of regularization. It is possible to show (see for example Girosi et al. (1995)) that the coefficients c_i of the minimizer of (9) in Eq. (4) satisfy the following linear system of equations:

(G + \lambda I) c = y,     (10)

where I is the identity matrix, and we have defined

(y)_i = y_i, \quad (c)_i = c_i, \quad (G)_{ij} = K(x_i, x_j).

Since the coefficients c_i satisfy a linear system, Eq. (4) can be rewritten as:

f(x) = \sum_{i=1}^{\ell} y_i b_i(x),     (11)

with b_i(x) = \sum_{j=1}^{\ell} (G + \lambda I)^{-1}_{ij} K(x_j, x). Equation (11) gives the dual representation of RN. Notice the difference between Eqs. (4) and (11): in the first one the coefficients c_i are learned from the data while in the second one the basis functions b_i are learned, the coefficient of the expansion being equal to the output of the examples. We refer to (Girosi et al., 1995) for more information on the dual representation.
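A minimal sketch of a Regularization Network (not from the paper): the coefficients are obtained by solving the linear system (10) for a Gaussian kernel, and the resulting estimator of Eq. (4) is evaluated on test points; the data, kernel width and λ are illustrative only.

import numpy as np

def rbf_gram(X, Z, beta=1.0):
    # G_ij = K(x_i, z_j) = exp(-beta * ||x_i - z_j||^2)
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-beta * d2)

rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 40)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.randn(40)

lam = 1e-2
G = rbf_gram(X, X)
c = np.linalg.solve(G + lam * np.eye(len(X)), y)   # Eq. (10): (G + lambda*I) c = y

X_test = np.linspace(-3, 3, 200)[:, None]
f_test = rbf_gram(X_test, X) @ c                   # Eq. (4): f(x) = sum_i c_i K(x, x_i)
print(np.max(np.abs(f_test - np.sin(X_test[:, 0]))))   # rough check of the fit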

3.3. Support Vector Machines

We now discuss Support Vector Machines (SVM) (Cortes and Vapnik, 1995; Vapnik, 1998). We distinguish between real output (regression) and binary output (classification) problems. The method of SVM regression corresponds to the following minimization:

\min_f \; \frac{1}{\ell} \sum_{i=1}^{\ell} |y_i − f(x_i)|_\varepsilon + \lambda \| f \|_K^2,     (12)

while the method of SVM classification corresponds to:

\min_f \; \frac{1}{\ell} \sum_{i=1}^{\ell} |1 − y_i f(x_i)|_+ + \lambda \| f \|_K^2.     (13)

It turns out that for both problems (12) and (13) the coefficients c_i in Eq. (4) can be found by solving a Quadratic Programming (QP) problem with linear constraints. The regularization parameter λ appears only in the linear constraints: the absolute values of the coefficients c_i are bounded by 2/λ. The QP problem is non-trivial since the size of the matrix of the quadratic form is equal to ℓ × ℓ and the matrix is dense. A number of algorithms for training SVM have been proposed: some are based on a decomposition approach where the QP problem is attacked by solving a sequence of smaller QP problems (Osuna et al., 1997), others on sequential updates of the solution (Platt, 1998).

A remarkable property of SVMs is that loss functions (7) and (6) lead to sparse solutions. This means that, unlike in the case of Regularization Networks, typically only a small fraction of the coefficients c_i in Eq. (4) are nonzero. The data points x_i associated with the nonzero c_i are called support vectors. If all data points which are not support vectors were to be discarded from the training set the same solution would be found. In this context, an interesting perspective on SVM is to consider its information compression properties. The support vectors represent the most informative data points and compress the information contained in the training set: for the purpose of, say, classification only the support vectors need to be stored, while all other training examples can be discarded. This, along with some geometric properties of SVMs such as the interpretation of the RKHS norm of their solution as the inverse of the margin (Vapnik, 1998), is a key property of SVM and might explain why this technique works well in many practical applications.
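The sparsity property can be seen directly with an off-the-shelf SVM implementation. The sketch below (not part of the paper) uses scikit-learn's SVC: only a fraction of the training points are returned as support vectors, and refitting on the support vectors alone reproduces essentially the same classifier; all parameter values are arbitrary.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + [2, 2], rng.randn(100, 2) - [2, 2]])
y = np.hstack([np.ones(100), -np.ones(100)])

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X), "training points")

# Refitting on the support vectors alone gives (essentially) the same classifier.
clf_sv = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X[clf.support_], y[clf.support_])
X_test = rng.randn(50, 2)
print(np.mean(clf.predict(X_test) == clf_sv.predict(X_test)))   # fraction of agreement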

3.4. Kernels and Data Representations

We conclude this short review with a discussion on kernels and data representations. A key issue when using the learning techniques discussed above is the choice of the kernel K in Eq. (4). The kernel K(x_i, x_j) defines a dot product between the projections of the two inputs x_i and x_j in the feature space (the features being {φ_1(x), φ_2(x), ..., φ_N(x)} with N the dimensionality of the RKHS). Therefore its choice is closely related to the choice of the "effective" representation of the data, i.e. the image representation in a vision application.

The problem of choosing the kernel for the machines discussed here, and more generally the issue of finding appropriate data representations for learning, is an important and open one. The theory does not provide a general method for finding "good" data representations, but suggests representations that lead to "simple" solutions. Although there is not a general solution to this problem, a number of recent experimental and theoretical works provide insights for specific applications (Evgeniou et al., 2000; Jaakkola and Haussler, 1998; Mohan, 1999; Vapnik, 1998).

References

Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. 1993. Scale-sensitive dimensions, uniform convergence, and learnability. Symposium on Foundations of Computer Science.

Cortes, C. and Vapnik, V. 1995. Support vector networks. Machine Learning, 20:1–25.

Devroye, L., Gyorfi, L., and Lugosi, G. 1996. A Probabilistic Theory of Pattern Recognition, No. 31 in Applications of Mathematics. Springer: New York.

Evgeniou, T., Pontil, M., Papageorgiou, C., and Poggio, T. 2000. Image representations for object detection using kernel classifiers. In Proceedings ACCV, Taiwan, to appear.

Evgeniou, T., Pontil, M., and Poggio, T. 1999. A unified framework for Regularization Networks and Support Vector Machines. A.I. Memo No. 1654, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.

Ezzat, T. and Poggio, T. 1996. Facial analysis and synthesis using image-based models. In Face and Gesture Recognition, pp. 116–121.

Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Computation, 7:219–269.

Jaakkola, T. and Haussler, D. 1998. Probabilistic kernel regression models. In Proc. of Neural Information Processing Conference.

Kearns, M. and Shapire, R. 1994. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and Systems Sciences, 48(3):464–497.

Mohan, A. 1999. Robust object detection in images by components. Master's Thesis, Massachusetts Institute of Technology.

Osuna, E., Freund, R., and Girosi, F. 1997. An improved training algorithm for support vector machines. In IEEE Workshop on Neural Networks and Signal Processing, Amelia Island, FL.

Papageorgiou, C., Oren, M., and Poggio, T. 1998. A general framework for object detection. In Proceedings of the International Conference on Computer Vision, Bombay, India.

Platt, J.C. 1998. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research.

Tikhonov, A.N. and Arsenin, V.Y. 1977. Solutions of Ill-posed Problems. Washington, D.C.: W.H. Winston.

Vapnik, V.N. 1998. Statistical Learning Theory. Wiley: New York.

Vapnik, V.N. and Chervonenkis, A.Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. and its Applications, 17(2):264–280.

Wahba, G. 1990. Spline Models for Observational Data. Vol. 59, Series in Applied Mathematics: Philadelphia.


Statistical Learning and Kernel Methods
Bernhard Schölkopf
Microsoft Research Limited, 1 Guildhall Street, Cambridge CB2 3NH, UK
[email protected]
http://research.microsoft.com/~bsc
February 29, 2000
Technical Report MSR-TR-2000-23
Microsoft Research, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052

Lecture notes for a course to be taught at the Interdisciplinary College 2000, Günne, Germany, March 2000.


Abstract

We briefly describe the main ideas of statistical learning theory, support vector machines, and kernel feature spaces.

Contents

1 An Introductory Example
2 Learning Pattern Recognition from Examples
3 Hyperplane Classifiers
4 Support Vector Classifiers
5 Support Vector Regression
6 Further Developments
7 Kernels
8 Representing Similarities in Linear Spaces
9 Examples of Kernels
10 Representing Dissimilarities in Linear Spaces


1 An Introductory ExampleSuppose we are given empirical data(x1; y1); : : : ; (xm; ym) 2 X � f�1g: (1)Here, the domain X is some nonempty set that the patterns xi are taken from;the yi are called labels or targets .Unless stated otherwise, indices i and j will always be understood to runover the training set, i.e. i; j = 1; : : : ;m.Note that we have not made any assumptions on the domain X other thanit being a set. In order to study the problem of learning, we need additionalstructure. In learning, we want to be able to generalize to unseen data points.In the case of pattern recognition, this means that given some new patternx 2 X , we want to predict the corresponding y 2 f�1g. By this we mean,loosely speaking, that we choose y such that (x; y) is in some sense similar tothe training examples. To this end, we need similarity measures in X and inf�1g. The latter is easy, as two target values can only be identical or di�erent.For the former, we require a similarity measurek : X �X ! R;(x; x0) 7! k(x; x0); (2)i.e., a function that, given two examples x and x0, returns a real number char-acterizing their similarity. For reasons that will become clear later, the functionk is called a kernel [13, 1, 8].A type of similarity measure that is of particular mathematical appeal aredot products. For instance, given two vectors x;x0 2 RN , the canonical dotproduct is de�ned as (x � x0) := NXi=1(x)i(x0)i: (3)Here, (x)i denotes the i-th entry of x.The geometrical interpretation of this dot product is that it computes thecosine of the angle between the vectors x and x0, provided they are normalizedto length 1. Moreover, it allows computation of the length of a vector x asp(x � x), and of the distance between two vectors as the length of the di�erencevector. Therefore, being able to compute dot products amounts to being ableto carry out all geometrical constructions that can be formulated in terms ofangles, lenghts and distances.Note, however, that we have not made the assumption that the patterns livein a dot product space. In order to be able to use a dot product as a similaritymeasure, we therefore �rst need to embed them into some dot product space F ,which need not be identical to RN . To this end, we use a map� : X ! Fx 7! x: (4)1


The space F is called a feature space. To summarize, embedding the data intoF has three bene�ts.1. It lets us de�ne a similarity measure from the dot product in F ,k(x; x0) := (x � x0) = (�(x) � �(x0)): (5)2. It allows us to deal with the patterns geometrically, and thus lets us studylearning algorithm using linear algebra and analytic geometry.3. The freedom to choose the mapping � will enable us to design a largevariety of learning algorithms. For instance, consider a situation where theinputs already live in a dot product space. In that case, we could directlyde�ne a similarity measure as the dot product. However, we might stillchoose to �rst apply a nonlinear map � to change the representation intoone that is more suitable for a given problem and learning algorithm.We are now in the position to describe a pattern recognition learning algo-rithm that is arguably one of the simplest possible. The basic idea is to computethe means of the two classes in feature space,c1 = 1m1 Xfi:yi=+1gxi; (6)c2 = 1m2 Xfi:yi=�1gxi; (7)where m1 and m2 are the number of examples with positive and negative labels,respectively. We then assign a new point x to the class whose mean is closer toit. This geometrical construction can be formulated in terms of dot products.Half-way in between c1 and c2 lies the point c := (c1 + c2)=2. We compute theclass of x by checking whether the vector connecting c and x encloses an anglesmaller than �=2 with the vector w := c1 � c2 connecting the class means, inother words y = sgn ((x� c) �w)y = sgn ((x� (c1 + c2)=2) � (c1 � c2))= sgn ((x � c1)� (x � c2) + b): (8)Here, we have de�ned the o�setb := 12 �kc2k2 � kc1k2� : (9)It will prove instructive to rewrite this expression in terms of the patternsxi in the input domain X . To this end, note that we do not have a dot productin X , all we have is the similarity measure k (cf. (5)). Therefore, we need to2


rewrite everything in terms of the kernel k evaluated on input patterns. To thisend, substitute (6) and (7) into (8) to get the decision functiony = sgn 0@ 1m1 Xfi:yi=+1g(x � xi)� 1m2 Xfi:yi=�1g(x � xi) + b1A= sgn 0@ 1m1 Xfi:yi=+1g k(x; xi)� 1m2 Xfi:yi=�1g k(x; xi) + b1A : (10)Similarly, the o�set becomesb := 12 0@ 1m22 Xf(i;j):yi=yj=�1g k(xi; xj)� 1m21 Xf(i;j):yi=yj=+1g k(xi; xj)1A : (11)Let us consider one well-known special case of this type of classi�er. Assumethat the class means have the same distance to the origin (hence b = 0), andthat k can be viewed as a density, i.e. it is positive and has integral 1,ZX k(x; x0)dx = 1 for all x0 2 X : (12)In order to state this assumption, we have to require that we can de�ne anintegral on X .If the above holds true, then (10) corresponds to the so-called Bayes deci-sion boundary separating the two classes, subject to the assumption that thetwo classes were generated from two probability distributions that are correctlyestimated by the Parzen windows estimators of the two classes,p1(x) := 1m1 Xfi:yi=+1g k(x; xi) (13)p2(x) := 1m2 Xfi:yi=�1g k(x; xi): (14)Given some point x, the label is then simply computed by checking which of thetwo, p1(x) or p2(x), is larger, which directly leads to (10). Note that this decisionis the best we can do if we have no prior information about the probabilities ofthe two classes.The classi�er (10) is quite close to the types of learning machines that wewill be interested in. It is linear in the feature space, while in the input domain,it is represented by a kernel expansion. It is example-based in the sense thatthe kernels are centered on the training examples, i.e. one of the two argumentsof the kernels is always a training example. The main point where the moresophisticated techniques to be discussed later will deviate from (10) is in theselection of the examples that the kernels are centered on, and in the weightthat is put on the individual kernels in the decision function. Namely, it will no3


longer be the case that all training examples appear in the kernel expansion,and the weights of the kernels in the expansion will no longer be uniform. Inthe feature space representation, this statement corresponds to saying that wewill study all normal vectors w of decision hyperplanes that can be representedas linear combinations of the training examples. For instance, we might wantto remove the in uence of patterns that are very far away from the decisionboundary, either since we expect that they will not improve the generalizationerror of the decision function, or since we would like to reduce the computationalcost of evaluating the decision function (cf. (10)). The hyperplane will then onlydepend on a subset of training examples, called support vectors.2 Learning Pattern Recognition from ExamplesWith the above example in mind, let us now consider the problem of patternrecognition in a more formal setting [27, 28], following the introduction of [19].In two-class pattern recognition, we seek to estimate a functionf : X ! f�1g (15)based on input-output training data (1). We assume that the data were gen-erated independently from some unknown (but �xed) probability distributionP (x; y). Our goal is to learn a function that will correctly classify unseen exam-ples (x; y), i.e. we want f(x) = y for examples (x; y) that were also generatedfrom P (x; y).If we put no restriction on the class of functions that we choose our esti-mate f from, however, even a function which does well on the training data,e.g. by satisfying f(xi) = yi for all i = 1; : : : ;m, need not generalize well tounseen examples. To see this, note that for each function f and any test set(�x1; �y1); : : : ; (�x �m; �y �m) 2 RN �f�1g; satisfying f�x1; : : : ; �x �mg\fx1; : : : ; xmg = fg,there exists another function f� such that f�(xi) = f(xi) for all i = 1; : : : ;m,yet f�(�xi) 6= f(�xi) for all i = 1; : : : ; �m. As we are only given the training data,we have no means of selecting which of the two functions (and hence which ofthe completely di�erent sets of test label predictions) is preferable. Hence, onlyminimizing the training error (or empirical risk),Remp[f ] = 1m mXi=1 12 jf(xi)� yij; (16)does not imply a small test error (called risk), averaged over test examplesdrawn from the underlying distribution P (x; y),R[f ] = Z 12 jf(x)� yj dP (x; y): (17)Statistical learning theory [31, 27, 28, 29], or VC (Vapnik-Chervonenkis) theory,shows that it is imperative to restrict the class of functions that f is chosen4


from to one which has a capacity that is suitable for the amount of availabletraining data. VC theory provides bounds on the test error. The minimizationof these bounds, which depend on both the empirical risk and the capacity ofthe function class, leads to the principle of structural risk minimization [27].The best-known capacity concept of VC theory is the VC dimension, de�ned asthe largest number h of points that can be separated in all possible ways usingfunctions of the given class. An example of a VC bound is the following: ifh < m is the VC dimension of the class of functions that the learning machinecan implement, then for all functions of that class, with a probability of at least1� �, the bound R(�) � Remp(�) + �� hm; log(�)m � (18)holds, where the con�dence term � is de�ned as�� hm; log(�)m � =sh �log 2mh + 1�� log(�=4)m : (19)Tighter bounds can be formulated in terms of other concepts, such as the an-nealed VC entropy or the Growth function. These are usually considered to beharder to evaluate, but they play a fundamental role in the conceptual part ofVC theory [28]. Alternative capacity concepts that can be used to formulatebounds include the fat shattering dimension [2].The bound (18) deserves some further explanatory remarks. Suppose wewanted to learn a \dependency" where P (x; y) = P (x) � P (y), i.e. where thepattern x contains no information about the label y, with uniform P (y). Given atraining sample of �xed size, we can then surely come up with a learning machinewhich achieves zero training error (provided we have no examples contradictingeach other). However, in order to reproduce the random labellings, this machinewill necessarily require a large VC dimension h. Thus, the con�dence term(19), increasing monotonically with h, will be large, and the bound (18) willnot support possible hopes that due to the small training error, we shouldexpect a small test error. This makes it understandable how (18) can holdindependent of assumptions about the underlying distribution P (x; y): it alwaysholds (provided that h < m), but it does not always make a nontrivial prediction| a bound on an error rate becomes void if it is larger than the maximum errorrate. In order to get nontrivial predictions from (18), the function space mustbe restricted such that the capacity (e.g. VC dimension) is small enough (inrelation to the available amount of data).3 Hyperplane Classi�ersIn the present section, we shall describe a hyperplane learning algorithm thatcan be performed in a dot product space (such as the feature space that weintroduced previously). As described in the previous section, to design learning5


algorithms, one needs to come up with a class of functions whose capacity canbe computed.[32] and [30] considered the class of hyperplanes(w � x) + b = 0 w 2 RN ; b 2 R; (20)corresponding to decision functionsf(x) = sgn ((w � x) + b); (21)and proposed a learning algorithm for separable problems, termed the Gen-eralized Portrait, for constructing f from empirical data. It is based on twofacts. First, among all hyperplanes separating the data, there exists a uniqueone yielding the maximum margin of separation between the classes,maxw;b minfkx� xik : x 2 RN ; (w � x) + b = 0; i = 1; : : : ;mg: (22)Second, the capacity decreases with increasing margin.To construct this Optimal Hyperplane (cf. Figure 1), one solves the followingoptimization problem:minimize �(w) = 12kwk2 (23)subject to yi � ((w � xi) + b) � 1; i = 1; : : : ;m: (24)This constrained optimization problem is dealt with by introducing Lagrangemultipliers �i � 0 and a LagrangianL(w; b;�) = 12kwk2 � mXi=1 �i (yi � ((xi �w) + b)� 1) : (25)The Lagrangian L has to be minimized with respect to the primal variables wand b and maximized with respect to the dual variables �i (i.e. a saddle pointhas to be found). Let us try to get some intuition for this. If a constraint (24)is violated, then yi � ((w � xi) + b)� 1 < 0, in which case L can be increased byincreasing the corresponding �i. At the same time, w and b will have to changesuch that L decreases. To prevent ��i (yi � ((w � xi) + b)� 1) from becomingarbitrarily large, the change in w and b will ensure that, provided the problem isseparable, the constraint will eventually be satis�ed. Similarly, one can under-stand that for all constraints which are not precisely met as equalities, i.e. forwhich yi � ((w �xi)+ b)�1 > 0, the corresponding �i must be 0: this is the valueof �i that maximizes L. The latter is the statement of the Karush-Kuhn-Tuckercomplementarity conditions of optimization theory [6].The condition that at the saddle point, the derivatives of L with respect tothe primal variables must vanish,@@bL(w; b;�) = 0; @@wL(w; b;�) = 0; (26)6



Figure 1: A binary classi�cation toy problem: separate balls from diamonds.The optimal hyperplane is orthogonal to the shortest line connecting the convexhulls of the two classes (dotted), and intersects it half-way between the twoclasses. The problem being separable, there exists a weight vector w and athreshold b such that yi � ((w � xi) + b) > 0 (i = 1; : : : ;m). Rescaling w and bsuch that the point(s) closest to the hyperplane satisfy j(w � xi) + bj = 1, weobtain a canonical form (w; b) of the hyperplane, satisfying yi � ((w �xi)+b) � 1.Note that in this case, the margin, measured perpendicularly to the hyperplane,equals 2=kwk. This can be seen by considering two points x1;x2 on oppositesides of the margin, i.e. (w �x1) + b = 1; (w �x2) + b = �1, and projecting themonto the hyperplane normal vector w=kwk.leads to mXi=1 �iyi = 0 (27)and w = mXi=1 �iyixi: (28)The solution vector thus has an expansion in terms of a subset of the trainingpatterns, namely those patterns whose �i is non-zero, called Support Vectors.By the Karush-Kuhn-Tucker complementarity conditions�i � [yi((xi �w) + b)� 1] = 0; i = 1; : : : ;m; (29)the Support Vectors lie on the margin (cf. Figure 1). All remaining examples ofthe training set are irrelevant: their constraint (24) does not play a role in theoptimization, and they do not appear in the expansion (28). This nicely capturesour intuition of the problem: as the hyperplane (cf. Figure 1) is completely7


determined by the patterns closest to it, the solution should not depend on theother examples.By substituting (27) and (28) into L, one eliminates the primal variables andarrives at the Wolfe dual of the optimization problem [e.g. 6]: �nd multipliers�i which maximize W (�) = mXi=1 �i � 12 mXi;j=1�i�jyiyj(xi � xj) (30)subject to �i � 0; i = 1; : : : ;m; and mXi=1 �iyi = 0: (31)The hyperplane decision function can thus be written asf(x) = sgn mXi=1 yi�i � (x � xi) + b! (32)where b is computed using (29).The structure of the optimization problem closely resembles those that typ-ically arise in Lagrange's formulation of mechanics. Also there, often only asubset of the constraints become active. For instance, if we keep a ball in a box,then it will typically roll into one of the corners. The constraints correspondingto the walls which are not touched by the ball are irrelevant, the walls couldjust as well be removed.Seen in this light, it is not too surprising that it is possible to give a me-chanical interpretation of optimal margin hyperplanes [9]: If we assume thateach support vector xi exerts a perpendicular force of size �i and sign yi ona solid plane sheet lying along the hyperplane, then the solution satis�es therequirements of mechanical stability. The constraint (27) states that the forceson the sheet sum to zero; and (28) implies that the torques also sum to zero,via Pi xi � yi�i �w=kwk = w�w=kwk = 0.There are theoretical arguments supporting the good generalization perfor-mance of the optimal hyperplane ([31, 27, 35, 4]). In addition, it is computation-ally attractive, since it can be constructed by solving a quadratic programmingproblem.4 Support Vector Classi�ersWe now have all the tools to describe support vector machines [28, 19, 26].Everything in the last section was formulated in a dot product space. We thinkof this space as the feature space F described in Section 1. To express theformulas in terms of the input patterns living in X , we thus need to employ (5),which expresses the dot product of bold face feature vectors x;x0 in terms ofthe kernel k evaluated on input patterns x; x0,k(x; x0) = (x � x0): (33)8
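As an illustration (not part of these notes), the sketch below fits scikit-learn's SVC with an RBF kernel and then reconstructs the kernelized decision function of the form f(x) = sgn(sum_i y_i alpha_i k(x, x_i) + b) (cf. Eq. (34) below) from the fitted dual coefficients. The attributes dual_coef_ (storing y_i alpha_i for the support vectors), support_vectors_ and intercept_ are standard scikit-learn attributes; the data and parameter values are invented.

import numpy as np
from sklearn.svm import SVC

def rbf(A, B, gamma):
    # k(x, x') = exp(-gamma * ||x - x'||^2), evaluated for all pairs of rows
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(60, 2) + 2, rng.randn(60, 2) - 2])
y = np.hstack([np.ones(60), -np.ones(60)])

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

X_test = rng.randn(10, 2)
K = rbf(X_test, clf.support_vectors_, gamma)           # k(x, x_i) over support vectors only
decision = K @ clf.dual_coef_.ravel() + clf.intercept_[0]
print(np.allclose(decision, clf.decision_function(X_test)))     # True
print(np.allclose(np.sign(decision), clf.predict(X_test)))      # True (away from the boundary)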



❍Figure 2: The idea of SV machines: map the training data into a higher-dimensional feature space via �, and construct a separating hyperplane withmaximum margin there. This yields a nonlinear decision boundary in inputspace. By the use of a kernel function (2), it is possible to compute the separat-ing hyperplane without explicitly carrying out the map into the feature space.This can be done since all feature vectors only occured in dot products. Theweight vector (cf. (28)) then becomes an expansion in feature space, and willthus typically no more correspond to the image of a single vector from inputspace. We thus obtain decision functions of the more general form (cf. (32))f(x) = sgn mXi=1 yi�i � (�(x) ��(xi)) + b!= sgn mXi=1 yi�i � k(x; xi) + b! ; (34)and the following quadratic program (cf. (30)):maximize W (�) = mXi=1 �i � 12 mXi;j=1�i�jyiyjk(xi; xj) (35)subject to �i � 0; i = 1; : : : ;m; and mXi=1 �iyi = 0: (36)In practice, a separating hyperplane may not exist, e.g. if a high noise levelcauses a large overlap of the classes. To allow for the possibility of examplesviolating (24), one introduces slack variables [10, 28, 22]�i � 0; i = 1; : : : ;m (37)in order to relax the constraints toyi � ((w � xi) + b) � 1� �i; i = 1; : : : ;m: (38)A classi�er which generalizes well is then found by controlling both the classi�ercapacity (via kwk) and the sum of the slacksPi �i. The latter is done as it can9


Figure 3: Example of a Support Vector classi�er found by using a radial basisfunction kernel k(x; x0) = exp(�kx�x0k2). Both coordinate axes range from -1to +1. Circles and disks are two classes of training examples; the middle line isthe decision surface; the outer lines precisely meet the constraint (24). Note thatthe Support Vectors found by the algorithm (marked by extra circles) are notcenters of clusters, but examples which are critical for the given classi�cationtask. Grey values code the modulus of the argumentPmi=1 yi�i � k(x; xi) + b ofthe decision function (34).)be shown to provide an upper bound on the number of training errors whichleads to a convex optimization problem.One possible realization of a soft margin classi�er is minimizing the objectivefunction �(w; �) = 12kwk2 + C mXi=1 �i (39)subject to the constraints (37) and (38), for some value of the constant C > 0determining the trade-o�. Here and below, we use boldface greek letters asa shorthand for corresponding vectors � = (�1; : : : ; �m). Incorporating kernels,and rewriting it in terms of Lagrange multipliers, this again leads to the problemof maximizing (35), subject to the constraints0 � �i � C; i = 1; : : : ;m; and mXi=1 �iyi = 0: (40)The only di�erence from the separable case is the upper bound C on the La-grange multipliers �i. This way, the in uence of the individual patterns (which10



Figure 4: In SV regression, a tube with radius " is �tted to the data. Thetrade-o� between model complexity and points lying outside of the tube (withpositive slack variables �) is determined by minimizing (46).could be outliers) gets limited. As above, the solution takes the form (34). Thethreshold b can be computed by exploiting the fact that for all SVs xi with�i < C, the slack variable �i is zero (this again follows from the Karush-Kuhn-Tucker complementarity conditions), and hencemXj=1 yj�j � k(xi; xj) + b = yi: (41)Another possible realization of a soft margin variant of the optimal hyper-plane uses the �-parametrization [22]. In it, the paramter C is replaced by aparameter � 2 [0; 1] which can be shown to lower and upper bound the numberof examples that will be SVs and that will come to lie on the wrong side of thehyperplane, respectively. It uses a primal objective function with the error term1�mPi �i � �, and separation constraintsyi � ((w � xi) + b) � �� �i; i = 1; : : : ;m: (42)The margin parameter � is a variable of the optimization problem. The dualcan be shown to consist of maximizing the quadratic part of (35), subject to0 � �i � 1=(�m), Pi �iyi = 0 and the additional constraint Pi �i = 1.5 Support Vector RegressionThe concept of the margin is speci�c to pattern recognition. To generalizethe SV algorithm to regression estimation [28], an analogue of the margin isconstructed in the space of the target values y (note that in regression, we havey 2 R) by using Vapnik's "-insensitive loss function (Figure 4)jy � f(x)j" := maxf0; jy � f(x)j � "g: (43)11
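A short illustration of the epsilon-tube (not part of these notes), using scikit-learn's SVR rather than the formulation above: training points whose residual exceeds epsilon need a nonzero slack variable and become support vectors; C, epsilon and the data are arbitrary.

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 80))[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.randn(80)

eps = 0.1
reg = SVR(kernel="rbf", C=10.0, epsilon=eps, gamma=0.5).fit(X, y)
outside = np.abs(y - reg.predict(X)) > eps + 1e-6      # points with nonzero slack
print(outside.sum(), "of", len(X), "training points lie outside the epsilon-tube")
print(len(reg.support_), "support vectors")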


To estimate a linear regressionf(x) = (w � x) + b (44)with precision ", one minimizes12kwk2 + C mXi=1 jyi � f(xi)j": (45)Written as a constrained optimization problem, this reads:minimize �(w; �; ��) = 12kwk2 + C mXi=1(�i + ��i ) (46)subject to ((w � xi) + b)� yi � "+ �i (47)yi � ((w � xi) + b) � "+ ��i (48)�i; ��i � 0 (49)for all i = 1; : : : ;m. Note that according to (47) and (48), any error smaller than" does not require a nonzero �i or ��i , and hence does not enter the objectivefunction (46).Generalization to kernel-based regression estimation is carried out in com-plete analogy to the case of pattern recognition. Introducing Lagrange multi-pliers, one thus arrives at the following optimization problem: for C > 0; " � 0chosen a priori,maximize W (�;��) = �" mXi=1(��i + �i) + mXi=1(��i � �i)yi�12 mXi;j=1(��i � �i)(��j � �j)k(xi; xj) (50)subject to 0 � �i; ��i � C; i = 1; : : : ;m; and mXi=1(�i � ��i ) = 0:(51)The regression estimate takes the formf(x) = mXi=1(��i � �i)k(xi; x) + b; (52)where b is computed using the fact that (47) becomes an equality with �i = 0 if0 < �i < C, and (48) becomes an equality with ��i = 0 if 0 < ��i < C.Several extensions of this algorithm are possible. From an abstract pointof view, we just need some target function which depends on the vector (w; �)(cf. (46)). There are multiple degrees of freedom for constructing it, includingsome freedom how to penalize, or regularize, di�erent parts of the vector, andsome freedom how to use the kernel trick. For instance, more general loss12



Figure 5: Architecture of SV machines. The input x and the Support Vectors xiare nonlinearly mapped (by �) into a feature space F , where dot products arecomputed. By the use of the kernel k, these two layers are in practice computedin one single step. The results are linearly combined by weights �i, found bysolving a quadratic program (in pattern recognition, �i = yi�i; in regressionestimation, �i = ��i ��i). The linear combination is fed into the function � (inpattern recognition, �(x) = sgn (x+ b); in regression estimation, �(x) = x+ b).functions can be used for �, leading to problems that can still be solved e�ciently[24]. Moreover, norms other than the 2-norm k:k can be used to regularize thesolution. Yet another example is that polynomial kernels can be incorporatedwhich consist of multiple layers, such that the �rst layer only computes productswithin certain speci�ed subsets of the entries of w [17].Finally, the algorithm can be modi�ed such that " need not be speci�ed apriori. Instead, one speci�es an upper bound 0 � � � 1 on the fraction ofpoints allowed to lie outside the tube (asymptotically, the number of SVs) andthe corresponding " is computed automatically. This is achieved by using asprimal objective function12kwk2 + C �m"+ mXi=1 jyi � f(xi)j"! (53)instead of (45), and treating " � 0 as a parameter that we minimize over [22].13
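The nu-parametrized regression variant mentioned above is available in scikit-learn as NuSVR. The sketch below (not part of these notes) fits it for several values of nu and prints the fraction of support vectors, which the text describes as being controlled by nu while the tube width epsilon is found automatically; all parameter values are illustrative.

import numpy as np
from sklearn.svm import NuSVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 100))[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.randn(100)

# nu upper-bounds the fraction of points outside the tube and lower-bounds the
# fraction of support vectors; epsilon is not specified a priori.
for nu in (0.1, 0.3, 0.6):
    reg = NuSVR(nu=nu, C=10.0, kernel="rbf", gamma=0.5).fit(X, y)
    print(nu, len(reg.support_) / len(X))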


6 Further DevelopmentsHaving described the basics of SV machines, we now summarize some empirical�ndings and theoretical developments which were to follow.By the use of kernels, the optimal margin classi�er was turned into a classi�erwhich became a serious competitor of high-performance classi�ers. Surprisingly,it was noticed that when di�erent kernel functions are used in SV machines, theyempirically lead to very similar classi�cation accuracies and SV sets [18]. In thissense, the SV set seems to characterize (or compress) the given task in a mannerwhich up to a certain degree is independent of the type of kernel (i.e. the typeof classi�er) used.Initial work at AT&T Bell Labs focused on OCR (optical character recog-nition), a problem where the two main issues are classi�cation accuracy andclassi�cation speed. Consequently, some e�ort went into the improvement ofSV machines on these issues, leading to the Virtual SV method for incorporat-ing prior knowledge about transformation invariances by transforming SVs, andthe Reduced Set method for speeding up classi�cation. This way, SV machinesbecame competitive with the best available classi�ers on both OCR and objectrecognition tasks [7, 9, 17].Another initial weakness of SV machines, less apparent in OCR applicationswhich are characterized by low noise levels, was that the size of the quadraticprogramming problem scaled with the number of Support Vectors. This wasdue to the fact that in (35), the quadratic part contained at least all SVs |the common practice was to extract the SVs by going through the training datain chunks while regularly testing for the possibility that some of the patternsthat were initially not identi�ed as SVs turn out to become SVs at a later stage(note that without chunking, the size of the matrix would be m�m, where mis the number of all training examples). What happens if we have a high-noiseproblem? In this case, many of the slack variables �i will become nonzero, andall the corresponding examples will become SVs. For this case, a decompositionalgorithm was proposed [14], which is based on the observation that not onlycan we leave out the non-SV examples (i.e. the xi with �i = 0) from the currentchunk, but also some of the SVs, especially those that hit the upper boundary(i.e. �i = C). In fact, one can use chunks which do not even contain all SVs,and maximize over the corresponding sub-problems. SMO [15, 25, 20] exploresan extreme case, where the sub-problems are chosen so small that one cansolve them analytically. Several public domain SV packages and optimizers arelisted on the web page http://www.kernel-machines.org. For more details onthe optimization problem, see [19].On the theoretical side, the least understood part of the SV algorithm ini-tially was the precise role of the kernel, and how a certain kernel choice wouldin uence the generalization ability. In that respect, the connection to regular-ization theory provided some insight. For kernel-based function expansions, onecan show that given a regularization operator P mapping the functions of thelearning machine into some dot product space, the problem of minimizing the14


regularized risk Rreg [f ] = Remp[f ] + �2 kPfk2 (54)(with a regularization parameter � � 0) can be written as a constrained opti-mization problem. For particular choices of the loss function, it further reducesto a SV type quadratic programming problem. The latter thus is not speci�c toSV machines, but is common to a much wider class of approaches. What getslost in the general case, however, is the fact that the solution can usually be ex-pressed in terms of a small number of SVs. This speci�c feature of SV machinesis due to the fact that the type of regularization and the class of functions thatthe estimate is chosen from are intimately related [11, 23]: the SV algorithm isequivalent to minimizing the regularized risk on the set of functionsf(x) =Xi �ik(xi; x) + b; (55)provided that k and P are interrelated byk(xi; xj) = ((Pk)(xi; :) � (Pk)(xj ; :)) : (56)To this end, k is chosen as a Green's function of P �P , for in that case, the righthand side of (56) equals(k(xi; :) � (P �Pk)(xj ; :)) = (k(xi; :) � �xj (:)) = k(xi; xj): (57)For instance, a Gaussian RBF kernel thus corresponds to regularization with afunctional containing a speci�c di�erential operator.In SV machines, the kernel thus plays a dual role: �rstly, it determines theclass of functions (55) that the solution is taken from; secondly, via (56), thekernel determines the type of regularization that is used.We conclude this section by noticing that the kernel method for computingdot products in feature spaces is not restricted to SV machines. Indeed, it hasbeen pointed out that it can be used to develop nonlinear generalizations of anyalgorithm that can be cast in terms of dot products, such as principal componentanalysis [21], and a number of developments have followed this example.7 KernelsWe now take a closer look at the issue of the similarity measure, or kernel, k.In this section, we think of X as a subset of the vector space RN , (N 2 N),endowed with the canonical dot product (3).7.1 Product FeaturesSuppose we are given patterns x 2 RN where most information is contained inthe d-th order products (monomials) of entries [x]j of x,[x]j1 � : : : � [x]jd ; (58)15
where j1, …, jd ∈ {1, …, N}. In that case, we might prefer to extract these product features, and work in the feature space F of all products of d entries. In visual recognition problems, where images are often represented as vectors, this would amount to extracting features which are products of individual pixels.

For instance, in R², we can collect all monomial feature extractors of degree 2 in the nonlinear map

Φ : R² → F = R³    (59)
([x]₁, [x]₂) ↦ ([x]₁², [x]₂², [x]₁[x]₂).    (60)

This approach works fine for small toy examples, but it fails for realistically sized problems: for N-dimensional input patterns, there exist

N_F = (N + d − 1)! / (d!(N − 1)!)    (61)

different monomials (58), comprising a feature space F of dimensionality N_F. For instance, already 16 × 16 pixel input images and a monomial degree d = 5 yield a dimensionality of 10^10.

In certain cases described below, there exists, however, a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into them: by means of kernels nonlinear in the input space R^N. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimensionality.

The following section describes how dot products in polynomial feature spaces can be computed efficiently.

7.2 Polynomial Feature Spaces Induced by Kernels

In order to compute dot products of the form (Φ(x) · Φ(x′)), we employ kernel representations of the form

k(x, x′) = (Φ(x) · Φ(x′)),    (62)

which allow us to compute the value of the dot product in F without having to carry out the map Φ. This method was used by [8] to extend the Generalized Portrait hyperplane classifier of [31] to nonlinear Support Vector machines. In [1], F is termed the linearization space, and used in the context of the potential function classification method to express the dot product between elements of F in terms of elements of the input space.

What does k look like for the case of polynomial features? We start by giving an example [28] for N = d = 2. For the map

C₂ : ([x]₁, [x]₂) ↦ ([x]₁², [x]₂², [x]₁[x]₂, [x]₂[x]₁),    (63)

dot products in F take the form

(C₂(x) · C₂(x′)) = [x]₁²[x′]₁² + [x]₂²[x′]₂² + 2[x]₁[x]₂[x′]₁[x′]₂ = (x · x′)²,    (64)
i.e. the desired kernel k is simply the square of the dot product in input space. The same works for arbitrary N, d ∈ ℕ [8]: as a straightforward generalization of a result proved in the context of polynomial approximation [16, Lemma 2.1], we have:

Proposition 1. Define C_d to map x ∈ R^N to the vector C_d(x) whose entries are all possible d-th degree ordered products of the entries of x. Then the corresponding kernel computing the dot product of vectors mapped by C_d is

k(x, x′) = (C_d(x) · C_d(x′)) = (x · x′)^d.    (65)

Proof. We directly compute

(C_d(x) · C_d(x′)) = Σ_{j1,…,jd=1}^{N} [x]_{j1} · … · [x]_{jd} · [x′]_{j1} · … · [x′]_{jd}    (66)
 = (Σ_{j=1}^{N} [x]_j · [x′]_j)^d = (x · x′)^d.    (67)

Instead of ordered products, we can use unordered ones to obtain a map Φ_d which yields the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in C_d by scaling the respective entries of Φ_d with the square roots of their numbers of occurrence. Then, by this definition of Φ_d, and (65),

(Φ_d(x) · Φ_d(x′)) = (C_d(x) · C_d(x′)) = (x · x′)^d.    (68)

For instance, if n of the j_i in (58) are equal, and the remaining ones are different, then the coefficient in the corresponding component of Φ_d is √((d − n + 1)!) [for the general case, cf. 23]. For Φ₂, this simply means that [28]

Φ₂(x) = ([x]₁², [x]₂², √2 [x]₁[x]₂).    (69)

If x represents an image with the entries being pixel values, we can use the kernel (x · x′)^d to work in the space spanned by products of any d pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern Φ_d(x). Using kernels of the form (65), we take into account higher-order statistics without the combinatorial explosion (cf. (61)) of time and memory complexity which goes along already with moderately high N and d.

To conclude this section, note that it is possible to modify (65) such that it maps into the space of all monomials up to degree d, defining [28]

k(x, x′) = ((x · x′) + 1)^d.    (70)
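As a small illustration (not part of the original text), the identity (68)/(65) and the dimensionality count (61) can be checked numerically; the random data, seed, and helper names below are our own choices.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def phi2(x):
    """Unordered degree-2 monomial map (69): ([x]1^2, [x]2^2, sqrt(2)*[x]1*[x]2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, xp = rng.normal(size=2), rng.normal(size=2)

# Eq. (68): the feature-space dot product equals (x . x')^2.
print(np.isclose(phi2(x) @ phi2(xp), (x @ xp) ** 2))   # True

# Eq. (61): number of degree-d monomials for N-dimensional inputs.
N, d = 16 * 16, 5                  # 16 x 16 pixel images, monomial degree 5
print(math.comb(N + d - 1, d))     # roughly 10^10, as quoted in the text
```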
8 Representing Similarities in Linear Spaces

In what follows, we will look at things the other way round, and start with the kernel. Given some kernel function, can we construct a feature space such that the kernel computes the dot product in that feature space? This question has been brought to the attention of the machine learning community by [1, 8, 28]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the functional analytic theory of kernels is [5]; indeed, a large part of the material in the present section is based on that work.

There is one more aspect in which this section differs from the previous one: the latter dealt with vectorial data. The results in the current section, in contrast, hold for data drawn from domains which need no additional structure other than them being nonempty sets X. This generalizes kernel learning algorithms to a large number of situations where a vectorial representation is not readily available [17, 12, 34].

We start with some basic definitions and results.

Definition 2 (Gram matrix). Given a kernel k and patterns x₁, …, x_m ∈ X, the m × m matrix

K := (k(x_i, x_j))_{ij}    (71)

is called the Gram matrix (or kernel matrix) of k with respect to x₁, …, x_m.

Definition 3 (Positive matrix). An m × m matrix K_{ij} satisfying

Σ_{i,j} c_i c̄_j K_{ij} ≥ 0    (72)

for all c_i ∈ ℂ is called positive.¹

Definition 4 ((Positive definite) kernel). Let X be a nonempty set. A function k : X × X → ℂ which for all m ∈ ℕ, x_i ∈ X gives rise to a positive Gram matrix is called a positive definite kernel. Often, we shall refer to it simply as a kernel.

The term kernel stems from the first use of this type of function in the study of integral operators. A function k which gives rise to an operator T via

(Tf)(x) = ∫_X k(x, x′) f(x′) dx′    (73)

is called the kernel of T. One might argue that the term positive definite kernel is slightly misleading. In matrix theory, the term definite is usually used to denote the case where equality in (72) only occurs if c₁ = … = c_m = 0. Simply using the term positive kernel, on the other hand, could be confused with a kernel whose values are positive.

¹ The bar in c̄_j denotes complex conjugation.
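The following sketch (ours, using a Gaussian RBF kernel of the kind introduced later in (89)) builds a Gram matrix (71) for a handful of random patterns and checks the positivity condition (72) through the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(x, xp, sigma=1.0):
    """Gaussian RBF kernel, a standard positive definite kernel."""
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

# Gram matrix (71) of the kernel with respect to m random patterns.
X = rng.normal(size=(8, 3))
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# Definition 3: for a real symmetric matrix, positivity of the quadratic
# form (72) is equivalent to all eigenvalues being nonnegative.
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True (up to round-off)

# Equivalently, c' K c >= 0 for an arbitrary real coefficient vector c.
c = rng.normal(size=8)
print(c @ K @ c >= -1e-10)                     # True
```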
In the literature, a number of different terms are used for positive definite kernels, such as reproducing kernel, Mercer kernel, or support vector kernel.

The definitions for (positive definite) kernels and positive matrices differ only in the fact that in the former case, we are free to choose the points on which the kernel is evaluated.

Positive definiteness implies positivity on the diagonal,

k(x₁, x₁) ≥ 0 for all x₁ ∈ X    (74)

(use m = 1 in (72)), and symmetry, i.e.

k(x_i, x_j) = k̄(x_j, x_i).    (75)

Note that in the complex-valued case, our definition of symmetry includes complex conjugation, depicted by the bar. The definition of symmetry of matrices is analogous, i.e. K_{ij} = K̄_{ji}.

Obviously, real-valued kernels, which are what we will mainly be concerned with, are contained in the above definition as a special case, since we did not require that the kernel take values in ℂ \ ℝ. However, it is not sufficient to require that (72) hold for real coefficients c_i. If we want to get away with real coefficients only, we additionally have to require that the kernel be symmetric,

k(x_i, x_j) = k(x_j, x_i).    (76)

It can be shown that whenever k is a (complex-valued) positive definite kernel, its real part is a (real-valued) positive definite kernel.

Kernels can be regarded as generalized dot products. Indeed, any dot product can be shown to be a kernel; however, linearity does not carry over from dot products to general kernels. Another property of dot products, the Cauchy-Schwarz inequality, does have a natural generalization to kernels:

Proposition 5. If k is a positive definite kernel, and x₁, x₂ ∈ X, then

|k(x₁, x₂)|² ≤ k(x₁, x₁) · k(x₂, x₂).    (77)

Proof. For sake of brevity, we give a non-elementary proof using some basic facts of linear algebra. The 2 × 2 Gram matrix with entries K_{ij} = k(x_i, x_j) is positive. Hence both its eigenvalues are nonnegative, and so is their product, K's determinant, i.e.

0 ≤ K₁₁K₂₂ − K₁₂K₂₁ = K₁₁K₂₂ − K₁₂K̄₁₂ = K₁₁K₂₂ − |K₁₂|².    (78)

Substituting k(x_i, x_j) for K_{ij}, we get the desired inequality.
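A quick numerical check of Proposition 5 (ours; the kernel choice, d = 2, and the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_kernel(x, xp, d=2):
    """Homogeneous polynomial kernel (65), positive definite by Proposition 1."""
    return (x @ xp) ** d

x1, x2 = rng.normal(size=4), rng.normal(size=4)

# Proposition 5: |k(x1, x2)|^2 <= k(x1, x1) * k(x2, x2).
lhs = poly_kernel(x1, x2) ** 2
rhs = poly_kernel(x1, x1) * poly_kernel(x2, x2)
print(lhs <= rhs + 1e-12)   # True
```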
We are now in a position to construct the feature space associated with a kernel k.

We define a map from X into the space of functions mapping X into ℂ, denoted as ℂ^X, via

Φ : X → ℂ^X
x ↦ k(·, x).    (79)

Here, Φ(x) = k(·, x) denotes the function that assigns the value k(x′, x) to x′ ∈ X.

We have thus turned each pattern into a function on the domain X. In a sense, a pattern is now represented by its similarity to all other points in the input domain X. This seems a very rich representation, but it will turn out that the kernel allows the computation of the dot product in that representation.

We shall now construct a dot product space containing the images of the input patterns under Φ. To this end, we first need to endow it with the linear structure of a vector space. This is done by forming linear combinations of the form

f(·) = Σ_{i=1}^{m} α_i k(·, x_i).    (80)

Here, m ∈ ℕ, α_i ∈ ℂ and x_i ∈ X are arbitrary.

Next, we define a dot product between f and another function

g(·) = Σ_{j=1}^{m′} β_j k(·, x′_j)    (81)

(m′ ∈ ℕ, β_j ∈ ℂ and x′_j ∈ X) as

⟨f, g⟩ := Σ_{i=1}^{m} Σ_{j=1}^{m′} ᾱ_i β_j k(x_i, x′_j).    (82)

To see that this is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that

⟨f, g⟩ = Σ_{j=1}^{m′} β_j f(x′_j),    (83)

using k(x′_j, x_i) = k̄(x_i, x′_j). The latter, however, does not depend on the particular expansion of f. Similarly, for g, note that

⟨f, g⟩ = Σ_{i=1}^{m} ᾱ_i g(x_i).    (84)

The last two equations also show that ⟨·, ·⟩ is antilinear in the first argument and linear in the second one. It is symmetric, as ⟨f, g⟩ = ⟨g, f⟩̄. Moreover, given functions f₁, …, f_n, and coefficients γ₁, …, γ_n ∈ ℂ, we have

Σ_{i,j=1}^{n} γ̄_i γ_j ⟨f_i, f_j⟩ = ⟨Σ_i γ_i f_i, Σ_i γ_i f_i⟩ ≥ 0,    (85)

hence ⟨·, ·⟩ is actually a positive definite kernel on our function space.
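To make the construction concrete, the sketch below (ours) forms two expansions of the form (80)/(81) with real coefficients, evaluates the dot product (82), and checks (83) and the positivity in (85); the kernel and data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def k(x, xp):
    """Gaussian RBF kernel used as the positive definite kernel."""
    return np.exp(-np.linalg.norm(x - xp) ** 2)

# f = sum_i alpha_i k(., x_i)  and  g = sum_j beta_j k(., x'_j), cf. (80)/(81).
X_f, alpha = rng.normal(size=(4, 2)), rng.normal(size=4)
X_g, beta = rng.normal(size=(3, 2)), rng.normal(size=3)

def f(x):
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_f))

# Dot product (82): <f, g> = sum_ij alpha_i beta_j k(x_i, x'_j).
dot_fg = sum(a * b * k(xi, xj) for a, xi in zip(alpha, X_f)
                               for b, xj in zip(beta, X_g))

# Consistency with (83): <f, g> = sum_j beta_j f(x'_j).
print(np.isclose(dot_fg, sum(b * f(xj) for b, xj in zip(beta, X_g))))  # True

# Positivity as in (85): <f, f> >= 0.
dot_ff = sum(a1 * a2 * k(x1, x2) for a1, x1 in zip(alpha, X_f)
                                 for a2, x2 in zip(alpha, X_f))
print(dot_ff >= 0)   # True
```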
For the last step in proving that it even is a dot product, we will use the following interesting property of Φ, which follows directly from the definition: for all functions (80), we have

⟨k(·, x), f⟩ = f(x),    (86)

that is, k is the representer of evaluation. In particular,

⟨k(·, x), k(·, x′)⟩ = k(x, x′).    (87)

By virtue of these properties, positive kernels k are also called reproducing kernels [3, 5, 33, 17].

By (86) and Proposition 5, we have

|f(x)|² = |⟨k(·, x), f⟩|² ≤ k(x, x) · ⟨f, f⟩.    (88)

Therefore, ⟨f, f⟩ = 0 directly implies f = 0, which is the last property that was left to prove in order to establish that ⟨·, ·⟩ is a dot product.

One can complete the space of functions (80) in the norm corresponding to the dot product, i.e. add the limit points of sequences that are convergent in that norm, and thus gets a Hilbert space H, usually called a reproducing kernel Hilbert space.²

The case of real-valued kernels is included in the above; in that case, H can be chosen as a real Hilbert space.

9 Examples of Kernels

Besides (65), [8] and [28] suggest the usage of Gaussian radial basis function kernels [1]

k(x, x′) = exp(−‖x − x′‖² / (2σ²))    (89)

and sigmoid kernels

k(x, x′) = tanh(κ(x · x′) + Θ).    (90)

Note that all these kernels have the convenient property of unitary invariance, i.e. k(x, x′) = k(Ux, Ux′) if U⊤ = U⁻¹ (if we consider complex numbers, then U* instead of U⊤ has to be used).

The radial basis function kernel additionally is translation invariant. Moreover, as it satisfies k(x, x) = 1 for all x ∈ X, each mapped example has unit length, ‖Φ(x)‖ = 1. In addition, as k(x, x′) > 0 for all x, x′ ∈ X, all points lie inside the same orthant in feature space. To see this, recall that for unit length vectors, the dot product (3) equals the cosine of the enclosed angle. Hence

cos(∠(Φ(x), Φ(x′))) = (Φ(x) · Φ(x′)) = k(x, x′) > 0,    (91)

which amounts to saying that the enclosed angle between any two mapped examples is smaller than π/2.

² A Hilbert space H is defined as a complete dot product space. Completeness means that all sequences in H which are convergent in the norm corresponding to the dot product will actually have their limits in H, too.
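A short numerical check of these RBF properties (ours; σ = 1 and the random data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(x, xp, sigma=1.0):
    """Gaussian RBF kernel (89)."""
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

X = rng.normal(size=(6, 3))
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# Unit length in feature space: ||Phi(x)||^2 = k(x, x) = 1.
print(np.allclose(np.diag(K), 1.0))   # True

# Eq. (91): the cosine of the angle between mapped points equals k(x, x') > 0,
# so all pairwise angles are below pi/2 (same orthant).
angles = np.arccos(np.clip(K, -1.0, 1.0))
print((K > 0).all(), (angles < np.pi / 2).all())   # True True
```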
The examples given so far apply to the case of vectorial data. Let us at least give one example where X is not a vector space.

Example 6 (Similarity of probabilistic events). If A is a σ-algebra, and P a probability measure on A, then

k(A, B) = P(A ∩ B) − P(A)P(B)    (92)

is a positive definite kernel.

Further examples include kernels for string matching, as proposed by [34, 12].

10 Representing Dissimilarities in Linear Spaces

We now move on to a larger class of kernels. It is interesting in several regards. First, it will turn out that some kernel algorithms work with this larger class of kernels, rather than only with positive definite kernels. Second, their relationship to positive definite kernels is a rather interesting one, and a number of connections between the two classes provide understanding of kernels in general. Third, they are intimately related to a question which is a variation on the central aspect of positive definite kernels: the latter can be thought of as dot products in feature spaces; the former, on the other hand, can be embedded as distance measures arising from norms in feature spaces.

The following definition differs only in the additional constraint on the sum of the c_i from Definition 3.

Definition 7 (Conditionally positive matrix). A symmetric m × m matrix K_{ij} (m ≥ 2) satisfying

Σ_{i,j=1}^{m} c_i c̄_j K_{ij} ≥ 0    (93)

for all c_i ∈ ℂ with

Σ_{i=1}^{m} c_i = 0    (94)

is called conditionally positive.

Definition 8 (Conditionally positive definite kernel). Let X be a nonempty set. A function k : X × X → ℂ which for all m ≥ 2, x_i ∈ X gives rise to a conditionally positive Gram matrix is called a conditionally positive definite kernel.

The definitions for the real-valued case look exactly the same. Note that symmetry is required, also in the complex case. Due to the additional constraint on the coefficients c_i, it does not follow automatically anymore.
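Example 6 can be verified on a small finite probability space. The sketch below (ours) draws a random measure and random events and checks that the resulting Gram matrix is positive; the space size and event construction are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

# A finite probability space Omega = {0, ..., 5} with a random measure p,
# and events drawn as random subsets of Omega.
omega = range(6)
p = rng.dirichlet(np.ones(6))

def P(A):
    """Probability of an event (a subset of Omega)."""
    return sum(p[w] for w in A)

events = [frozenset(w for w in omega if rng.random() < 0.5) for _ in range(5)]

# k(A, B) = P(A & B) - P(A) P(B), cf. (92); its Gram matrix should be positive.
K = np.array([[P(A & B) - P(A) * P(B) for B in events] for A in events])
print(np.linalg.eigvalsh(K).min() >= -1e-12)   # True
```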
It is trivially true that whenever k is positive definite, it is also conditionally positive definite. However, the latter is strictly weaker: if k is conditionally positive definite, and b ∈ ℂ, then k + b is also conditionally positive definite. To see this, simply apply the definition to get, for Σ_i c_i = 0,

Σ_{i,j} c_i c̄_j (k(x_i, x_j) + b) = Σ_{i,j} c_i c̄_j k(x_i, x_j) + b |Σ_i c_i|² = Σ_{i,j} c_i c̄_j k(x_i, x_j) ≥ 0.    (95)

A standard example of a conditionally positive definite kernel which is not positive definite is

k(x, x′) = −‖x − x′‖²,    (96)

where x, x′ ∈ X, and X is a dot product space.

To see this, simply compute, for some pattern set x₁, …, x_m,

Σ_{i,j} c_i c_j k(x_i, x_j) = −Σ_{i,j} c_i c_j ‖x_i − x_j‖²    (97)
 = −Σ_{i,j} c_i c_j (‖x_i‖² + ‖x_j‖² − 2(x_i · x_j))
 = −Σ_i c_i Σ_j c_j ‖x_j‖² − Σ_j c_j Σ_i c_i ‖x_i‖² + 2Σ_{i,j} c_i c_j (x_i · x_j)
 = 2Σ_{i,j} c_i c_j (x_i · x_j) ≥ 0,    (98)

where the last line follows from (94) and the fact that k(x, x′) = (x · x′) is a positive definite kernel. Note that without (94), (97) can also be negative (e.g., put c₁ = … = c_m = 1), hence the kernel is not a positive definite one.

Without proof, we add that in fact,

k(x, x′) = −‖x − x′‖^β    (99)

is conditionally positive definite for 0 < β ≤ 2.

Let us consider the kernel (96), which can be considered the canonical conditionally positive kernel on a dot product space, and see how it is related to the dot product. Clearly, the distance induced by the norm is invariant under translations, i.e.

‖x − x′‖ = ‖(x − x₀) − (x′ − x₀)‖    (100)

for all x, x′, x₀ ∈ X. In other words, even complete knowledge of ‖x − x′‖ for all x, x′ ∈ X would not completely determine the underlying dot product, the reason being that the dot product is not invariant under translations. Therefore, one needs to fix an origin x₀ when going from the distance measure to the dot product. To this end, we need to write the dot product of x − x₀ and x′ − x₀ in terms of distances:

((x − x₀) · (x′ − x₀)) = (x · x′) − (x · x₀) − (x′ · x₀) + ‖x₀‖²
 = ½ (−‖x − x′‖² + ‖x − x₀‖² + ‖x′ − x₀‖²).    (101)

By construction, this will always result in a positive definite kernel: the dot product is a positive definite kernel, and we have only translated the inputs.
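A numerical check of the claims around (96)-(98) (ours): with coefficients summing to zero the quadratic form is nonnegative, while without the constraint it can be negative.

```python
import numpy as np

rng = np.random.default_rng(6)

def k(x, xp):
    """The conditionally positive definite kernel (96): k(x, x') = -||x - x'||^2."""
    return -np.linalg.norm(x - xp) ** 2

X = rng.normal(size=(7, 3))
K = np.array([[k(xi, xj) for xj in X] for xi in X])

# With coefficients summing to zero (94), the quadratic form (93) is >= 0.
c = rng.normal(size=7)
c -= c.mean()                  # enforce sum(c) = 0
print(c @ K @ c >= -1e-10)     # True

# Without the constraint the form can be negative, so k is not positive definite.
ones = np.ones(7)
print(ones @ K @ ones < 0)     # True (equals minus the sum of squared distances)
```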
We have thus established the connection between (96) and a class of positive definite kernels corresponding to the dot product in different coordinate systems, related to each other by translations. In fact, a similar connection holds for a wide class of kernels:

Proposition 9. Let x₀ ∈ X, and let k be a symmetric kernel on X × X, satisfying k(x₀, x₀) ≤ 0. Then

k̃(x, x′) := k(x, x′) − k(x, x₀) − k(x₀, x′)    (102)

is positive definite if and only if k is conditionally positive definite.

This result can be generalized to k(x₀, x₀) > 0. In this case, we simply need to add k(x₀, x₀) on the right hand side of (102). This is necessary, for otherwise, we would have k̃(x₀, x₀) < 0, contradicting (74). Without proof, we state that it is also sufficient.

Using this result, one can prove another interesting connection between the two classes of kernels:

Proposition 10. A kernel k is conditionally positive definite if and only if exp(tk) is positive definite for all t > 0.

Positive definite kernels of the form exp(tk) (t > 0) have the interesting property that their n-th root (n ∈ ℕ) is again a positive definite kernel. Such kernels are called infinitely divisible. One can show that, disregarding some technicalities, the logarithm of an infinitely divisible positive definite kernel mapping into ℝ₀⁺ is a conditionally positive definite kernel.

Conditionally positive definite kernels are a natural choice whenever we are dealing with a translation invariant problem, such as the support vector machine: maximization of the margin of separation between two classes of data is independent of the origin's position. This can be seen from the dual optimization problem (36): the constraint Σ_{i=1}^{m} α_i y_i = 0 projects out the same subspace as (94) in the definition of conditionally positive matrices [17, 23].

We have seen that positive definite kernels and conditionally positive definite kernels are closely related to each other. The former can be represented as dot products in Hilbert spaces. The latter, it turns out, essentially correspond to distance measures associated with norms in Hilbert spaces:

Proposition 11. Let k be a real-valued conditionally positive definite kernel on X, satisfying k(x, x) = 0 for all x ∈ X. Then there exists a Hilbert space H of real-valued functions on X, and a mapping Φ : X → H, such that

k(x, x′) = −‖Φ(x) − Φ(x′)‖².    (103)

There exist generalizations to the case where k(x, x) ≠ 0 and where k maps into ℂ. In these cases, the representation looks slightly more complicated.
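Proposition 10 can be illustrated with the kernel (96): exp(tk) is then exactly a Gaussian RBF kernel, and its Gram matrices should be positive for every t > 0. The sketch below (ours; data and the sampled values of t are arbitrary) checks this.

```python
import numpy as np

rng = np.random.default_rng(7)

# With k(x, x') = -||x - x'||^2 (eq. 96), exp(t*k) is the Gaussian RBF kernel.
X = rng.normal(size=(10, 2))
sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2

for t in (0.1, 1.0, 10.0):
    K = np.exp(-t * sqdist)                               # exp(t * k)
    print(t, np.linalg.eigvalsh(K).min() >= -1e-10)       # True for each t
```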
The significance of Proposition 11 is that, using conditionally positive definite kernels, we can generalize all algorithms based on distances to corresponding algorithms operating in feature spaces. This is an analogue of the kernel trick for distances rather than dot products, i.e. dissimilarities rather than similarities.

Acknowledgements. Thanks to A. Smola and R. Williamson for discussions, and to C. Watkins for pointing out, in his NIPS'99 SVM workshop talk, that distances and dot products differ in the way they deal with the origin.

References

[1] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoér. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.
[3] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.
[4] P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 43-54, Cambridge, MA, 1999. MIT Press.
[5] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
[6] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[7] V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 251-256, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.
[8] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.
[9] C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.
[10] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[11] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.
[12] D. Haussler. Convolutional kernels on discrete structures. Technical Report
UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.
[13] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909.
[14] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In J. Principe, L. Gile, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, pages 276-285, New York, 1997. IEEE.
[15] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.
[16] T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201-209, 1975.
[17] B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München, 1997. Doktorarbeit, TU Berlin.
[18] B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining, Menlo Park, 1995. AAAI Press.
[19] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[20] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. TR MSR 99-87, Microsoft Research, Redmond, WA, 1999.
[21] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[22] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1083-1121, 2000.
[23] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.
[24] A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.
[25] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
[26] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
[27] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982).
[28] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[29] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[30] V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.
[31] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
[32] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.
[33] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
[34] C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39-50, Cambridge, MA, 2000. MIT Press.
[35] R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report 19, NeuroCOLT, http://www.neurocolt.com, 1998. Accepted for publication in IEEE Transactions on Information Theory.
Duality and Geometry in SVM Classifiers

Kristin P. Bennett [email protected]

Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180 USA

Erin J. Bredensteiner [email protected]

Department of Mathematics, University of Evansville, Evansville, IN 47722 USA

Abstract

We develop an intuitive geometric interpretation of the standard support vector machine (SVM) for classification of both linearly separable and inseparable data and provide a rigorous derivation of the concepts behind the geometry. For the separable case, finding the maximum margin between the two sets is equivalent to finding the closest points in the smallest convex sets that contain each class (the convex hulls). We now extend this argument to the inseparable case by using a reduced convex hull reduced away from outliers. We prove that solving the reduced convex hull formulation is exactly equivalent to solving the standard inseparable SVM for appropriate choices of parameters. Some additional advantages of the new formulation are that the effect of the choice of parameters becomes geometrically clear and that the formulation may be solved by fast nearest point algorithms. By changing norms these arguments hold for both the standard 2-norm and 1-norm SVM.

1. Introduction

Support vector machines (SVM) are a very robust methodology for inference with minimal parameter choices. This should translate into the popular adaptation of SVM in many application domains by non-SVM experts. The popular success of prior methodologies like neural networks, genetic algorithms, and decision trees was enhanced by the intuitive motivation of these approaches, which in some sense enhanced the end users' ability to develop applications independently and have a sense of confidence in the results. How do you sell an SVM to a consulting client, manager, etc.? What quick description would allow an end user to grasp the fundamentals of SVM necessary for a successful application? There are three key ideas needed to understand SVM: maximizing margins, the dual formulation, and kernels. Most people intuitively grasp the idea that maximizing margins should help improve generalization. But changing from the primal to dual formulation is typically black magic for those uninitiated in duality theory. Duality is really the key concept frequently missing in the understanding of SVM.

In this paper we provide an intuitive geometric explanation of SVM for classification from the dual perspective along with a mathematically rigorous derivation of the ideas behind the geometry. We begin with an explanation of the geometry of SVM based on the idea of convex hulls. For the separable case, this geometric explanation has existed in various forms (Vapnik, 1996; Mangasarian, 1965; Keerthi et al., 1999; Bennett & Bredensteiner, in press). The new contribution is the adaptation of the convex hull argument for the inseparable case to the most commonly used 2-norm and 1-norm soft margin SVM. The primal form resulting from this argument can be regarded as an especially elegant minor variant of the ν-SVM formulation (Scholkopf et al., 2000) or a soft margin form of the MSM method (Mangasarian, 1965). Related geometric ideas for the ν-SVM formulation were developed independently by Crisp and Burges (1999).

The primary contributions of this paper are:

• A simple intuitive explanation of SVM based on (reduced) convex hulls that allows nonexperts to grasp geometrically the main concepts of SVM.

• A new primal maximum (soft) margin SVM formulation that has as its dual the problem of finding the nearest neighbors in the (reduced) convex hulls. Major benefits of this formulation are that the effects of the misclassification parameter choice are very clear and that it is amenable to solution with very fast closest-points-in-polytope algorithms (Keerthi et al., 1999) and a minor variant of sequential minimal optimization (SMO) (Platt, 1998).

• Proof of the equivalence, for appropriate choices of parameters, between the primal and dual forms of the reduced-convex-hull SVM and the primal and dual forms of the classic SVM.

• Extensions of the reduced convex hull arguments to the sparse 1-norm SVM formulation and a new infinity-norm SVM.

For compactness, we adopt matrix notation instead of the more typical summation notation. In particular, for a column vector x in the n-dimensional real space R^n, x_i denotes the ith component of x. The notation A ∈ R^{m×n} will signify a real m × n matrix. For such a matrix, A_i will denote the ith row. The transposes of x and A are denoted x′ and A′ respectively. The dot product of two vectors x and w is denoted by x′w. A vector of ones in a space of arbitrary dimension is denoted by e. The scalar 0 and a vector of zeros are both represented by 0. Thus, for x ∈ R^m, x > 0 implies that x_i > 0 for i = 1, …, m. In general, for x, y ∈ R^m, x > y implies that x_i > y_i for i = 1, …, m. Similarly, x ≥ y implies that x_i ≥ y_i for i = 1, …, m.

Several norms are used. The 1-norm of x, Σ_{i=1}^{m} |x_i|, is denoted by ‖x‖₁. The 2-norm or Euclidean norm of x, √(Σ_{i=1}^{m} x_i²) = √(x′x), is denoted by ‖x‖, and ‖x‖² = x′x. The infinity-norm of x, max_{i=1,…,m} |x_i|, is denoted by ‖x‖∞.
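For concreteness, the three norms in this notation can be evaluated directly with NumPy (a small illustration of ours, not from the paper):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
print(np.linalg.norm(x, 1))        # ||x||_1   = |3| + |-4| + |1| = 8
print(np.linalg.norm(x))           # ||x||     = sqrt(x'x)        = sqrt(26)
print(np.linalg.norm(x, np.inf))   # ||x||_inf = max_i |x_i|      = 4
```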

2. Geometric Intuition: Separable Case

Assume that we are trying to construct a linear discriminant to separate two separable sets A and B. Specifically, this linear discriminant is the plane x′w = γ, where w is the normal of the plane and |γ|/‖w‖ is the Euclidean distance of the plane from the origin. Let the coordinates of the points in A be given by the m rows of the m × n matrix A. Let the coordinates of the points in B be given by the k rows of the k × n matrix B. We say that the sets are linearly separable if w and γ exist such that Aw > eγ and Bw < eγ, where e is a vector of ones of appropriate dimension.

Figure 1 shows two such separable sets and two of the infinitely many possible planes that separate the sets with 100% accuracy. Which separable plane is preferable? With no other knowledge of the data, most people will prefer the solid line because it is further from each of the sets. In the case of the dotted line, small changes in the data will produce misclassification errors. So an intuitive idea would be to construct the plane that maximizes the minimum distance from the plane to each set. In fact we know this intuition coincides with the results in statistical learning theory (Vapnik, 1996) and is substantiated by results in Shawe-Taylor et al. (1998).

Figure 1. Which plane is best?

Figure 2. The two closest points of the convex hulls determine the separating plane.

One way to construct the plane as far as possible from both sets is to construct the smallest convex sets that contain all the data in each class (i.e. the convex hull) and find the closest points in those sets. Then, construct the line segment between the two points. The plane, orthogonal to the line segment, that bisects the line segment is chosen to be the separating plane. See, for example, Figure 2. The smallest convex set containing a set of points is called a convex hull. The convex hulls of A and B are shown with dashed lines. The convex hull consists of all points that can be written as a convex combination of the points in the original set. A convex combination of points is a positive weighted combination where the weights sum to one, e.g. a convex combination c of points in A is defined by c′ = u₁A₁ + u₂A₂ + … + u_mA_m = u′A, where u ∈ R^m, u ≥ 0, and Σ_{i=1}^{m} u_i = e′u = 1, and a convex combination d of points in B is defined by d′ = v₁B₁ + v₂B₂ + … + v_kB_k = v′B, where v ∈ R^k, v ≥ 0, and e′v = 1.

Figure 3. The primal problem maximizes the distance between two parallel supporting planes.

The problem of finding the two closest points in the convex hulls can be written as an optimization problem (C-Hull):

min_{u,v}  ½ ‖A′u − B′v‖²
s.t.  e′u = 1,  e′v = 1,  u ≥ 0,  v ≥ 0.    (1)

The linear discriminant, x′w = γ, is constructed from the results of C-Hull (1). The normal w is exactly the vector between the two closest points in the convex hulls. Let u and v be an optimal solution of (1). The normal of the plane is the difference between the closest points, c = A′u and d = B′v. Thus w = c − d = A′u − B′v. The threshold, γ, is the distance from the origin to the point halfway between the two closest points along the normal w: γ = ((c + d)/2)′w = (u′Aw + v′Bw)/2.
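As an illustration only (not from the paper), C-Hull (1) can be handed to a general-purpose solver. The sketch below uses scipy.optimize.minimize (SLSQP) on synthetic separable data and then builds w and γ as just described; the data, seed, and solver choice are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Two linearly separable point sets (rows of A and B), chosen for illustration.
A = rng.normal(size=(20, 2)) + np.array([3.0, 3.0])
B = rng.normal(size=(25, 2)) + np.array([-3.0, -3.0])
m, k = len(A), len(B)

def objective(z):
    """C-Hull objective (1): 0.5 * ||A'u - B'v||^2 with z = (u, v)."""
    u, v = z[:m], z[m:]
    diff = A.T @ u - B.T @ v
    return 0.5 * diff @ diff

constraints = [{"type": "eq", "fun": lambda z: z[:m].sum() - 1.0},   # e'u = 1
               {"type": "eq", "fun": lambda z: z[m:].sum() - 1.0}]   # e'v = 1
z0 = np.concatenate([np.full(m, 1.0 / m), np.full(k, 1.0 / k)])
res = minimize(objective, z0, bounds=[(0.0, None)] * (m + k),
               constraints=constraints, method="SLSQP")

u, v = res.x[:m], res.x[m:]
c, d = A.T @ u, B.T @ v            # closest points in the two convex hulls
w = c - d                          # normal of the separating plane
gamma = 0.5 * (c + d) @ w          # threshold: plane bisects the segment [c, d]
# Expect True True when the two clouds really are separable.
print(np.all(A @ w > gamma), np.all(B @ w < gamma))
```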

There is an alternative approach to finding the best separating plane. Consider a set of parallel supporting planes as in Figure 3. These planes are positioned so that all the points in A satisfy x′w ≥ α and at least one point in A lies on the plane x′w = α. Similarly, all points in B satisfy x′w ≤ β and at least one point in B lies on the plane x′w = β. The optimal separating plane can be found by maximizing the distance between these two supporting hyperplanes. The distance between the two parallel supporting hyperplanes is (α − β)/‖w‖. Therefore the distance between the two planes can be maximized by minimizing ‖w‖ and maximizing (α − β).

The problem of maximizing the distance between the two supporting hyperplanes can be written as the following optimization problem (C-Margin):

min_{w,α,β}  ½ ‖w‖² − (α − β)
s.t.  Aw − αe ≥ 0,  −Bw + βe ≥ 0.    (2)

The final separating plane is the plane halfway between the two parallel planes: x′w = (α + β)/2. Note that the maximum distance between the supporting planes yields the distance between the two convex hulls. The two closest points for each convex hull must then lie on the supporting planes. The line segment between the two closest points in the convex hulls must be orthogonal to the supporting planes, otherwise a contradiction exists. Such a contradiction could be that either the two supporting planes are not as far apart as possible or these two points are not the closest points in the convex hulls. Therefore the solutions of both approaches are exactly the same. This is an example of duality. As stated later in Theorem 4.1, the dual of C-Margin (2) is C-Hull (1). See Bennett and Bredensteiner (in press) for the derivation. We can formulate and solve the problem in either space as is convenient for us. If there is no degeneracy, we will always get the same plane.

The primal C-Margin (2) and dual C-Hull (1) formulations provide a unifying framework for explaining other SVM formulations. By transforming C-Margin (2) into mathematically equivalent optimization problems, different SVM formulations are produced. If we set α − β = 2 by defining α = γ + 1 and β = γ − 1, then Problem (2) becomes the standard primal SVM 2-norm formulation (Vapnik, 1996):

min_{w,γ}  ½ ‖w‖²
s.t.  Aw − (γ + 1)e ≥ 0,  −Bw + (γ − 1)e ≥ 0.    (3)

In fact, as stated in Theorem 4.2, the classic 2-norm SVM (3) and C-Margin (2) are mathematically equivalent on separable problems. They will produce the exact same separating plane or an equally good plane if multiple solutions exist (see Burges & Crisp, 1999).


Figure 4. The convex hulls of inseparable sets intersect.

3. Geometric Intuition: Inseparable Case

For inseparable problems, the convex hulls of the two sets will intersect. Consider Figure 4. The difficult-to-classify points of one set will be in the convex hull of the other set. In a problem amenable to linear classification, most points of one class will not be in the convex hull of the other. If we could restrict the influence of outlying points then we could return to the usual convex hull problem. It is undesirable to let one point, particularly a difficult point, excessively influence the solution. Therefore, we want the solution to be based on a lot of points, not just a few bad ones. Say we want the solution to depend on at least K points. This can be done by contracting or reducing the convex hull by putting an upper bound on the multiplier in the convex combination for each point. The reduced convex hull is defined as follows.

Definition 3.1 (Reduced Convex Hull). The set of all convex combinations c = A′u of points in A where e′u = 1, 0 ≤ u ≤ De, D < 1.

Typically we choose D = 1/K and K > 1. Note that the reduced convex hull is nonempty as long as K ≤ m, where m is the number of points in set A.

We reduce our feasible set away from the boundaries of the convex hulls so that no extreme point or noisy point can excessively influence the solution. In Figure 5, the reduced convex hulls with K = 2 are given. Note that the reduced sets no longer intersect. Further examples of reduced convex hulls can be seen in Crisp and Burges (1999), who refer to our reduced convex hulls as "soft convex hulls". We believe that this is a misnomer because softening implies that the convex hulls are expanding but in fact they are being reduced. As we will see later, the concept of reducing the convex hulls to avoid error is the dual concept to enlarging margins by softening them to allow error. For sets with lots of redundant points, reducing the convex hull has little effect. But for a set with a single outlier the effect is quite marked. Note that for small D the reduced convex hulls no longer intersect. In general, we will need to choose K sufficiently large to ensure that the convex hulls do not intersect. We can now proceed as in the separable case using the reduced convex hulls instead. We will minimize the distance between the reduced convex hulls so that a few bad points will not dominate the solution.

Figure 5. Convex hull and reduced convex hull with K = 2.

The problem of finding the two closest points in the reduced convex hulls can be written as an optimization problem (RC-Hull):

min_{u,v}  ½ ‖A′u − B′v‖²
s.t.  e′u = 1,  e′v = 1,  0 ≤ u ≤ De,  0 ≤ v ≤ De.    (4)

Immediately we can see the effect of our choice of parameter D = 1/K. Note that each point can contribute no more than 1/K to the optimal solution. So the solution will be robust in some sense since it depends on at least 2K points. If K is too large, or conversely D is too small, the problem will be infeasible. So K must be smaller than the number of points in each set. Increasing D larger than 1 will produce no advantage over the solution where D = 1. If we have varying confidence in the points or if our classes are skewed in size we can choose different values of D for each point or class. The reader should consult (Scholkopf et al., 2000) for a more formal derivation of these and additional properties for the ν-SVM formulation, which also has been shown to solve the closest points between the reduced convex hulls problem (Crisp & Burges, 1999). RC-Hull (4) is suitable for solution by nearest points in convex polytope algorithms; see (Keerthi et al., 1999).
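The same kind of sketch as before (ours, not from the paper) works for RC-Hull (4); the only change is the upper bound D on the multipliers. The synthetic data, with one outlier per class, and the choice K = 5 are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Two clusters that would be separable, each contaminated with one outlier
# sitting inside the other class, so the full convex hulls intersect.
A = np.vstack([rng.normal(size=(20, 2)) * 0.5 + [2.0, 2.0], [[-2.0, -2.0]]])
B = np.vstack([rng.normal(size=(20, 2)) * 0.5 + [-2.0, -2.0], [[2.0, 2.0]]])
m, k = len(A), len(B)
D = 1.0 / 5.0                      # each multiplier bounded by D = 1/K, K = 5

def objective(z):
    """RC-Hull objective (4): 0.5 * ||A'u - B'v||^2 with z = (u, v)."""
    u, v = z[:m], z[m:]
    diff = A.T @ u - B.T @ v
    return 0.5 * diff @ diff

constraints = [{"type": "eq", "fun": lambda z: z[:m].sum() - 1.0},   # e'u = 1
               {"type": "eq", "fun": lambda z: z[m:].sum() - 1.0}]   # e'v = 1
z0 = np.concatenate([np.full(m, 1.0 / m), np.full(k, 1.0 / k)])
res = minimize(objective, z0, bounds=[(0.0, D)] * (m + k),           # 0 <= u, v <= De
               constraints=constraints, method="SLSQP")

u, v = res.x[:m], res.x[m:]
w = A.T @ u - B.T @ v
# Because each outlier's weight is capped at D, the reduced hulls pull away
# from it, and w is typically nonzero even though the full hulls intersect.
print(np.linalg.norm(w), (u > 1e-6).sum(), (v > 1e-6).sum())
```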

If we add a soft margin error term to the separable C-Margin Problem (2), we get the following problem for the inseparable case (RC-Margin):

min_{w,ξ,η,α,β}  D(e′ξ + e′η) + ½ ‖w‖² − α + β
s.t.  Aw − αe + ξ ≥ 0,  ξ ≥ 0,  −Bw + βe + η ≥ 0,  η ≥ 0,    (5)

with D = 1/K > 0. As we will prove in Theorem 4.3, the dual of RC-Margin (5) is exactly RC-Hull (4), which finds the closest points in the reduced convex hulls.

As in the linearly separable case, one transformation of this problem is to fix α − β by setting α = γ + 1 and β = γ − 1. This results in the classic support vector machine approach (Vapnik, 1996):

min_{w,ξ,η,γ}  C(e′ξ + e′η) + ½ ‖w‖²
s.t.  Aw − γe + ξ ≥ e,  −Bw + γe + η ≥ e,  ξ ≥ 0,  η ≥ 0,    (6)

where C > 0 is a fixed constant. Note that the constant C is now different due to an implicit rescaling of the problem. As we will show in Theorem 4.4, the RC-Margin (5) and classic inseparable SVM (6) are equivalent for appropriate choices of C and D.

4. Equivalence to Classic Formulation

We now rigorously examine the claims of the previous section. We begin with the separable case. For both the separable and inseparable cases, the theorems establish that the dual of our SVM (soft) maximum margin formulation is exactly the (reduced) convex hull formulation and that our (reduced) convex hull based SVM formulations are equivalent to the classic SVM form for appropriate choices of parameters. The first theorem states that the problem of finding the two closest points in the convex hulls of two separable sets is the Wolfe dual (or equivalently Lagrangian dual) of the problem of finding the best separating plane.

Theorem 4.1 (Convex Hulls is Dual). The Wolfe dual of C-Margin SVM (2) is the closest points of the convex hull problem C-Hull (1), or:

max_{u,v}  −½ ‖A′u − B′v‖²
s.t.  e′u = e′v = 1,  u ≥ 0,  v ≥ 0.    (7)

Proof of this theorem can be found in full detail in (Bennett & Bredensteiner, in press) or can easily be derived as a variant of the corresponding theorem for the inseparable case.

Problem C-Margin (2), the primal form of the dual C-Hull of finding the closest two points in the convex hulls, is equivalent to the classic separable 2-norm SVM (3) in Vapnik (1996). Specifically, every solution to one problem can be used to construct a corresponding solution to the other by simple scaling. The theorem assumes that the degenerate solution w = 0 is not optimal. This is equivalent to saying that the convex hulls do not intersect. For convex quadratic programs with linear constraints, a solution is optimal if and only if it (along with the corresponding Lagrangian multipliers) satisfies the Karush-Kuhn-Tucker (KKT) optimality conditions of primal feasibility, dual feasibility, and complementary slackness. We call a set of primal C-Margin and dual C-Hull solutions a KKT point. We can establish the equivalence of the C-Margin/C-Hull formulations with the classic separable SVM formulation by showing that a KKT point of one can be used to derive a KKT point of the other. The optimal separating plane of one solution will also be optimal for the other form, but the weights and threshold are scaled by a constant.

Theorem 4.2 (Equivalence of Separable Forms). Assume C-Margin (2) has a solution with ‖ŵ‖ > 0. Then (w, γ, u, v) is a KKT point of the classic separable SVM (3) if and only if (ŵ, α̂, β̂, û, v̂) is a KKT point of C-Margin (2), where δ = e′u = 2/(α̂ − β̂), ŵ = w/δ, α̂ = (γ + 1)/δ, β̂ = (γ − 1)/δ, û = u/δ, and v̂ = v/δ.

Proof. Each KKT point of the classic separable SVM (3) satisfies:

Aw − (γ + 1)e ≥ 0,            −Bw + (γ − 1)e ≥ 0,
u′(Aw − (γ + 1)e) = 0,        v′(−Bw + (γ − 1)e) = 0,
w = A′u − B′v,                e′u = e′v,
u ≥ 0,                        v ≥ 0.    (8)

Dividing each constraint by δ or δ² as appropriate yields a KKT point of the C-Margin SVM (2) satisfying:

Aŵ − α̂e ≥ 0,                 −Bŵ + β̂e ≥ 0,
û′(Aŵ − α̂e) = 0,             v̂′(−Bŵ + β̂e) = 0,
ŵ = A′û − B′v̂,               1 = e′û = e′v̂,
û ≥ 0,                        v̂ ≥ 0.    (9)

Similarly, multiplying the KKT conditions (9) of C-Margin (2) by δ = 2/(α̂ − β̂) or δ² yields the KKT conditions (8) of the standard separable SVM (3). We know α̂ − β̂ > 0 because by strong duality the primal and dual objectives will be equal, thus

½ ‖ŵ‖² − α̂ + β̂ = −½ ‖A′û − B′v̂‖² < 0.


The theorems can be directly generalized to the inseparable case based on reduced convex hulls. The Wolfe dual (for example, see Mangasarian, 1969) of RC-Margin (5) is precisely the closest points in the reduced convex hull problem, RC-Hull (4).

Theorem 4.3 (Reduced Convex Hulls is Dual). The Wolfe dual of RC-Margin (5) is RC-Hull (4), or equivalently:

max_{u,v}  −½ ‖A′u − B′v‖²
s.t.  e′u = e′v = 1,  De ≥ u ≥ 0,  De ≥ v ≥ 0.    (10)

Proof. The dual problem maximizes the Lagrangian function of (5), L(w, α, β, ξ, η, u, v, r, s), subject to the constraints that the partial derivatives of the Lagrangian with respect to the primal variables are equal to zero (Mangasarian, 1969). Specifically, the dual of (5) is:

max_{w,α,β,ξ,η,u,v,r,s}  L(w, α, β, ξ, η, u, v, r, s) = ½ ‖w‖² − α + β + De′ξ + De′η − u′(Aw − αe + ξ) − v′(−Bw + βe + η) − r′ξ − s′η
s.t.  ∂L/∂w = w − A′u + B′v = 0,
      ∂L/∂α = −1 + e′u = 0,  u ≥ 0,
      ∂L/∂β = 1 − e′v = 0,  v ≥ 0,
      ∂L/∂ξ = De − u = r ≥ 0,
      ∂L/∂η = De − v = s ≥ 0,    (11)

where α, β ∈ R, w ∈ R^n, ξ, u, r ∈ R^m, and η, v, s ∈ R^k. To simplify the problem, substitute in w = (A′u − B′v), r = De − u and s = De − v:

max_{α,β,u,v}  ½ ‖A′u − B′v‖² − (α − β) + De′ξ + De′η − u′A(A′u − B′v) + v′B(A′u − B′v) + e′u α − e′v β − u′ξ − v′η − De′ξ − De′η + u′ξ + v′η
s.t.  e′u = e′v = 1,  De ≥ u ≥ 0,  De ≥ v ≥ 0,    (12)

and then simplify to yield RC-Hull (10).

Optimizing the reduced-convex-hull form of SVM with parameter D is equivalent to optimizing the classic 2-norm SVM (6) with parameter C. The parameters D and C are related by multiplication of a constant factor based on the size of the optimal margin. If the appropriate values of D and C are chosen, then once again a KKT point of one will be a KKT point of the other. A similar result for the ν-SVM formulation is given in Proposition 13 in Scholkopf et al. (2000).

Theorem 4.4 (Equivalence of Inseparable Forms). Assume RC-Margin (5) has a solution with ‖ŵ‖ > 0. Then (w, γ, ξ, η, u, v) is a KKT point of the classic inseparable SVM (6) with parameter C if and only if (ŵ, α̂, β̂, ξ̂, η̂, û, v̂) is a KKT point of RC-Margin (5) with parameter D, where δ = e′u = 2/(α̂ − β̂), ŵ = w/δ, α̂ = (γ + 1)/δ, β̂ = (γ − 1)/δ, ξ̂ = ξ/δ, η̂ = η/δ, û = u/δ, v̂ = v/δ, and D = C/δ.

Proof. Each KKT point of the classic SVM (6) with parameter C satisfies:

Aw − (γ + 1)e + ξ ≥ 0,            w = A′u − B′v,
−Bw + (γ − 1)e + η ≥ 0,           e′u = e′v,
ξ ≥ 0,                            Ce ≥ u ≥ 0,
η ≥ 0,                            Ce ≥ v ≥ 0,
u′(Aw − (γ + 1)e + ξ) = 0,        ξ′(Ce − u) = 0,
v′(−Bw + (γ − 1)e + η) = 0,       η′(Ce − v) = 0.    (13)

Dividing each constraint by δ or δ² as appropriate yields a KKT point of the RC-Margin (5) with parameter D satisfying:

Aŵ − α̂e + ξ̂ ≥ 0,                 ŵ = A′û − B′v̂,
−Bŵ + β̂e + η̂ ≥ 0,                1 = e′û = e′v̂,
ξ̂ ≥ 0,                            De ≥ û ≥ 0,
η̂ ≥ 0,                            De ≥ v̂ ≥ 0,
û′(Aŵ − α̂e + ξ̂) = 0,             ξ̂′(De − û) = 0,
v̂′(−Bŵ + β̂e + η̂) = 0,            η̂′(De − v̂) = 0.    (14)

Similarly, multiplying the KKT conditions (14) of the RC-Margin SVM (5) with parameter D by δ = 2/(α̂ − β̂) or δ² yields the KKT conditions (13) of the standard SVM (6) with parameter C. We know α̂ − β̂ > 0 by equality of the primal and dual objectives:

De′ξ̂ + De′η̂ + ½ ‖ŵ‖² − α̂ + β̂ = −½ ‖A′û − B′v̂‖² < 0.

This theorem proves that for appropriate parameter choice, the solution set of optimal parallel max-margin planes produced by the classic SVM with parameter C (x′w = γ + 1 and x′w = γ − 1) will also be optimal for the reduced-convex-hull problem with parameter D (x′w = α and x′w = β) using the relationship defined above, and vice versa. But it is not true that the sets of final single separating planes produced by the two methods are identical. The plane bisecting the closest points in the reduced convex hulls, i.e. x′w = w′(A′u − B′v)/2, is parallel to but not identical to the plane x′w = (α + β)/2 that would also be a solution of the original SVM problem once scaled. The thresholds differ. This is illustrated by Figures 6 and 7.

Figure 6. Optimal plane bisecting the closest points in the reduced convex hulls.

Figure 7. Optimal plane bisecting parallel maximum soft margin planes.

Figure 6 gives the solution found by the reduced-convex-hull SVM formulation, which finds the two closest points in the reduced convex hull and as a heuristic selects the threshold halfway between the points. But there is nothing explicit about the choice of threshold in the reduced-convex-hull formulation RC-Hull. In Figure 6, the closest points in the reduced convex hull are represented by an open square and an open circle. The solution found by the classic SVM is given in Figure 7. The classic SVM formulation assumes that the best plane bisects the two parallel margin planes. Note that the plane that bisects the closest points is nearer to Class A. In some sense the plane is shifted toward the class in which we have more confidence. It is not a priori evident which assumption for the choice of threshold is best. This property was also noted with the ν-SVM formulation (Crisp & Burges, 1999).

Our reduced-convex-hull SVM formulation differs from the ν-SVM formulation in that there are distinct margin thresholds α and β for each class instead of a single variable for both. Extensions of the ν-SVM formulation using parametric models for the margins are suggested in Scholkopf et al. (2000). Similar analysis to the above can be performed for the ν-SVM. We refer readers to Crisp and Burges (1999), which uses a related but different argument for establishing the correspondence of ν-SVM with the reduced-convex-hull formulation. Assuming there exists a unique nonzero solution to the closest points in the reduced convex hull problem and appropriate parameter choices are made, the reduced-convex-hull, classic, and ν-SVM will all yield a plane with the same orientation, i.e. w is the same modulo a positive scaling factor. But they do not produce the exact same final planes because the assumptions used to construct the thresholds differ.

5. Alternative Norm Variations

We have shown that the classical 2-norm SVM formulation is equivalent to finding the closest points in the reduced convex hulls. This same explanation works for versions of SVM based on alternative norms. For example, consider the case of finding the closest points in the reduced convex hulls as measured by the infinity-norm:

min_{u,v}  ‖A′u − B′v‖∞
s.t.  e′u = e′v = 1,  De ≥ u ≥ 0,  De ≥ v ≥ 0.    (15)

One method for converting the problem into a linear program (LP) produces:

min_{u,v,ρ}  ρ
s.t.  −ρe ≤ A′u − B′v ≤ ρe,  e′u = e′v = 1,  De ≥ u ≥ 0,  De ≥ v ≥ 0.    (16)

The dual is

max_{w,α,β,ξ,η}  α − β − De′ξ − De′η
s.t.  Aw − αe + ξ ≥ 0,  ξ ≥ 0,  −Bw + βe + η ≥ 0,  η ≥ 0,  ‖w‖₁ = 1.    (17)

For an appropriate choice of C, this is equivalent to solving the typical 1-norm SVM:

min_{w,γ,ξ,η}  Ce′ξ + Ce′η + ‖w‖₁
s.t.  Aw − (γ + 1)e + ξ ≥ 0,  ξ ≥ 0,  −Bw + (γ − 1)e + η ≥ 0,  η ≥ 0.    (18)
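As a rough illustration (ours, not from the paper), the LP (16) above can be passed directly to scipy.optimize.linprog; the data, seed, and the value of D are arbitrary.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)

A = rng.normal(size=(15, 2)) + [2.0, 2.0]
B = rng.normal(size=(15, 2)) + [-2.0, -2.0]
m, k, n = len(A), len(B), A.shape[1]
D = 1.0 / 3.0

# Variables z = (u, v, rho); objective of LP (16): minimize rho.
c = np.concatenate([np.zeros(m + k), [1.0]])

# -rho e <= A'u - B'v <= rho e, rewritten as two "<= 0" blocks.
M = np.hstack([A.T, -B.T])                        # n x (m + k), maps (u, v) to A'u - B'v
A_ub = np.vstack([np.hstack([ M, -np.ones((n, 1))]),
                  np.hstack([-M, -np.ones((n, 1))])])
b_ub = np.zeros(2 * n)

A_eq = np.zeros((2, m + k + 1))                   # e'u = 1 and e'v = 1
A_eq[0, :m] = 1.0
A_eq[1, m:m + k] = 1.0
b_eq = np.ones(2)

bounds = [(0.0, D)] * (m + k) + [(0.0, None)]     # 0 <= u, v <= De,  rho >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
u, v, rho = res.x[:m], res.x[m:m + k], res.x[-1]
print(res.status, rho, np.abs(A.T @ u - B.T @ v).max())  # the max-abs gap equals rho
```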


Similarly, finding the closest points of the reduced convex hulls using the 1-norm is equivalent to constructing a SVM regularized using an infinity-norm on w. Specifically, solving the problem

min_{u,v}  ‖A′u − B′v‖₁
s.t.  e′u = e′v = 1,  De ≥ u ≥ 0,  De ≥ v ≥ 0    (19)

is equivalent to solving (for appropriate choices of D and C)

min_{w,γ,ξ,η}  Ce′ξ + Ce′η + ‖w‖∞
s.t.  Aw − (γ + 1)e + ξ ≥ 0,  ξ ≥ 0,  −Bw + (γ − 1)e + η ≥ 0,  η ≥ 0.    (20)

Limited space does not allow a full development of this argument.

6. Conclusion

The simple geometric argument of finding the closest points in the convex hulls or reduced convex hulls of two classes can be used to derive an intuitive geometric SVM formulation. Users can grasp visually the primary notions of SVM necessary for successful implementation without getting hung up on notions of duality. The reduced-convex-hull formulation forces the optimal solution to depend on more points depending on the parameter D ∈ (0, 1). If D is too large, the reduced convex hulls intersect, and the meaningless solution w = 0 results. If D is too small, the dual problem will be infeasible. We rigorously showed this formulation is exactly equivalent to the classic SVM formulation for appropriate choices of parameters. Assuming the parameters are well-defined, the solution sets of the problems are the same modulo a scaling factor dependent on the size of the margin. But the final choice of threshold will vary depending on the assumptions of the user. From an optimization perspective the reduced-convex-hull formulations may be preferable due to the interpretability of the misclassification parameter and the availability of fast nearest-point-in-polytope algorithms (Keerthi et al., 1999). If the 1-norm or infinity-norm is used to measure the closest points in the reduced convex hull, the analogous analysis can be performed showing that the primal problem corresponds to the SVM regularized with the infinity-norm or 1-norm of w respectively. Thus the reduced convex hull argument holds for 1-norm SVM linear programming approaches.

Acknowledgements

This material is based on research supported by Microsoft Research and NSF Grants 949427 and IIS-9979860.

References

Bennett, K. P., & Bredensteiner, E. J. (in press). Geometry in learning. In C. Gorini et al. (Eds.), Geometry at work. MAA Press. Also available as http://www.rpi.edu/∼bennek/geometry2.ps.

Burges, C. J. C., & Crisp, D. J. (1999). Uniqueness of the SVM solution. Proceedings of Neural Information Processing 12. Cambridge, MA: MIT Press.

Crisp, D. J., & Burges, C. J. C. (1999). A geometric interpretation of ν-SVM classifiers. Proceedings of Neural Information Processing 12. Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). A fast iterative nearest point algorithm for support vector machine classifier design (Technical Report TR-ISL-99-03). Intelligent Systems Labs, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India.

Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Operations Research, 13, 444-452.

Mangasarian, O. L. (1969). Nonlinear programming. New York: McGraw-Hill.

Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. Advances in kernel methods - Support vector learning. Cambridge, MA: MIT Press.

Scholkopf, B., Smola, A., Williamson, R., & Bartlett, P. (2000). New support vector algorithms. Neural Computation, 12, 1083-1121.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44, 1926-1940.

Vapnik, V. N. (1996). The nature of statistical learning theory. New York: Wiley.



� ����������

�� �� ��� �� +� �� ���� � �� �������� � ��1 ? � � �� ��� �� � � �������� �� � , 8&� @:1

� � ���� ���� ����� � � "�� � ����� � � +��� ���� � ��0��� � ��� ("� ������� � +��� ���� ���� � ���� � � � � � ��� ������ ���1 7�� � �� ���0���� � +�� �3 ��A�1 ��� � � � � � � � � ���� B � � � � � � B �1)1 �� C �� � ���C ��� � � ���� �� � � � 1'1 ���� B ��� � ��� � � � ��� ��� � ��� ���� � � � 1�� �� � ��� ������ � � � ���� "�� � ������ ��� ������ � �.���/ B �� ������� � �� � � �1

� ����� ���� �� � � � ���� ����� � � ����� � ���� � � � ���� ������ �.�/� � ��� � � � � � � 1 7�� ����� � � � ����� ���� ���� �� � �� � �� �� �� � ���� ��� � � ��� � ��� �� � �1 7�� � �� � � � � ��� B �� A �.�/� � ���� � � � � � �1 D����� �� +� � �� ��� � �� � ����� � �� ��� ��� � �� � � ��� ��� � � �� "� �� � �� ���� �� �� � � � �� � � 1 �� �� � �� ��0� � +� ��"� ��� ��� � �������1

7�� ���� � � � � �������� � ��� �� � ���� ���� ��1 � ������ ��� � �� ���� "�� � ����� � � � ��3�� �� ��� ���� ��� �� �� � ���� ���� ��� ���� �� � B � � ��� � � ��� ���� ���� �� � ����� � � ��� ���� B � � B � 1 D"�� � ������ ��� � � ���� � � ���� ���� �� �� � � ���

References

Page 257: A quick introduction to Statistical Learning Theory and Support Vector Machines

� � ���� � ���� �� � B � A ��� ��� B �1 � �"���� �� � �� � � � �4�� �������� �� � � � �� �� � A ��� ��� B � � � ������ ��� � � 1

� ��� ���� ����� � � ��� ���� "�� � ����� � ����� +� �� ���� �� ����+��� � � ��� �� � � �� � ��� ��� �� �� ��� �� +���� �� � � � � 7�� ������ ��� ���0�� �� � +�� �3 ��A�1 ��� �� B ��� ���)1 �� C �� �� B ��� �� C ��� ���'1 ��� �� B ��� ���%1 ��� �� � � ��� ��� �� B � � ��� � � � � B ��

*� ��� � � ��������� �� � +��� ��� �� B ������ �� � ��� �� �� ��� �������

� ����� � � �� .)(� ��/1 ? � � �� � � �� ���� �� ��� ������ ���� ��� � �������7�� � �� �� ������(���+��4 ��;�� �1 E;�� � � �� � ��� � � � � B � �� B �� � � �� �� ��� ���� ����� � �� �� � ��� F ��� �����1 �� �� � �� � ����� �� ��� ����� �� � � F ��� ����� �� ���� �3�� � ��;�� "�� � � � � ������ � � � � � �� ��� ��� B ��� ��� ?������ ��� +� ��"� ���� B ����

� ���������

�� �� ��� �� +� ������ �������(� �� �������� ������ ����1

������� ��� �� �� �� � � � � � ��� � ���� "�� � ����� � � ��� � � �����(� ��� � A ��� ��� B �� �� � � �1 7��� �� ������ �� � �� � �

� B �����

��� � �� B����� �

�� � ��

�����

E����� �� � B ��� +��� ���� B � ��� � B �1�� �� �0��� � �� �� � ����"�� �� � �� � � �� � � ��� � ���� �

�� � ����� �� � � ���� +� �� � ��� 1�1 ��� � ��� �

�� B ��� � ��������

�����1 ? � ��� ��� � � B � ������ �� ������ ���� ��� ��� �� ������� ������ � ��1 7��� +� ��� �� + ��

� B �����

��� B���

�����

�� � B � C � +���� � � ��3�� �� ��� �������� � � 1 E ����� � � �����;�� � ���������� � � �� � �� � B �� C �� +� � � � 1 �� ��� ��� B �� � �� � B � A ��� ��� B �1 *� ��"�

���� B ������

���� ����

���B ���

����

��� C ��

B �������

����� C ���

B���

�� �� C ����

# � � B �� �� C ���� � � B ��������1

�� �� �0��� � �� �� � ����"�� �� � �� � � �� � � � ��� � ���� ��� � ��

���� �� �� �� "� ���� � +� ��"�

� B ��� � ��� B���� �

�� � �

����

B���� �

�� � ���� ���

����B��� � ��� �

��

�����

� ��� � ��� ��� B ��� � �����

���7�� �� "�� �� �� ���1

References

Page 258: A quick introduction to Statistical Learning Theory and Support Vector Machines

������ � �� F ��� ����� �� � � ��� �� � ��� ���� �3�� � ��;�� "�� � � � ����� �� ��� B ���� ��� ����� B ��� ��� � � � � � �� �� �� �� "� �� ���� ��������� � � � ��� +� ��"� ��� � ��� �� B ��� � ������1 E;�� � � �� � ��� � � � �� � �� B ��� � � �1 � � � ��� � � � � +� ��"� ��� � ����� B ������ B������ B �1 7�� ����� ����� � ��������� �� � 1 7�� ��� � � �,������ ?���� '1�1

X1H

X0

?���� '1� ����� ������A �� � � � � � ������ ���

������ � �� �� ������� � ���� ������ ���� � �� � ��"� � ��;�� � � �� ��� �1 *� � ����� � ��� � + (����� �� �3��� �8:1 �� � �� �� ����� ����� � ��� ������� � B .��� ��/ +� ��� B ��3���� � ����1 �� � �� �� ��������� B � A �� B �� ��� � ����� �� 03�� � � � B .)� �/� 7�� ������ �� � � � � �" �� � �� �� ��� "�� � � � �� ����� � A � � �� � '� �� B � � ����0�� ����� B ��

������ � 7�� �� � � �� ������ ����� �� � � � �� � � �"�31 �� �� ��� ���� + ������ ����� ��� ��� � B .� � /�� C ��� � � � �1 6�" �� � � � � �� ��� � �� � �� �� ��� �� �� �� ���� � ��;�� � +� ��"�

��� � �� B ��� � 8.� � /�� C ��:�

B �.�� /.�� � ��/ C .�� � ��/�

� .�� /��� � ���C ��� � ���

B .�� /� C � B ��

� ��� � �� B �� 1�1� � � �� � �� ������ ����� � � ��1 7�� �� "�� ��������1

������� �� � �� � ��� ������ � � � � � �17�� ����� � � ���� � � ��;������ � ��� ��� ��� ��� ��� � � +���

��

���

���� � ��

7�� � �� � �� � ���� � B �� � � � ��0��� ��

��� B .

��

���

����/

�� �

�� � �� �� � �"� ������ .�������� � � �/ ���� ��

�C

�B ��

References

Page 259: A quick introduction to Statistical Learning Theory and Support Vector Machines

��� �� � �� ��� �3� ��� � �1 7�� ��� ����� � � � � 8&� @:17�� ����� �� � ���� � � ����� ��;������1 7�� � �� � �� � ���� � B �� ���0��� ��

��� B ����

�����

7�� ��� ����� � �� � ��8&� @:1�� �� � �.� � � � �� � � B ��+� �,� � B �/ �� ��� � � +��� � � � �������� ��� � B � A ����� B � +���� � � �� 7��� �� ������ �� � �� �������� ��� � �

� B �����

��� � �� B����� ���

����

$ � � � ��� �� �� ���� �� � ����� � � ���� �� ����� � G�����(G���(7��,�� ��� � ���� �8>� � :1

������� ��� �� � �� � ��� � ���� "�� � ����� ��� � � � � ��� �����1 D"���� + ���� � ������ ���� �� B � A ��� ��� B �� ��� �� B � A ��� ��� B ��+���� �� � � �� ��� �� � �� ��� �� B ��� ��� �� ������ ��+��� �� ��� �� �

� B ����������

��� �� B��� � ���

�����

�� �� �0��� � �� �� � ����"�� �� �� � ��� �� � �� ��� � �� � ��� ��� ��

� � ���� +� �� � ��� 1�1 ��� � ��� ��� B ��� � �����

���

�����1 �� � B � A � B �� � ��� �� � ��� �� � ��1 7��� � B ����� A � � �1# � � B � A ���� ��� B �� � ��� 1�1 � � � � ������ ���1 �� �� ��� '1� +���"� � B ��� � ������

����� �� �0��� � �� �� � ����"�� �� �� � �� ��� �� � ��� 1�1� ���� �

�� B ����� ���� �

�� B ��� ��� � �� � ��� ���

� B ��� � ��� B�� � ������

B���� �

�� � ���� ���

����B��� � ��� �

��

�����

# � � B ��� � ���� +� ��"� ��� � ��� ��� B ��� � �����

���7�� �� "�� �� �� ���1

������� � �� .��� ��/� .��� ��/� � � � � .��� ��/ � ��� � ��� � �� �� ����� ���(� �1 �� � B �� A �� B �� � � � � ��� � � ��� � B �� A �� B ���� � � � ��7���� �3�� � � � ��� .�� ��� ����� � ��� � � ���/��� � ������ ���� �� ����� � C � � � � � � � ��� ����� � � � � � � � � �17��� �� B � A ����� B C ���� �� B � A ����� B � � � ��� ��+ ���� � ������ ����1 �� �� �� ��� �� ������ ��+��� �� ��� �� ��.� C �/ � .� � �/������ B )�����1 � ��3�4�� �� ������ � �� �� + ���� � ������ ���� � � "� �� � +�� ��4� � �� � ��A

����

)�����

������ ��.���� �� C / � � � B �� )� � � � � ��

7�� � �� ������� ���� �-� )(� �� � ��� � �89� ��:1

������� � �� .��� ��/� .��� ��/� � � � � .��� ��/ � � � ��� � �� �� ����� ���� �+���� � � � � �����1 �� � B �� A �� B �� � � � � ��� � � ��� � B �� A �� B���� � � � �� 7���� �3�� � � � ��� ��� B � .� � �� � � � �����/ ��� �

References

Page 260: A quick introduction to Statistical Learning Theory and Support Vector Machines

������ ! � � ���� �� ����� � ! � � � � � � ��� ����� � �! � � � � � �1 ���� B � A ����� B ! ��� �� B � A ����� B �!1 �� �� �� ��� �� ��������+��� �� ��� �� � �!� .�!/����� B )!1 � ��3�4�� �� ������ � ����+ �������� ������ ���� � � "� �� � +�� ��4� � �� � ��A

��3�����

!

������ ��.���� ��/ � ! � B �� )� � � � � �

��� B ��

$ � � � ��� �� �� ���� �� � ����� � � ���� �� ����� � G�����(G���(7��,�� ��� � ���� �8>� �:1

� � �����

�� �� ��� � +� ������ �� ��� �� � ��� � �������(� �� �-� � ���0���1

�� " �� � � �"�3 �� � � ��� � ���� "�� � ����� � � 7�� ���� �� #�.��/ B������ ��� ��� ��0��� � � � � �� �� �� ����� ���� � ���� �� � " ��� �������� ��� � B � A ��� ��� B #�.��/ � ����� ���� � ������ ��� � � "� � �(���� ���� �� �� ���� �� ��.��/ B ����� ��� ��� � �� �� �� +�� ���� ����� �� ��� � B � A ��� ��� B ��.��/ � +�� ���� � ������ ���1

������� ��� .����� # �� ��� �/ �� �� �� � � � � � ��� � ���� "�� ������ � ��� " � � � �"�3 �� ��� � � ��� � � ������ �� � �� � �"�3 �� "=���

� B �����

��� � �� B ��3�����

8���� ��� � #�.��/:

����

+���� �� ��3��� � �� ��� � ����"�� �� � �� ��� � � ���� �� �0��� � �� �� � ����"�� �� � �� � � �� � "� ��� ��� � � ���� +��� � ��� 1�1 ��� � ��� �

��� B ��� � �����

����

������ ? � ��� �� +� �,� �� B � ���� �� ������ ���� ��� ��� �� ������� ������ � �1 7��� +� ��� �� + ��

� B �����

��� B ��3�����

�#�.��/

�����

*� 0�� �� + �� � � ��� �� � � �� +� ��"� � � �#�.��/������ ? � �� +� ��� � � � ����� �� A #.��/ � �� 6�" �� �� � #�.��/ � �� ��� �� ������ ���� B � A ��� ��� B #�.��/ �������� " ��� ���� $.%/ B � A ��� � %� ? � ��� �� � � � +� #�.��/ � �� � %� �� ���������� � �� % � � +��� �� ������ ��� � A ��� ��� B #�.��/ �������� "��� $.%/� 6�" �� �� +� ��"� #�.��/ B ������ ��� ��� B �%���

�� � ������ .� ��� %� � �/�6� �� ��� ����� ���� " � ���� � ��� � � �� � $.�/� ���� � � ������ ����������� $.�/ ��� "� 7����� ��� ��� � � ��� � � �� ���� �� �#�.���/ B �������7��� �� 0�� ��� � �� �� ��� � �� "��17 �� "� �� ��� �� ��� � � ������ ���� �� �� �� � "� ���� B �� 7������� �

��� � #�.���/ B �������� F +�"��� ����� �

��� � ������

��� B ������� 7���

����� ���� B ������

��� ��� ��� � � ���� +� ���1

������ � 7�� �� ��� � �� ��� �� � �� � ,8@: +� �� � "��� ��1 ? ��� � �"������ � ����������� �� � +�� ����� �� �� �� � � �� �� ���� � � � �����1

References

Page 261: A quick introduction to Statistical Learning Theory and Support Vector Machines

������ � � �� � �� ����� � �� ������ � �� ��� ���� � �� �� "� �� �1�� � ��� � �� �� ������ � � ��� � ���� "�� � ����� � 1 *� ��� ��� ��� ���� � ������� � � ���� � � �� � � � ���� ��

������

��� ��� � �����

��� ����

�� ��� + ���� ���� �3�� � �� � � � ��� � ������ & � �� ���� �� ��� ��� � &� � � � � � ��� ��� ��� � & � � � � � �1 *��� ���� ��;�� �� � � ��� �� +���� ��� ��� ��� � ���� � ������� �1 �� � ��� � ��� � �"�3� ��� � ��� ��� �� �� ��� � � ���� � ��� � � � � �� ��� � ��� � ��� ���� � ������� �8&�@:1

������ � 7�� �� ��� � "��� ��"� � �� ����1 �� ���� ����� �� ������� ������ �� � � � � � � �"�3 �� " � �;�� �� ��3��� � �������� �� � �� � � ���� � ������ ���� �������� �� � �� � � ��� ��� �"�3 �� "� 7�� ��� � � �,����� � ?���� %1�1

K

x0

?���� %1� ��3�� ����� ��� �A �� � � � � � � �"�3

������� ��� �� "� ��� "� ��� �� � �� ��� ��� � ��� � �"�3 ��� � � ��� � ���� "�� � ����� � ��� � � ��� � �� ������ ��+��� "� ��� "�= ���

� B ����������

�� � �� B ��3�����

���.��/ � #��

.��/

����

+���� �� ��3��� � �� ��� � ����"�� �� � �� ��� � � ���� �� �0��� � �� �� � ����"�� �� �� � "� ��� �� � "�� ��� ��� � � ����+� �� � ��� 1�1 ��� � ��� �

��� B ��� � �����

����

������ �� " B � A � B ���� � � "�� � � "�� 6�" �� �� " � � � � �"�3 ��1?������ ��� "� �"� B ' ���� "� ��� "� ��� ��� �� � � � � �� � ���� �"� �� �� ��� %1�� +� ��"�

� B �����

��� B ��3�����

�#�.��/

�����

# � �������

��� B ����������

�� � ���

����#�.��/ B ���

.��/� #��.��/�

References

Page 262: A quick introduction to Statistical Learning Theory and Support Vector Machines

7��� +� ��"� �� "�� �� 0�� ��� � �� �� ���17 �� "� �� ��� �� ���� ���� �� �� �� � "� ��� �� � "�� ��� � ��� B �� ���� B �� � ��� �� �� ��� %1�� +� ��"� ���� � ��� �

��� B ����� �

��� B ������

��� B

��� � ��������� � ��� � � ���� +� �� � ���

������ 7�� �� ��� � "��� ��"� � �� ����1 �� ���� ����� �� ������� ������ ��+��� �� + � �"�3 ��� "� ��� "� � �;�� �� ��3��� � �� ������ ��+��� � ��� � ���� � ���� � ������ ���� �������� "� ���"� �� �� � �� ���1 7�� ��� � � �,����� � ?���� %1)1

K1

K2

?���� %1) ��3�� ����� ��� �A �� � � � �"�3 �� ��� ��

������� � �� .��� ��/� .��� ��/� � � � � .��� ��/ � ��� � ��� � �� �� ����� ���(� �1 �� � B �� A �� B � ��� � B �� A �� B ��� �� �(.�/ ��� �(.�/ ��� ��� �"�3 �� � � � ��� � ������"� �1 ���� �� �� � � ���� ����� � � � � � � � � ��� ����� � ) � � � � � �1 7��� ������.�/ B �� ��� #�����.�/ B )�7�� �� � �� � ��3�4�� 8������.�/� #�����.�/:����� ��� �� +��� �� �� ��� +�� ��4� � �� � ��A

�������

)����� � .�� )/

������ ���� �� � � � � ��� �� � �

���� �� � ) � � ��� �� � ��

�� �� ��� %1) � ��� �� � �� �

�����

)�H����*��� �H����+����

������ H*� B �� H+� B �

*� � �� +� � ��

7�� �3��� � �� +� �� � � �� ������� � ���� 0���� �� ��3��� ����� ��(+��� + � ����� � �;�"� �� 0���� �� ������ ����� �� � �� � �"�3 �� � � ���� � ���1

������� � # + ���� �� �� � ��� � � �� �� "� �3��� � ��� �������� �1 ��� � � � �1 �� �.�� �/ B � A � B H*���� H*� B �� � � *� � �� �� � � ����.�� �/ B � A � B H+���� H+� B �� � � +� � �� �� � �� �.�� �/ ��� �.�� �/��� �� �� �� ������� � �"�3 �� � � � ��� � ������"� �1 ���� �� �� �.�� �/��� �.�� �/ ��� ������� � � +1 � ���� �3�� � � ��� ���� �� ����� � � � �� � � �.�� �/ ��� ����� � ) � � � � � �.�� �/1 7��� +� ��"�

���� �� � �� ��� � � ��� �� � �

References

Page 263: A quick introduction to Statistical Learning Theory and Support Vector Machines

+���� �� B � � � � �.�� �/ � � �� �� B �� ���� ��� ��� � �� ��

���� �� � ) C ,�� � � ��� �� � �

+���� �� B � � � � �.�� �/ � � �� ,� B ���� �� � )� 6�" �� �� �� � �� ,� � �1 � +� ��"�

���� �� � �� �H������� � � � ��� �� � �.�� �/

���� �� � ) C �H����,�� � � ��� �� � �.�� �/�

7�� ������������.�/ B �� �H������

#������.�/ B ) C �H����,��

� ��3�4��8�������.�/ � #������.�/:�����

� �;�"� �� ��3�4��

8.�� �H������/ � .) C �H����,�/:������

7�� ��� ��� �� �� +��� �� �� �� � +�� ��4� � �� � ��A

�������

�.H�� C H,�/ C�

)����� � .�� )/

������ ���� �� � �� ��� � � ��� �� � �

���� �� � ) C ,�� � � ��� �� � ��

�� �� ��� %1) �� � ���� ��� �� �� +��� �� �� �� � +�� ��4� ��� � ��A

�����

)�H����*��� �H����+����

������ H*� B �� H+� B �

� � *� � �� � � +� � ��

7�� �3��� � �� +� �� � � �� �������� � ���� 0���� �� ��3��� � � �������+��� + � ����� � �;�"� �� 0���� �� � �� � �� � �� ������� � �"�3�� � � ���� � ���1

������ 7�� �������� � �� �� "� + �3��� �� ��� 0�� � �� � ��� � ���(��5� �����8�:1 �� ��� � ��� �� � ����� � � ���� �� ����� � G�����(G���(7��,�� ��� � ���� �1

� ��� ���

�� �� ��� � +� ������ �� � � � � �-� � ���0��� � F ��� ������1

������� ��� �� "� ��� "� ��� �� � �� ��� ��� � ��� � �"�3 ��� � ���� F ��� ����� �1 �� �."��"�/ B ��� � ��� +���� �� � "�� �� � "� 1 ��� B �� � ��� ���� �� B ��� ���� �� B ��� �� � ��� �� �� B � A ����� B �� ����� B � A ����� B ��= ���A .�/ �� ������ ��+��� �� ��� �� � ��� �������� ��+��� "� ��� "�= .)/ �� � � +�� ���� � ������ ��� � "� ��� ��

� � ����� ���� � ������ ��� � "��

������ *� 0�� � �� + �� 0�� ���1 �� �� ��� '1)� +� ��"�

�.��� ��/ B�� � �����

B���� �� � ���� ��

���

B��� � ��� ��

���B�����

���

B ��� B �."��"�/�

References

Page 264: A quick introduction to Statistical Learning Theory and Support Vector Machines

7 �� "� �� ��� �� ���� ���� �� �� ���� +��� � � � � � "� � � +��� ����� ���� �� * B � C .�� /��� � ��� ���

�*� ����

B �� C .�� /�� � ��� � C .�� /�� � ���

B �.� � ��/ C .�� � ��/� .�� ��/ C .�� � ��/�

B ��� � ���� C ���� ���

� C )��� ��� �� � ����

��

� B )��� � ����

��� �����

���� ���� �� B �� ��� ����� � ��� +� ��"� � � �� �� � B ���� �� 7����� � ��� � .�� �/� +� ��"� �*� ��� � ��� � ���� 7�� � ������ �� � �� ��� �."��"�/ B ��� � ���� ����� �� �� � �"�3� � "�� +� ��"� * � "�� F��������� � �� � � � � � "�� ��� � �� �� +� ��"� ����� � �� � � � � � "��7�� �� "�� �� �� ���1

������ �� �� ����� �� �� ��� ����� �� � �� � "� ��� �� � "� ��� ���� � � �� �� ��� � ��� ���� ���� ��� �� ������ ��� ������� ��� �� � �� � �� �� ������ ��+��� �� ��� �� � � �������� ������ ��� � �"� ��� "� +� �� ��3�� �����1 7�� ��� � � �,����� � ?���� &1�1

K1

K2

?���� &1� 7�� ��3�� ����� �������� ������ ���

������� ��� �� "� ��� "� ��� �� � �� ��� ��� � ��� � �"�3 ��� � � ��� F ��� ����� �� 7��� ���� � �� ��� � � �� �������� ������ ��� � � "� ���"� +� �� ��3�� �����1

������ �� �� ��� &1�� ���� �3�� �� �������� ������ ��� � � �� � �"�3 ���"� ��� "� +� �� ��3�� �����1 # + +� � � ���� �� + �� ��;������ � ���� ������ ���1 ���� �� �� �� � "�� �� � "�� � � � �����

�."��"�/ B ��� � ��� B���

.�/ � #��.�/

����

�� �� ��� �� �� �� +�� ��� ����� ���� � ������ ���� � �� � �"�3 ���"� ��� "� ������"� � ������ �� �� 1�1 �� B � A ����� B ���

.�/ ����� B � A ����� B #��

.�/� �� �� ��� %1) +� ��"� ������� �� B ������������� �� B �� � ��� �� ������(���+��4 ��;�� �� ���� �3�� � � �� ���� ��� B ���� 7��� �� B � A ��� ���� B ���

.���/ B � A ������ B ���.��/�

��� � �� �� �� B � A ������ B #��.��/� 7���� ���� �� ��� �� ��� ������

References

Page 265: A quick introduction to Statistical Learning Theory and Support Vector Machines

� � �� ���7�� �� "�� �� �� ���1

������ # � �� �� � ���5 ������ � �� �� �� � �� ������ ����� �� �"� ��� "�� �� ���� ���� �� �� ��� �� � "�� ��� �� � "� �����

�."��"�/ B ��� � ��� B ��� � ����

��� +� ��"� �� � �� B �� � ��� �� - B � A � B � � �� � � �� � � �� 7 �� "� �� ������� +� � � ���� �� "� �� ���� �3�� � � �� � ���� � � +������ �� ��� � ��1 �� � ��� � �� ��� � ��1 ���� �� �� ���� ��� ��� �� � -�� ��������� B ���� B �� 7���

��� � ���� B ��� � ��� �� � ���

B )����� C )����

� � ��� C ����

B )�� C )�� � )���� C ��

)��

���� .�� C ��/�) � -� +� ��"�

��� C ��

)� � ��

7�� ����� ��� � ��� � �� �� +� � +��� ��"� ���� ��� � �� � ���� ��� B �� 1�1�� B ��� 7�� �� "�� �� ������1

� ���� ���

*� � ����� �� � ����"� �� ���� �����+ �, � � �-� � ���0���1 *�� �������+ �,� �������� ������ ����� ��� ������ �� ��� � � �� � �-� � ��(�0��� � ��� � ������� � ��� � �� � ���� �� ���� ���� �1 # + � ,��� ��-� � ��� � �� ��� ��� ��� ������ �� �� ���� �� �������(� �� �������� � � �2���� �3����� �� � �� ����� ��� � �� � �� ���� ������ �1 � �(����� +� �� ��4� � �� �� � �-� � ���0���� +� � �5 ���� ��������� ������ � ���� �� ��� � ��1 ������� �"��� ��� � �� �� �� � ����� ��� ���� �� ���� �� ��1 � �� �����+ �, ��� ��,� �� � � ��������� �-�� � � � "��� ���� �1 �� ��� �� � � � �� ��� �3� �� �� � ��� � ����+��� �-� ��� ��� ������ � � ����1

!������"����#$

����� �� I� � +���� ���, ���� ��� � ��+����� ���,� �� #� �� ��� (�� ��� �� D� ��� $��3� I���� � ������ ������� � �������� ��� � ��� I����� J�� F�� � #� �� !�"���� � ������ ��� 7���� �� � ���� � � �� ��� ������ �� ��� � �����1

��%����!�$

"!# $� �� %���� & '� %������ ������ (����)� *���� �� ��� +���� �� �� �,- ������.������ ��������� �� � � ���������� ������������� ���������� �� ��� ��� �������� �/��01234� -��/�� $��������

"�# � %� & $� �� %���� � (����)� *���� �� +���� ��� ��� ���� ,�� �� �/�������� ������ �� ������� �� ������ ����������� ������� ������

"5# �� %��/�� & *� ����� (����)� 6��7������ �� �� �,- ���� ���� �� ������� �� ����������������� ������� ����� ��� �/�� ��52��8� �������/�� -�� -�� ������

References

Page 266: A quick introduction to Statistical Learning Theory and Support Vector Machines

"4# *� ���� & �� %��/��� (����)� � /���� ��� �� ���� � ��� �� �2�,- ������.���� ��������� �� ������ ����������� ������� ����� ��� �/�� �442�0!� �������/�� -��-�� ������

"0# �� ���/� (!885)� ���� ��� ���������� ������ (����� '�� ���)� ����/��2,����/� ��9:��;�

"3# *� +� �������/��� (!813)� � ����!����� "� #����� � ��� ��� ��� ��� <���� & �������9 :��;�

"1# =� �� -��/�������� (!888)� ���� ����2���� ����� ��/ ������ � ������� ������ ������ �4� �/�� !02�5�

"># +� ?� ���� %� ���?��;��� �� -�;� & $� -?������ ( ����)� �,- ��� %��� ��/@ =�������� ��������� ��� !!8� +-* A���� %������

"8# �� ���� ������ & � ���9�2������� (����)� �� ������������ �� �� ��� #����� ��� �����������/� 6��B���� � ������ �������/�� 6$�

"!�# ,� ,���;� (!880)� $ � ������ �� ���������� ������� $ ����� ����/��2,����/� ��9:��;�

References

Page 267: A quick introduction to Statistical Learning Theory and Support Vector Machines

Journal of Machine Learning Research (2001) 45-66 Submitted 10/01; Published 11/01

Support Vector Machine Active Learning with Applications to Text Classification

Simon Tong [email protected]

Daphne Koller [email protected]

Computer Science Department, Stanford University, Stanford, CA 94305-9010, USA

Editor: Leslie Pack Kaelbling

Abstract

Support vector machines have met with significant success in numerous real-world learning tasks. However, like most machine learning algorithms, they are generally applied using a randomly selected training set classified in advance. In many settings, we also have the option of using pool-based active learning. Instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can request the labels for some number of them. We introduce a new algorithm for performing active learning with support vector machines, i.e., an algorithm for choosing which instances to request next. We provide a theoretical motivation for the algorithm using the notion of a version space. We present experimental results showing that employing our active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.

Keywords: Active Learning, Selective Sampling, Support Vector Machines, Classification, Relevance Feedback

1. Introduction

In many supervised learning tasks, labeling instances to create a training set is time-consuming and costly; thus, finding ways to minimize the number of labeled instances is beneficial. Usually, the training set is chosen to be a random sampling of instances. However, in many cases active learning can be employed. Here, the learner can actively choose the training data. It is hoped that allowing the learner this extra flexibility will reduce the learner's need for large quantities of labeled data.

Pool-based active learning for classification was introduced by Lewis and Gale (1994). The learner has access to a pool of unlabeled data and can request the true class label for a certain number of instances in the pool. In many domains this is a reasonable approach since a large quantity of unlabeled data is readily available. The main issue with active learning is finding a way to choose good requests or queries from the pool.

Examples of situations in which pool-based active learning can be employed are:

• Web searching. A Web-based company wishes to search the web for particular types of pages (e.g., pages containing lists of journal publications). It employs a number of people to hand-label some web pages so as to create a training set for an automatic classifier that will eventually be used to classify the rest of the web. Since human expertise is a limited resource, the company wishes to reduce the number of pages the employees have to label. Rather than labeling pages randomly drawn from the web, the computer requests targeted pages that it believes will be most informative to label.

• Email filtering. The user wishes to create a personalized automatic junk email filter. In the learning phase the automatic learner has access to the user's past email files. It interactively brings up past email and asks the user whether the displayed email is junk mail or not. Based on the user's answer it brings up another email and queries the user. The process is repeated some number of times and the result is an email filter tailored to that specific person.

• Relevance feedback. The user wishes to sort through a database or website for items (images, articles, etc.) that are of personal interest—an "I'll know it when I see it" type of search. The computer displays an item and the user tells the learner whether the item is interesting or not. Based on the user's answer, the learner brings up another item from the database. After some number of queries the learner then returns a number of items in the database that it believes will be of interest to the user.

The first two examples involve induction. The goal is to create a classifier that works well on unseen future instances. The third example is an example of transduction (Vapnik, 1998). The learner's performance is assessed on the remaining instances in the database rather than a totally independent test set.

We present a new algorithm that performs pool-based active learning with support vector machines (SVMs). We provide theoretical motivations for our approach to choosing the queries, together with experimental results showing that active learning with SVMs can significantly reduce the need for labeled training instances.

We shall use text classification as a running example throughout this paper. This is the task of determining to which pre-defined topic a given text document belongs. Text classification has an important role to play, especially with the recent explosion of readily available text data. There have been many approaches to achieve this goal (Rocchio, 1971, Dumais et al., 1998, Sebastiani, 2001). Furthermore, it is also a domain in which SVMs have shown notable success (Joachims, 1998, Dumais et al., 1998) and it is of interest to see whether active learning can offer further improvement over this already highly effective method.

The remainder of the paper is structured as follows. Section 2 discusses the use of SVMs both in terms of induction and transduction. Section 3 then introduces the notion of a version space and Section 4 provides theoretical motivation for three methods for performing active learning with SVMs. In Section 5 we present experimental results for two real-world text domains that indicate that active learning can significantly reduce the need for labeled instances in practice. We conclude in Section 7 with some discussion of the potential significance of our results and some directions for future work.


Figure 1: (a) A simple linear support vector machine. (b) A SVM (dotted line) and a transductive SVM (solid line). Solid circles represent unlabeled instances.

2. Support Vector Machines

Support vector machines (Vapnik, 1982) have strong theoretical foundations and excellent empirical successes. They have been applied to tasks such as handwritten digit recognition, object recognition, and text classification.

2.1 SVMs for Induction

We shall consider SVMs in the binary classification setting. We are given training data {x1 . . . xn} that are vectors in some space X ⊆ Rd. We are also given their labels {y1 . . . yn} where yi ∈ {−1, 1}. In their simplest form, SVMs are hyperplanes that separate the training data by a maximal margin (see Fig. 1a). All vectors lying on one side of the hyperplane are labeled as −1, and all vectors lying on the other side are labeled as 1. The training instances that lie closest to the hyperplane are called support vectors. More generally, SVMs allow one to project the original training data in space X to a higher dimensional feature space F via a Mercer kernel operator K. In other words, we consider the set of classifiers of the form:

    f(x) = Σ_{i=1}^{n} αi K(xi, x).                                   (1)

When K satisfies Mercer's condition (Burges, 1998) we can write: K(u, v) = Φ(u) · Φ(v), where Φ : X → F and "·" denotes an inner product. We can then rewrite f as:

    f(x) = w · Φ(x),   where w = Σ_{i=1}^{n} αi Φ(xi).                (2)

Thus, by using K we are implicitly projecting the training data into a different (often higher dimensional) feature space F. The SVM then computes the αi's that correspond to the maximal margin hyperplane in F. By choosing different kernel functions we can implicitly project the training data from X into spaces F for which hyperplanes in F correspond to more complex decision boundaries in the original space X.

Two commonly used kernels are the polynomial kernel given by K(u, v) = (u · v + 1)^p, which induces polynomial boundaries of degree p in the original space X (footnote 1), and the radial basis function kernel K(u, v) = e^{−γ(u−v)·(u−v)}, which induces boundaries by placing weighted Gaussians upon key training instances. For the majority of this paper we will assume that the modulus of the training data feature vectors are constant, i.e., for all training instances xi, ‖Φ(xi)‖ = λ for some fixed λ. The quantity ‖Φ(xi)‖ is always constant for radial basis function kernels, and so the assumption has no effect for this kernel. For ‖Φ(xi)‖ to be constant with the polynomial kernels we require that ‖xi‖ be constant. It is possible to relax this constraint on Φ(xi) and we shall discuss this at the end of Section 4.
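As a small, self-contained illustration of Eq. (1) and of the two kernels just described, the sketch below evaluates a kernelized decision function on made-up training points and coefficients (all values here are hypothetical; this is not the paper's code):

# Sketch of the decision function f(x) = sum_i alpha_i K(x_i, x) for the
# polynomial and RBF kernels.  Training points, alphas, p and gamma are
# made up purely for illustration.
import numpy as np

def poly_kernel(u, v, p=2):
    return (np.dot(u, v) + 1.0) ** p              # K(u, v) = (u.v + 1)^p

def rbf_kernel(u, v, gamma=0.5):
    d = u - v
    return np.exp(-gamma * np.dot(d, d))          # K(u, v) = exp(-gamma (u-v).(u-v))

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]])   # the x_i
alpha = np.array([0.7, -0.3, -0.4])                          # the alpha_i

def f(x, kernel):
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))

x_new = np.array([0.5, 0.5])
print(f(x_new, poly_kernel), f(x_new, rbf_kernel))   # the sign gives the predicted class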

2.2 SVMs for Transduction

The previous subsection worked within the framework of induction. There was a labeled training set of data and the task was to create a classifier that would have good performance on unseen test data. In addition to regular induction, SVMs can also be used for transduction. Here we are first given a set of both labeled and unlabeled data. The learning task is to assign labels to the unlabeled data as accurately as possible. SVMs can perform transduction by finding the hyperplane that maximizes the margin relative to both the labeled and unlabeled data. See Figure 1b for an example. Recently, transductive SVMs (TSVMs) have been used for text classification (Joachims, 1999b), attaining some improvements in precision/recall breakeven performance over regular inductive SVMs.

3. Version Space

Given a set of labeled training data and a Mercer kernel K, there is a set of hyperplanes that separate the data in the induced feature space F. We call this set of consistent hypotheses the version space (Mitchell, 1982). In other words, hypothesis f is in version space if for every training instance xi with label yi we have that f(xi) > 0 if yi = 1 and f(xi) < 0 if yi = −1. More formally:

Definition 1 Our set of possible hypotheses is given as:

    H = { f | f(x) = (w · Φ(x)) / ‖w‖,  where w ∈ W },

where our parameter space W is simply equal to F. The version space, V, is then defined as:

    V = { f ∈ H | ∀i ∈ {1 . . . n},  yi f(xi) > 0 }.

Notice that since H is a set of hyperplanes, there is a bijection between unit vectors w and hypotheses f in H. Thus we will redefine V as:

    V = { w ∈ W | ‖w‖ = 1,  yi (w · Φ(xi)) > 0,  i = 1 . . . n }.

1. We have not introduced a bias weight in Eq. (2). Thus, the simple Euclidean inner product will produce hyperplanes that pass through the origin. However, a polynomial kernel of degree one induces hyperplanes that do not need to pass through the origin.
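Read as a membership test, the definition of V above says that a unit vector w belongs to the version space exactly when it puts every labeled instance on the correct side. A tiny sketch of that check follows, with toy data and the identity feature map Φ(x) = x standing in for the kernel-induced map (both are assumptions of this illustration):

# Membership test for V = { w : ||w|| = 1, y_i (w . Phi(x_i)) > 0 for all i }.
# Toy data; Phi is taken to be the identity map for this illustration.
import numpy as np

Phi_X = np.array([[1.0, 0.2], [0.8, -0.1], [-1.0, 0.3], [-0.7, -0.4]])  # Phi(x_i)
y = np.array([1, 1, -1, -1])                                            # labels y_i

def in_version_space(w):
    w = w / np.linalg.norm(w)                 # enforce ||w|| = 1
    return bool(np.all(y * (Phi_X @ w) > 0))  # every instance classified correctly

print(in_version_space(np.array([1.0, 0.0])))   # True: this w separates the toy data
print(in_version_space(np.array([0.0, 1.0])))   # False: misclassifies some instances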


Figure 2: (a) Version space duality. The surface of the hypersphere represents unit weight vectors. Each of the two hyperplanes corresponds to a labeled training instance. Each hyperplane restricts the area on the hypersphere in which consistent hypotheses can lie. Here, the version space is the surface segment of the hypersphere closest to the camera. (b) An SVM classifier in a version space. The dark embedded sphere is the largest radius sphere whose center lies in the version space and whose surface does not intersect with the hyperplanes. The center of the embedded sphere corresponds to the SVM, its radius is proportional to the margin of the SVM in F, and the training points corresponding to the hyperplanes that it touches are the support vectors.

Note that a version space only exists if the training data are linearly separable in the feature space. Thus, we require linear separability of the training data in the feature space. This restriction is much less harsh than it might at first seem. First, the feature space often has a very high dimension and so in many cases it results in the data set being linearly separable. Second, as noted by Shawe-Taylor and Cristianini (1999), it is possible to modify any kernel so that the data in the new induced feature space is linearly separable (footnote 2).

There exists a duality between the feature space F and the parameter space W (Vapnik, 1998, Herbrich et al., 2001) which we shall take advantage of in the next section: points in F correspond to hyperplanes in W and vice versa.

By definition, points in W correspond to hyperplanes in F. The intuition behind the converse is that observing a training instance xi in the feature space restricts the set of separating hyperplanes to ones that classify xi correctly. In fact, we can show that the set of allowable points w in W is restricted to lie on one side of a hyperplane in W. More formally, to show that points in F correspond to hyperplanes in W, suppose we are given a new training instance xi with label yi. Then any separating hyperplane must satisfy yi(w · Φ(xi)) > 0. Now, instead of viewing w as the normal vector of a hyperplane in F, think of Φ(xi) as being the normal vector of a hyperplane in W. Thus yi(w · Φ(xi)) > 0 defines a half space in W. Furthermore w · Φ(xi) = 0 defines a hyperplane in W that acts as one of the boundaries to version space V. Notice that the version space is a connected region on the surface of a hypersphere in parameter space. See Figure 2a for an example.

2. This is done by redefining for all training instances xi: K(xi, xi) ← K(xi, xi) + ν, where ν is a positive regularization constant. This essentially achieves the same effect as the soft margin error function (Cortes and Vapnik, 1995) commonly used in SVMs. It permits the training data to be linearly non-separable in the original feature space.
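The modification in footnote 2 amounts to a one-line operation on the Gram matrix; a hedged sketch (the example matrix and the value of ν below are placeholders, not values from the paper):

# Footnote 2 as an operation on the Gram matrix: K(x_i, x_i) <- K(x_i, x_i) + nu.
# The example matrix and nu are placeholders for illustration.
import numpy as np

K = np.array([[1.0, 0.8],
              [0.8, 1.0]])              # kernel (Gram) matrix K(x_i, x_j)
nu = 0.1                                 # positive regularization constant
K_modified = K + nu * np.eye(len(K))     # data in the new induced space becomes separable
print(K_modified)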

SVMs find the hyperplane that maximizes the margin in the feature space F. One way to pose this optimization task is as follows:

    maximize_{w ∈ F}   min_i { yi (w · Φ(xi)) }
    subject to:        ‖w‖ = 1
                       yi (w · Φ(xi)) > 0,   i = 1 . . . n.

By having the conditions ‖w‖ = 1 and yi(w · Φ(xi)) > 0 we cause the solution to lie in the version space. Now, we can view the above problem as finding the point w in the version space that maximizes the distance: min_i { yi (w · Φ(xi)) }. From the duality between feature and parameter space, and since ‖Φ(xi)‖ = λ, each Φ(xi)/λ is a unit normal vector of a hyperplane in parameter space. Because of the constraints yi(w · Φ(xi)) > 0, i = 1 . . . n, each of these hyperplanes delimit the version space. The expression yi(w · Φ(xi)) can be regarded as:

λ × the distance between the point w and the hyperplane with normal vector Φ(xi).

Thus, we want to find the point w∗ in the version space that maximizes the minimum distance to any of the delineating hyperplanes. That is, SVMs find the center of the largest radius hypersphere whose center can be placed in the version space and whose surface does not intersect with the hyperplanes corresponding to the labeled instances, as in Figure 2b.
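For a two-dimensional toy problem the optimization above can even be solved by brute force, which makes the "center of the largest hypersphere" picture easy to verify numerically; the data and the coarse grid over unit vectors below are assumptions of this sketch:

# Brute-force illustration of: maximize over ||w|| = 1 the value min_i y_i (w . Phi(x_i)).
# Toy 2-D data; Phi is the identity map here, and the grid search is deliberately naive.
import numpy as np

Phi_X = np.array([[1.0, 0.3], [0.9, -0.2], [-0.8, 0.4], [-1.0, -0.5]])  # Phi(x_i)
y = np.array([1, 1, -1, -1])

best_w, best_val = None, -np.inf
for theta in np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False):
    w = np.array([np.cos(theta), np.sin(theta)])   # a unit vector on the circle
    val = np.min(y * (Phi_X @ w))                  # min_i y_i (w . Phi(x_i))
    if val > best_val:
        best_w, best_val = w, val

print("approximate w*:", best_w, "value of the minimum:", best_val)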

The normals of the hyperplanes that are touched by the maximal radius hypersphere are the Φ(xi) for which the distance yi(w∗ · Φ(xi)) is minimal. Now, taking the original rather than the dual view, and regarding w∗ as the unit normal vector of the SVM and Φ(xi) as points in feature space, we see that the hyperplanes that are touched by the maximal radius hypersphere correspond to the support vectors (i.e., the labeled points that are closest to the SVM hyperplane boundary).

The radius of the sphere is the distance from the center of the sphere to one of the touching hyperplanes and is given by yi(w∗ · Φ(xi)/λ) where Φ(xi) is a support vector. Now, viewing w∗ as a unit normal vector of the SVM and Φ(xi) as points in feature space, we have that the distance yi(w∗ · Φ(xi)/λ) is:

(1/λ) × the distance between support vector Φ(xi) and the hyperplane with normal vector w,

which is the margin of the SVM divided by λ. Thus, the radius of the sphere is proportional to the margin of the SVM.


4. Active Learning

In pool-based active learning we have a pool of unlabeled instances. It is assumed that the instances x are independently and identically distributed according to some underlying distribution F(x) and the labels are distributed according to some conditional distribution P(y | x).

Given an unlabeled pool U, an active learner ℓ has three components: (f, q, X). The first component is a classifier, f : X → {−1, 1}, trained on the current set of labeled data X (and possibly unlabeled instances in U too). The second component q(X) is the querying function that, given a current labeled set X, decides which instance in U to query next. The active learner can return a classifier f after each query (online learning) or after some fixed number of queries.
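The (f, q, X) decomposition can be pictured as the loop sketched below. Everything in the sketch is illustrative scaffolding rather than the authors' system: scikit-learn's SVC stands in for the classifier f, the oracle function is a stand-in for asking for a label, the random querying function is only a placeholder for the methods developed in this section, and init_idx is assumed to contain at least one instance of each class.

# Skeleton of pool-based active learning: classifier f, querying function q,
# and a growing labeled set X (kept here as indices into the pool).
# pool_X is assumed to be a NumPy array of shape (n_pool, n_features).
import random
from sklearn.svm import SVC

def q_random(clf, pool_X, candidates):
    # Placeholder querying function; the methods of Section 4 would replace this.
    return random.choice(candidates)

def active_learning_loop(pool_X, oracle, init_idx, q=q_random, n_queries=10):
    labeled = list(init_idx)                    # indices of labeled instances (the set X)
    y = [oracle(i) for i in labeled]            # oracle(i) returns the true label of pool item i
    for _ in range(n_queries):
        clf = SVC(kernel="linear").fit(pool_X[labeled], y)        # component f
        candidates = [i for i in range(len(pool_X)) if i not in labeled]
        j = q(clf, pool_X, candidates)                            # component q picks the next query
        labeled.append(j)
        y.append(oracle(j))                                       # request the label for instance j
    return SVC(kernel="linear").fit(pool_X[labeled], y)           # classifier returned afterwards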

The main difference between an active learner and a passive learner is the querying component q. This brings us to the issue of how to choose the next unlabeled instance to query. Similar to Seung et al. (1992), we use an approach that queries points so as to attempt to reduce the size of the version space as much as possible. We take a myopic approach that greedily chooses the next query based on this criterion. We also note that myopia is a standard approximation used in sequential decision making problems (Horvitz and Rutledge, 1991; Latombe, 1991; Heckerman et al., 1994). We need two more definitions before we can proceed:

Definition 2 Area(V) is the surface area that the version space V occupies on the hypersphere ‖w‖ = 1.

Definition 3 Given an active learner ℓ, let Vi denote the version space of ℓ after i queries have been made. Now, given the (i + 1)th query xi+1, define:

    V−i = Vi ∩ { w ∈ W | −(w · Φ(xi+1)) > 0 },
    V+i = Vi ∩ { w ∈ W | +(w · Φ(xi+1)) > 0 }.

So V−i and V+i denote the resulting version spaces when the next query xi+1 is labeled as −1 and 1 respectively.

We wish to reduce the version space as fast as possible. Intuitively, one good way of doing this is to choose a query that halves the version space. The following lemma says that, for any given number of queries, the learner that chooses successive queries that halve the version spaces is the learner that minimizes the maximum expected size of the version space, where the maximum is taken over all conditional distributions of y given x:


Lemma 4 Suppose we have an input space X, finite dimensional feature space F (induced via a kernel K), and parameter space W. Suppose active learner ℓ∗ always queries instances whose corresponding hyperplanes in parameter space W halve the area of the current version space. Let ℓ be any other active learner. Denote the version spaces of ℓ∗ and ℓ after i queries as V∗i and Vi respectively. Let P denote the set of all conditional distributions of y given x. Then,

    ∀i ∈ N+    sup_{P ∈ P} EP[Area(V∗i)] ≤ sup_{P ∈ P} EP[Area(Vi)],

with strict inequality whenever there exists a query j ∈ {1 . . . i} by ℓ that does not halve version space Vj−1.

Proof. The proof is straightforward. The learner ℓ∗ always chooses to query instances that halve the version space. Thus Area(V∗i+1) = (1/2) Area(V∗i) no matter what the labeling of the query points are. Let r denote the dimension of feature space F. Then r is also the dimension of the parameter space W. Let Sr denote the surface area of the unit hypersphere of dimension r. Then, under any conditional distribution P, Area(V∗i) = Sr / 2^i.

Now, suppose ℓ does not always query an instance that halves the area of the version space. Then after some number, k, of queries ℓ first chooses to query a point xk+1 that does not halve the current version space Vk. Let yk+1 ∈ {−1, 1} correspond to the labeling of xk+1 that will cause the larger half of the version space to be chosen.

Without loss of generality assume Area(V−k) > Area(V+k) and so yk+1 = −1. Note that Area(V−k) + Area(V+k) = Sr / 2^k, so we have that Area(V−k) > Sr / 2^{k+1}. Now consider the conditional distribution P0:

    P0(−1 | x) = 1/2   if x ≠ xk+1,
    P0(−1 | x) = 1     if x = xk+1.

Then under this distribution, ∀i > k,

    EP0[Area(Vi)] = (1 / 2^{i−k−1}) Area(V−k) > Sr / 2^i.

Hence, ∀i > k,

    sup_{P ∈ P} EP[Area(V∗i)] < sup_{P ∈ P} EP[Area(Vi)].

Now, suppose w∗ ∈ W is the unit parameter vector corresponding to the SVM that we would have obtained had we known the actual labels of all of the data in the pool. We know that w∗ must lie in each of the version spaces V1 ⊃ V2 ⊃ V3 . . ., where Vi denotes the version space after i queries. Thus, by shrinking the size of the version space as much as possible with each query, we are reducing as fast as possible the space in which w∗ can lie. Hence, the SVM that we learn from our limited number of queries will lie close to w∗.

If one is willing to assume that there is a hypothesis lying within H that generates the data and that the generating hypothesis is deterministic and that the data are noise free, then strong generalization performance properties of an algorithm that halves version space can also be shown (Freund et al., 1997). For example one can show that the generalization error decreases exponentially with the number of queries.


Figure 3: (a) Simple Margin will query b. (b) Simple Margin will query a.

Figure 4: (a) MaxMin Margin will query b. The two SVMs with margins m− and m+ for b are shown. (b) Ratio Margin will query e. The two SVMs with margins m− and m+ for e are shown.

This discussion provides motivation for an approach where we query instances that split the current version space into two equal parts as much as possible. Given an unlabeled instance x from the pool, it is not practical to explicitly compute the sizes of the new version spaces V− and V+ (i.e., the version spaces obtained when x is labeled as −1 and +1 respectively). We next present three ways of approximating this procedure.

• Simple Margin. Recall from Section 3 that, given some data {x1 . . . xi} and labels {y1 . . . yi}, the SVM unit vector wi obtained from this data is the center of the largest hypersphere that can fit inside the current version space Vi. The position of wi in the version space Vi clearly depends on the shape of the region Vi, however it is often approximately in the center of the version space. Now, we can test each of the unlabeled instances x in the pool to see how close their corresponding hyperplanes in W come to the centrally placed wi. The closer a hyperplane in W is to the point wi, the more centrally it is placed in the version space, and the more it bisects the version space. Thus we can pick the unlabeled instance in the pool whose hyperplane in W comes closest to the vector wi. For each unlabeled instance x, the shortest distance between its hyperplane in W and the vector wi is simply the distance between the feature vector Φ(x) and the hyperplane wi in F—which is easily computed by |wi · Φ(x)|. This results in the natural rule: learn an SVM on the existing labeled data and choose as the next instance to query the instance that comes closest to the hyperplane in F.

Figure 3a presents an illustration. In the stylized picture we have flattened out the surface of the unit weight vector hypersphere that appears in Figure 2a. The white area is version space Vi which is bounded by solid lines corresponding to labeled instances. The five dotted lines represent unlabeled instances in the pool. The circle represents the largest radius hypersphere that can fit in the version space. Note that the edges of the circle do not touch the solid lines—just as the dark sphere in 2b does not meet the hyperplanes on the surface of the larger hypersphere (they meet somewhere under the surface). The instance b is closest to the SVM wi and so we will choose to query b.

• MaxMin Margin. The Simple Margin method can be a rather rough approximation. It relies on the assumption that the version space is fairly symmetric and that wi is centrally placed. It has been demonstrated, both in theory and practice, that these assumptions can fail significantly (Herbrich et al., 2001). Indeed, if we are not careful we may actually query an instance whose hyperplane does not even intersect the version space. The MaxMin approximation is designed to overcome these problems to some degree. Given some data {x1 . . . xi} and labels {y1 . . . yi}, the SVM unit vector wi is the center of the largest hypersphere that can fit inside the current version space Vi and the radius mi of the hypersphere is proportional (footnote 3) to the size of the margin of wi. We can use the radius mi as an indication of the size of the version space (Vapnik, 1998). Suppose we have a candidate unlabeled instance x in the pool. We can estimate the relative size of the resulting version space V− by labeling x as −1, finding the SVM obtained from adding x to our labeled training data and looking at the size of its margin m−. We can perform a similar calculation for V+ by relabeling x as class +1 and finding the resulting SVM to obtain margin m+.

Since we want an equal split of the version space, we wish Area(V−) and Area(V+) to be similar. Now, consider min(Area(V−), Area(V+)). It will be small if Area(V−) and Area(V+) are very different. Thus we will consider min(m−, m+) as an approximation and we will choose to query the x for which this quantity is largest. Hence, the MaxMin query algorithm is as follows: for each unlabeled instance x compute the margins m− and m+ of the SVMs obtained when we label x as −1 and +1 respectively; then choose to query the unlabeled instance for which the quantity min(m−, m+) is greatest.

Figures 3b and 4a show an example comparing the Simple Margin and MaxMin Margin methods.

• Ratio Margin. This method is similar in spirit to the MaxMin Margin method. We use m− and m+ as indications of the sizes of V− and V+. However, we shall try to take into account the fact that the current version space Vi may be quite elongated and for some x in the pool both m− and m+ may be small simply because of the shape of version space. Thus we will instead look at the relative sizes of m− and m+ and choose to query the x for which min(m−/m+, m+/m−) is largest (see Figure 4b).

3. To ease notation, without loss of generality we shall assume the constant of proportionality is 1, i.e., the radius is equal to the margin.

The above three methods are approximations to the querying component that always halves version space. After performing some number of queries we then return a classifier by learning a SVM with the labeled instances.
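The three querying rules can be sketched roughly as follows. This is illustrative code, not the authors' implementation: scikit-learn's SVC is assumed, the margin of a linear SVM is approximated by 1/‖w‖ instead of being computed in the kernel-induced feature space, and X, y, and pool are assumed to be NumPy arrays with both classes present in y.

# Rough sketches of the Simple, MaxMin, and Ratio Margin querying rules.
# Linear kernel and the 1/||w|| margin proxy are simplifying assumptions.
import numpy as np
from sklearn.svm import SVC

def margin(X, y):
    clf = SVC(kernel="linear", C=1e6).fit(X, y)     # near hard-margin linear SVM
    return 1.0 / np.linalg.norm(clf.coef_.ravel())  # margin of the fitted hyperplane

def simple_margin_query(clf, pool):
    # Query the pool instance closest to the current hyperplane: smallest |w . Phi(x)|.
    return int(np.argmin(np.abs(clf.decision_function(pool))))

def maxmin_margin_query(X, y, pool):
    # For each candidate compute m-, m+ after labeling it -1 / +1,
    # then query the candidate whose min(m-, m+) is largest.
    scores = [min(margin(np.vstack([X, x]), np.append(y, -1)),
                  margin(np.vstack([X, x]), np.append(y, +1))) for x in pool]
    return int(np.argmax(scores))

def ratio_margin_query(X, y, pool):
    # Query the candidate whose min(m-/m+, m+/m-) is largest.
    scores = []
    for x in pool:
        m_minus = margin(np.vstack([X, x]), np.append(y, -1))
        m_plus = margin(np.vstack([X, x]), np.append(y, +1))
        scores.append(min(m_minus / m_plus, m_plus / m_minus))
    return int(np.argmax(scores))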

The margin can be used as an indication of the version space size irrespective of whether the feature vectors have constant modulus. Thus the explanation for the MaxMin and Ratio methods still holds even without the constraint on the modulus of the training feature vectors. The Simple method can still be used when the training feature vectors do not have constant modulus, but the motivating explanation no longer holds since the maximal margin hyperplane can no longer be viewed as the center of the largest allowable sphere. However, for the Simple method, alternative motivations have recently been proposed by Campbell et al. (2000) that do not require the constraint on the modulus.

For inductive learning, after performing some number of queries we then return a classifier by learning a SVM with the labeled instances. For transductive learning, after querying some number of instances we then return a classifier by learning a transductive SVM with the labeled and unlabeled instances.

5. Experiments

For our empirical evaluation of the above methods we used two real-world text classification domains: the Reuters-21578 data set and the Newsgroups data set.

5.1 Reuters Data Collection Experiments

The Reuters-21578 data set (footnote 4) is a commonly used collection of newswire stories categorized into hand-labeled topics. Each news story has been hand-labeled with some number of topic labels such as "corn", "wheat" and "corporate acquisitions". Note that some of the topics overlap and so some articles belong to more than one category. We used the 12902 articles from the "ModApte" split of the data (footnote 5) and, to stay comparable with previous studies, we considered the top ten most frequently occurring topics. We learned ten different binary classifiers, one to distinguish each topic. Each document was represented as a stemmed, TFIDF-weighted word frequency vector (footnote 6). Each vector had unit modulus. A stop list of common words was used and words occurring in fewer than three documents were also ignored. Using this representation, the document vectors had about 10000 dimensions.
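A representation in this spirit can be approximated with standard tooling. The sketch below is only an approximation of the pipeline described above: it omits stemming, uses scikit-learn's built-in English stop list rather than the paper's, and the placeholder documents and min_df value are assumptions (on the full corpus, min_df=3 would mirror the "fewer than three documents" rule).

# Approximate document representation: TFIDF-weighted counts, a stop list,
# rare words dropped, and unit-length (L2-normalized) vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "grain shipments rose sharply this quarter",
    "the company announced a corporate acquisition",
    "wheat and corn exports fell on weak demand",
]  # placeholder documents standing in for Reuters articles

vectorizer = TfidfVectorizer(stop_words="english", min_df=1, norm="l2")
X = vectorizer.fit_transform(docs)    # each row is a unit-modulus document vector
print(X.shape)                        # (number of documents, vocabulary size)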

We first compared the three querying methods in the inductive learning setting. Our test set consisted of the 3299 documents present in the "ModApte" test set.

4. Obtained from www.research.att.com/~lewis.
5. The Reuters-21578 collection comes with a set of predefined training and test set splits. The commonly used "ModApte" split filters out duplicate articles and those without a labeled topic, and then uses earlier articles as the training set and later articles as the test set.
6. We used Rainbow (McCallum, 1996) for text processing.


Figure 5: (a) Average test set accuracy over the ten most frequently occurring topics when using a pool size of 1000. (b) Average test set precision/recall breakeven point over the ten most frequently occurring topics when using a pool size of 1000. (Both panels plot performance against labeled training set size, with curves for Full, Ratio, MaxMin, Simple and Random.)

Topic      Simple          MaxMin          Ratio           Equivalent Random size
Earn       86.39 ± 1.65    87.75 ± 1.40    90.24 ± 2.31    34
Acq        77.04 ± 1.17    77.08 ± 2.00    80.42 ± 1.50    > 100
Money-fx   93.82 ± 0.35    94.80 ± 0.14    94.83 ± 0.13    50
Grain      95.53 ± 0.09    95.29 ± 0.38    95.55 ± 1.22    13
Crude      95.26 ± 0.38    95.26 ± 0.15    95.35 ± 0.21    > 100
Trade      96.31 ± 0.28    96.64 ± 0.10    96.60 ± 0.15    > 100
Interest   96.15 ± 0.21    96.55 ± 0.09    96.43 ± 0.09    > 100
Ship       97.75 ± 0.11    97.81 ± 0.09    97.66 ± 0.12    > 100
Wheat      98.10 ± 0.24    98.48 ± 0.09    98.13 ± 0.20    > 100
Corn       98.31 ± 0.19    98.56 ± 0.05    98.30 ± 0.19    15

Table 1: Average test set accuracy over the top ten most frequently occurring topics (most frequent topic first) when trained with ten labeled documents. Boldface indicates statistical significance.

Topic      Simple          MaxMin          Ratio           Equivalent Random size
Earn       86.05 ± 0.61    89.03 ± 0.53    88.95 ± 0.74    12
Acq        54.14 ± 1.31    56.43 ± 1.40    57.25 ± 1.61    12
Money-fx   35.62 ± 2.34    38.83 ± 2.78    38.27 ± 2.44    52
Grain      50.25 ± 2.72    58.19 ± 2.04    60.34 ± 1.61    51
Crude      58.22 ± 3.15    55.52 ± 2.42    58.41 ± 2.39    55
Trade      50.71 ± 2.61    48.78 ± 2.61    50.57 ± 1.95    85
Interest   40.61 ± 2.42    45.95 ± 2.61    43.71 ± 2.07    60
Ship       53.93 ± 2.63    52.73 ± 2.95    53.75 ± 2.85    > 100
Wheat      64.13 ± 2.10    66.71 ± 1.65    66.57 ± 1.37    > 100
Corn       49.52 ± 2.12    48.04 ± 2.01    46.25 ± 2.18    > 100

Table 2: Average test set precision/recall breakeven point over the top ten most frequently occurring topics (most frequent topic first) when trained with ten labeled documents. Boldface indicates statistical significance.

For each of the ten topics we performed the following steps. We created a pool of unlabeled data by sampling 1000 documents from the remaining data and removing their labels. We then randomly selected two documents in the pool to give as the initial labeled training set. One document was about the desired topic, and the other document was not about the topic. Thus we gave each learner 998 unlabeled documents and 2 labeled documents. After a fixed number of queries we asked each learner to return a classifier (an SVM with a polynomial kernel of degree one (footnote 7) learned on the labeled training documents). We then tested the classifier on the independent test set.

The above procedure was repeated thirty times for each topic and the results were averaged. We considered the Simple Margin, MaxMin Margin and Ratio Margin querying methods as well as a Random Sample method. The Random Sample method simply randomly chooses the next query point from the unlabeled pool. This last method reflects what happens in the regular passive learning setting—the training set is a random sampling of the data.

To measure performance we used two metrics: test set classification error and, to stay compatible with previous Reuters corpus results, the precision/recall breakeven point (Joachims, 1998). Precision is the percentage of documents a classifier labels as "relevant" that are really relevant. Recall is the percentage of relevant documents that are labeled as "relevant" by the classifier. By altering the decision threshold on the SVM we can trade precision for recall and can obtain a precision/recall curve for the test set. The precision/recall breakeven point is a one number summary of this graph: it is the point at which precision equals recall.
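The breakeven point can be approximated directly from a precision/recall curve obtained by sweeping the decision threshold; the sketch below does this with made-up labels and decision values (both assumptions of the illustration, not data from the experiments):

# Approximate the precision/recall breakeven point: sweep the decision threshold
# and take the point where precision and recall are closest to equal.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                       # made-up labels
scores = np.array([0.9, 0.4, 0.65, 0.2, -0.3, 0.1, 0.75, -0.8])   # made-up SVM decision values

precision, recall, _ = precision_recall_curve(y_true, scores)
i = int(np.argmin(np.abs(precision - recall)))
breakeven = 0.5 * (precision[i] + recall[i])
print("precision/recall breakeven ~", round(breakeven, 3))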

Figures 5a and 5b present the average test set accuracy and precision/recall breakeven points over the ten topics as we vary the number of queries permitted. The horizontal line is the performance level achieved when the SVM is trained on all 1000 labeled documents comprising the pool. Over the Reuters corpus, the three active learning methods perform almost identically with little notable difference to distinguish between them. Each method also appreciably outperforms random sampling. Tables 1 and 2 show the test set accuracy and breakeven performance of the active methods after they have asked for just eight labeled instances (so, together with the initial two random instances, they have seen ten labeled instances). They demonstrate that the three active methods perform similarly on this Reuters data set after eight queries, with the MaxMin and Ratio showing a very slight edge in performance. The last columns in each table are of more interest. They show approximately how many instances would be needed if we were to use Random to achieve the same level of performance as the Ratio active learning method. In this instance, passive learning on average requires over six times as much data to achieve comparable levels of performance as the active learning methods. The tables indicate that active learning provides more benefit with the infrequent classes, particularly when measuring performance by the precision/recall breakeven point. This last observation has also been noted before in previous empirical tests (McCallum and Nigam, 1998).

7. For SVM and transductive SVM learning we used SVMlight (Joachims, 1999a).


[Figure 6 plots Test Set Accuracy (panel a) and Precision/Recall Breakeven Point (panel b) against Labeled Training Set Size, with curves for the Full, Ratio, BalancedRandom and Random methods.]

Figure 6: (a) Average test set accuracy over the ten most frequently occurring topics when using a pool size of 1000. (b) Average test set precision/recall breakeven point over the ten most frequently occurring topics when using a pool size of 1000.


We noticed that approximately half of the queries that the active learning methods asked tended to turn out to be positively labeled, regardless of the true overall proportion of positive instances in the domain. We investigated whether the gains that the active learning methods had over regular Random sampling were due to this biased sampling. We created a new querying method called BalancedRandom which would randomly sample an equal number of positive and negative instances from the pool. Obviously, in practice the ability to randomly sample an equal number of positive and negative instances without having to label an entire pool of instances first may or may not be reasonable depending upon the domain in question. Figures 6a and 6b show the average accuracy and breakeven point of the BalancedRandom method compared with the Ratio active method and regular Random method on the Reuters dataset with a pool of 1000 unlabeled instances. The Ratio and Random curves are the same as those shown in Figures 5a and 5b. The MaxMin and Simple curves are omitted to ease legibility. The BalancedRandom method has a much better precision/recall breakeven performance than the regular Random method, although it is still matched and then outperformed by the active method. For classification accuracy, the BalancedRandom method initially has extremely poor performance (less than 50%, which is even worse than pure random guessing) and is always consistently and significantly outperformed by the active method. This indicates that the performance gains of the active methods are not merely due to their ability to bias the class of the instances they query. The active methods are choosing special targeted instances, and approximately half of these instances happen to have positive labels.
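A minimal sketch of the BalancedRandom baseline follows, assuming labels of ±1 and, as noted above, the (usually unrealistic) ability to identify positives and negatives in the pool before labeling; the function name is ours.

import numpy as np

def balanced_random_sample(y_pool, n_per_class, seed=0):
    """Draw an equal number of positive and negative pool indices at random."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_pool == 1)
    neg = np.flatnonzero(y_pool == -1)
    chosen = np.concatenate([rng.choice(pos, size=n_per_class, replace=False),
                             rng.choice(neg, size=n_per_class, replace=False)])
    rng.shuffle(chosen)
    return chosen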


Figure 7: (a) Average test set accuracy over the ten most frequently occurring topics when using pool sizes of 500 and 1000. (b) Average breakeven point over the ten most frequently occurring topics when using pool sizes of 500 and 1000.


Figures 7a and 7b show the average accuracy and breakeven point of the Ratio method with two different pool sizes. Clearly the Random sampling method’s performance will not be affected by the pool size. However, the graphs indicate that increasing the pool of unlabeled data will improve both the accuracy and breakeven performance of active learning. This is quite intuitive since a good active method should be able to take advantage of a larger pool of potential queries and ask more targeted questions.

We also investigated active learning in a transductive setting. Here we queried the points as usual except now each method (Simple and Random) returned a transductive SVM trained on both the labeled and remaining unlabeled data in the pool. As described by Joachims (1998) the breakeven point for a TSVM was computed by gradually altering the number of unlabeled instances that we wished the TSVM to label as positive. This involves re-learning the TSVM multiple times and was computationally intensive. Since our setting was transduction, the performance of each classifier was measured on the pool of data rather than a separate test set. This reflects the relevance feedback transductive inference example presented in the introduction.

Figure 8 shows that using a TSVM provides a slight advantage over a regular SVM in both querying methods (Random and Simple) when comparing breakeven points. However, the graph also shows that active learning provides notably more benefit than transduction—indeed using a TSVM with a Random querying method needs over 100 queries to achieve the same breakeven performance as a regular SVM with a Simple method that has only seen 20 labeled instances.


[Figure 8 plots Precision/Recall Breakeven Point against Labeled Training Set Size, with curves for Transductive Active, Inductive Active, Transductive Passive and Inductive Passive.]

Figure 8: Average pool set precision/recall breakeven point over the ten most frequently occurring topics when using a pool size of 1000.

[Figure 9 plots Test Set Accuracy against Labeled Training Set Size in two panels, with curves for the Full, Ratio, MaxMin, Simple and Random methods.]

Figure 9: (a) Average test set accuracy over the five comp.∗ topics when using a pool size of 500. (b) Average test set accuracy for comp.sys.ibm.pc.hardware with a 500 pool size.


5.2 Newsgroups Experiments

Our second data collection was K. Lang’s Newsgroups collection (Lang, 1995). We used the five comp.∗ groups, discarding the Usenet headers and subject lines. We processed the text documents exactly as before, resulting in vectors of about 10000 dimensions.


Figure 10: (a) A simple example of querying unlabeled clusters. (b) Macro-average test set accuracy for comp.os.ms-windows.misc and comp.sys.ibm.pc.hardware where Hybrid uses the Ratio method for the first ten queries and Simple for the rest.

We placed half of the 5000 documents aside to use as an independent test set, and repeatedly, randomly chose a pool of 500 documents from the remaining instances. We performed twenty runs for each of the five topics and averaged the results. We used test set accuracy to measure performance. Figure 9a contains the learning curve (averaged over all of the results for the five comp.∗ topics) for the three active learning methods and Random sampling. Again, the horizontal line indicates the performance of an SVM that has been trained on the entire pool. There is no appreciable difference between the MaxMin and Ratio methods but, in two of the five newsgroups (comp.sys.ibm.pc.hardware and comp.os.ms-windows.misc) the Simple active learning method performs notably worse than the MaxMin and Ratio methods. Figure 9b shows the average learning curve for the comp.sys.ibm.pc.hardware topic. In around ten to fifteen per cent of the runs for both of the two newsgroups the Simple method was misled and performed extremely poorly (for instance, achieving only 25% accuracy even with fifty training instances, which is worse than just randomly guessing a label!). This indicates that the Simple querying method may be more unstable than the other two methods.

One reason for this could be that the Simple method tends not to explore the feature space as aggressively as the other active methods, and can end up ignoring entire clusters of unlabeled instances. In Figure 10a, the Simple method takes several queries before it even considers an instance in the unlabeled cluster, while both the MaxMin and Ratio query a point in the unlabeled cluster immediately.

While MaxMin and Ratio appear more stable, they are much more computationally intensive. With a large pool of s instances, they require about 2s SVMs to be learned for each query. Most of the computational cost is incurred when the number of queries that have already been asked is large. The reason is that the cost of training an SVM grows polynomially with the size of the labeled training set and so now training each SVM is costly (taking over 20 seconds to generate the 50th query on a Sun Ultra 60 450Mhz workstation with a pool of 500 documents).
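To make the 2s figure concrete, here is a hedged sketch of a MaxMin-style query: every candidate is tentatively labeled +1 and then −1, an SVM is retrained for each labeling, and the candidate whose smaller resulting margin is largest is selected. scikit-learn's linear SVC stands in for the actual solver, and the geometric margin is read off as 1/||w||; the Ratio variant differs only in how the two margins are compared.

import numpy as np
from sklearn.svm import SVC

def margin_after_adding(X, y, x_new, y_new):
    """Retrain a linear SVM with one tentatively labeled instance added and
    return its geometric margin 1/||w||."""
    clf = SVC(kernel="linear").fit(np.vstack([X, x_new]), np.append(y, y_new))
    return 1.0 / np.linalg.norm(clf.coef_)

def maxmin_query(X_labeled, y_labeled, X_pool, candidate_idx):
    """Two SVM trainings per candidate (hence about 2s for a pool of s); pick
    the candidate that maximizes the smaller of the two resulting margins."""
    best, best_score = None, -np.inf
    for i in candidate_idx:
        m_plus = margin_after_adding(X_labeled, y_labeled, X_pool[i:i + 1], +1)
        m_minus = margin_after_adding(X_labeled, y_labeled, X_pool[i:i + 1], -1)
        score = min(m_plus, m_minus)
        if score > best_score:
            best, best_score = i, score
    return best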


Query   Simple   MaxMin   Ratio   Hybrid

1       0.008    3.7      3.7     3.7
5       0.018    4.1      5.2     5.2
10      0.025    12.5     8.5     8.5
20      0.045    13.6     19.9    0.045
30      0.068    22.5     23.9    0.073
50      0.110    23.2     23.3    0.115
100     0.188    42.8     43.2    0.2

Table 3: Typical run times in seconds for the Active methods on the Newsgroups dataset

However, when the quantity of labeled data is small, even with a large pool size, MaxMin and Ratio are fairly fast (taking a few seconds per query) since now training each SVM is fairly cheap. Interestingly, it is in the first ten queries that the Simple method seems to suffer the most through its lack of aggressive exploration. This motivates a Hybrid method: we can use MaxMin or Ratio for the first few queries and then use the Simple method for the rest, as sketched below. Experiments with the Hybrid method show that it maintains the stability of the MaxMin and Ratio methods while allowing the scalability of the Simple method. Figure 10b compares the Hybrid method with the Ratio and Simple methods on the two newsgroups for which the Simple method performed poorly. The test set accuracy of the Hybrid method is virtually identical to that of the Ratio method, while the Hybrid method’s run time was about the same as the Simple method, as indicated by Table 3.
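A minimal sketch of the Hybrid switch, assuming the Simple rule is "query the pool instance closest to the current hyperplane" and that an expensive MaxMin- or Ratio-style selector (such as the maxmin_query sketch above) is supplied by the caller; the switch point of ten queries follows the setting used for Figure 10b, and the function names are ours.

import numpy as np

def simple_query(clf, X_pool, candidate_idx):
    """Simple method: query the unlabeled instance closest to the current hyperplane."""
    cand = np.array(sorted(candidate_idx))
    dist = np.abs(clf.decision_function(X_pool[cand]))
    return int(cand[np.argmin(dist)])

def hybrid_query(query_number, clf, X_labeled, y_labeled, X_pool, candidate_idx,
                 expensive_query, switch_after=10):
    """Use the expensive selector for the first `switch_after` queries, then Simple."""
    if query_number < switch_after:
        return expensive_query(X_labeled, y_labeled, X_pool, candidate_idx)
    return simple_query(clf, X_pool, candidate_idx)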

6. Related Work

There have been several studies of active learning for classification. The Query by Committee algorithm (Seung et al., 1992, Freund et al., 1997) uses a prior distribution over hypotheses. This general algorithm has been applied in domains and with classifiers for which specifying and sampling from a prior distribution is natural. Committee-based methods have been used with probabilistic models (Dagan and Engelson, 1995) and specifically with the Naive Bayes model for text classification in a Bayesian learning setting (McCallum and Nigam, 1998). The Naive Bayes classifier provides an interpretable model and principled ways to incorporate prior knowledge and data with missing values. However, it typically does not perform as well as discriminative methods such as SVMs, particularly in the text classification domain (Joachims, 1998, Dumais et al., 1998).

We re-created McCallum and Nigam’s (1998) experimental setup on the Reuters-21578 corpus and compared the reported results from their algorithm (which we shall call the MN-algorithm hereafter) with ours. In line with their experimental setup, queries were asked five at a time, and this was achieved by picking the five instances closest to the current hyperplane. Figure 11a compares McCallum and Nigam’s reported results with ours. The graph indicates that the Active SVM performance is significantly better than that of the MN-algorithm.
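For reference, the five-at-a-time querying used in this comparison can be sketched as a batched version of the Simple rule (names are ours; any classifier exposing decision_function would do):

import numpy as np

def query_five_closest(clf, X_pool, candidate_idx):
    """Return the five pool instances closest to the current hyperplane."""
    cand = np.array(sorted(candidate_idx))
    dist = np.abs(clf.decision_function(X_pool[cand]))
    return cand[np.argsort(dist)[:5]]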

An alternative committee approach to query by committee was explored by Liere and Tadepalli (1997, 2000). Although their algorithm (LT-algorithm hereafter) lacks the theoretical justifications of the Query by Committee algorithm, they successfully used their committee-based active learning method with Winnow classifiers in the text categorization domain. Figure 11b was produced by emulating their experimental setup on the Reuters-21578 data set and it compares their reported results with ours. Their algorithm does not require a positive and negative instance to seed their classifier. Rather than seeding our Active SVM with a positive and negative instance (which would give the Active SVM an unfair advantage), the Active SVM randomly sampled 150 documents for its first 150 queries. This process virtually guaranteed that the training set contained at least one positive instance. The Active SVM then proceeded to query instances actively using the Simple method. Despite the very naive initialization policy for the Active SVM, the graph shows that the Active SVM accuracy is significantly better than that of the LT-algorithm.


[Figure 11 plots (a) Precision/Recall Breakeven Point against Labeled Training Set Size for SVM Simple Active versus the MN-Algorithm, and (b) Test Set Accuracy against Labeled Training Set Size for SVM Simple Active, SVM Passive, LT-Algorithm Winnow Active and LT-Algorithm Winnow Passive.]

Figure 11: (a) Average breakeven point performance over the Corn, Trade and Acq Reuters-21578 categories. (b) Average test set accuracy over the top ten Reuters-21578 categories.


Lewis and Gale (1994) introduced uncertainty sampling and applied it to a text domain using logistic regression and, in a companion paper, using decision trees (Lewis and Catlett, 1994). The Simple querying method for SVM active learning is essentially the same as their uncertainty sampling method (choose the instance that our current classifier is most uncertain about); however, they provided substantially less justification as to why the algorithm should be effective. They also noted that the performance of the uncertainty sampling method can be variable, performing quite poorly on occasions.

Two other studies (Campbell et al., 2000, Schohn and Cohn, 2000) independently developed our Simple method for active learning with support vector machines and provided different formal analyses. Campbell, Cristianini and Smola extend their analysis for the Simple method to cover the use of soft margin SVMs (Cortes and Vapnik, 1995) with linearly non-separable data. Schohn and Cohn note interesting behaviors of the active learning curves in the presence of outliers.


7. Conclusions and Future Work

We have introduced a new algorithm for performing active learning with SVMs. By taking advantage of the duality between parameter space and feature space, we arrived at three algorithms that attempt to reduce version space as much as possible at each query. We have shown empirically that these techniques can provide considerable gains in both the inductive and transductive settings—in some cases shrinking the need for labeled instances by over an order of magnitude, and in almost all cases reaching the performance achievable on the entire pool having seen only a fraction of the data. Furthermore, larger pools of unlabeled data improve the quality of the resulting classifier.

Of the three main methods presented, the Simple method is computationally the fastest. However, the Simple method seems to be a rougher and more unstable approximation, as we witnessed when it performed poorly on two of the five Newsgroup topics. If asking each query is expensive relative to computing time, then using either the MaxMin or Ratio method may be preferable. However, if the cost of asking each query is relatively cheap and more emphasis is placed upon fast feedback, then the Simple method may be more suitable. In either case, we have shown that the use of these methods for learning can substantially outperform standard passive learning. Furthermore, experiments with the Hybrid method indicate that it is possible to combine the benefits of the Ratio and Simple methods.

The work presented here leads us to many directions of interest. Several studies have noted that gains in computational speed can be obtained at the expense of generalization performance by querying multiple instances at a time (Lewis and Gale, 1994, McCallum and Nigam, 1998). Viewing SVMs in terms of the version space gives an insight as to where the approximations are being made, and this may provide a guide as to which multiple instances are better to query. For instance, it is suboptimal to query two instances whose version space hyperplanes are fairly parallel to each other. So, with the Simple method, instead of blindly choosing to query the two instances that are the closest to the current SVM, it may be better to query two instances that are close to the current SVM and whose hyperplanes in the version space are fairly perpendicular. Similar tradeoffs can be made for the Ratio and MaxMin methods.
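As an illustrative heuristic (not an algorithm from the paper): the version-space hyperplane induced by an instance x has normal x, so "fairly perpendicular hyperplanes" roughly means nearly orthogonal feature vectors. A sketch of picking such a diverse pair near the current SVM, with function and parameter names of our own choosing:

import numpy as np

def diverse_pair_query(clf, X_pool, candidate_idx, shortlist=20):
    """Pick two near-hyperplane instances whose feature vectors (and hence whose
    version-space hyperplanes) are as close to orthogonal as possible."""
    cand = np.array(sorted(candidate_idx))
    dist = np.abs(clf.decision_function(X_pool[cand]))
    near = cand[np.argsort(dist)[:shortlist]]          # closest to the current SVM
    first = near[0]
    x1 = X_pool[first]
    def abs_cos(i):
        x2 = X_pool[i]
        return abs(float(x1 @ x2)) / (np.linalg.norm(x1) * np.linalg.norm(x2) + 1e-12)
    second = min(near[1:], key=abs_cos)                # most nearly perpendicular
    return int(first), int(second)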

Bayes Point Machines (Herbrich et al., 2001) approximately find the center of mass of the version space. Using the Simple method with this point rather than the SVM point in the version space may produce an improvement in performance and stability. The use of Monte Carlo methods to estimate version space areas may also give improvements.

One way of viewing the strategy of always choosing to halve the version space is that we have essentially placed a uniform distribution over the current space of consistent hypotheses and we wish to reduce the expected size of the version space as fast as possible. Rather than maintaining a uniform distribution over consistent hypotheses, it is plausible that the addition of prior knowledge over our hypothesis space may allow us to modify our query algorithm and provide us with an even better strategy. Furthermore, the PAC-Bayesian framework introduced by McAllester (1999) considers the effect of prior knowledge on generalization bounds and this approach may lead to theoretical guarantees for the modified querying algorithms.

Finally, the Ratio and MaxMin methods are computationally expensive since they have to step through each of the unlabeled data instances and learn an SVM for each possible labeling. However, the temporarily modified data sets will only differ by one instance from the original labeled data set, and so one can envisage learning an SVM on the original data set and then computing the “incremental” updates to obtain the new SVMs (Cauwenberghs and Poggio, 2001) for each of the possible labelings of each of the unlabeled instances. Thus, one would hopefully obtain a much more efficient implementation of the Ratio and MaxMin methods and hence allow these active learning algorithms to scale up to larger problems.



Acknowledgments

This work was supported by DARPA’s Information Assurance program under subcontract to SRI International, and by ARO grant DAAH04-96-1-0341 under the MURI program “Integrated Approach to Intelligent Systems”.

References

C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.

C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, volume 13, 2001.

C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995.

I. Dagan and S. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the Twelfth International Conference on Machine Learning, pages 150–157. Morgan Kaufmann, 1995.

S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management. ACM Press, 1998.

Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.

D. Heckerman, J. Breese, and K. Rommelse. Troubleshooting under uncertainty. Technical Report MSR-TR-94-07, Microsoft Research, 1994.

R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine Learning Research, pages 245–279, 2001.

E. Horvitz and G. Rutledge. Time dependent utility and action under uncertainty. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1991.

T. Joachims. Text categorization with support vector machines. In Proceedings of the European Conference on Machine Learning. Springer-Verlag, 1998.


T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999a.

T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 200–209. Morgan Kaufmann, 1999b.

K. Lang. Newsweeder: Learning to filter netnews. In International Conference on Machine Learning, pages 331–339, 1995.

Jean-Claude Latombe. Robot Motion Planning. Kluwer Academic Publishers, 1991.

D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 148–156. Morgan Kaufmann, 1994.

D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. Springer-Verlag, 1994.

D. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.

A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/~mccallum/bow, 1996.

A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, 1998.

T. Mitchell. Generalization as search. Artificial Intelligence, 28:203–226, 1982.

J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART retrieval system: Experiments in automatic document processing. Prentice-Hall, 1971.

G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

Fabrizio Sebastiani. Machine learning in automated text categorisation. Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell’Informazione, 2001.

H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of Computational Learning Theory, pages 287–294, 1992.

J. Shawe-Taylor and N. Cristianini. Further results on the margin distribution. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 278–285, 1999.

V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 1982.

V. Vapnik. Statistical Learning Theory. Wiley, 1998.
