
Page 1:

7. Support Vector Machines (SVMs)

Basic Idea:

1. Transform the data with a non-linear mapping so that it is linearly separable. Cf Cover’s theorem: non-linearly separable data can be transformed into a new feature space which is linearly separable if 1) mapping is non-linear 2) dimensionality of feature space is high enough

2. Construct the ‘optimal’ hyperplane (a linear weighted sum of the outputs of the first layer) which maximises the degree of separation (the margin of separation, denoted by ρ) between the 2 classes

Page 2:

MLPs and RBFNs stop training when all points are classified correctly. Thus the decision surfaces are not optimised in the sense that the generalisation error is not minimised.

[Figure: decision boundaries produced by an MLP, an RBF network and an SVM for the same two-class data]

Page 3:

[Figure: SVM network architecture]

Input: m0-dimensional vector x = (x1, x2, ..., xm0)

First layer: mapping performed from the input space into a feature space of higher dimension, where the data is now linearly separable, using a set of m1 non-linear functions φ1(x), ..., φm1(x) (cf RBFNs)

Output: y = Σi wi φi(x) + b = wT φ(x) + b

b = w0 = bias

Page 4:

1. After learning both RBFN and MLP decision surfaces might not be at the optimal position. For example, as shown in the figure, both learning rules will not perform further iterations (learning) since the error criterion is satisfied (cf perceptron)

2. In contrast the SVM algorithm generates the optimal decision boundary (the dotted line) by maximising the distance between the classes which is specified by the distance between the decision boundary and the nearest data points

3. Points which lie closest to the decision boundary (exactly half the margin of separation ρ away from it) are known as Support Vectors

4. Intuition is that these are the most important points since moving them moves the decision boundary

Page 5:

Moving a support vector moves the decision boundary

Moving the other vectors has no effect

The algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary

Page 6:

[Figure: SVM architecture with an inner-product kernel hidden layer]

Input: m0-dimensional vector x = (x1, x2, ..., xm0)

Hidden layer: kernel evaluations K(x, x1), ..., K(x, xN), where K(x, xN) = φT(x) φ(xN)

Output weights: a1 d1, ..., aN dN

Output: y = Σi ai di K(x, xi) + b

b = bias

However, we shall see that the output of the SVM can also be interpreted as a weighted sum of the inner (dot) products of the images of the input x and the support vectors xi in the feature space, which is computed by an inner product kernel function K(x, xi)

Where: φ(x) = [φ1(x), φ2(x), ..., φm1(x)]T, i.e. the image of x in feature space

and di = +/- 1 depending on the class of xi

Page 7:

Why should inner product kernels be involved in pattern recognition?

-- Intuition is that they provide some measure of similarity

-- cf Inner product in 2D between 2 vectors of unit length returns the cosine of the angle between them.

e.g. x = [1, 0]T , y = [0, 1]T

I.e. if they are parallel inner product is 1

xT x = x.x = 1

If they are perpendicular inner product is 0

xT y = x.y = 0
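A quick numerical illustration of this similarity interpretation (a sketch using NumPy; not part of the original slides):

import numpy as np

# Inner products of unit-length vectors return the cosine of the angle
# between them, i.e. a simple measure of similarity.
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(np.dot(x, x))   # 1.0 -> parallel (identical) directions
print(np.dot(x, y))   # 0.0 -> perpendicular directions

z = np.array([1.0, 1.0]) / np.sqrt(2)   # unit vector at 45 degrees to x
print(np.dot(x, z))   # ~0.707 = cos(45 degrees)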

Page 8:

Differs from MLP (etc) approaches in a fundamental way

• In MLPs complexity is controlled by keeping number of hidden nodes small

• Here complexity is controlled independently of dimensionality

• The mapping φ means that the decision surface is constructed in a very high (often infinite) dimensional space

• However, the curse of dimensionality (which makes finding the optimal weights difficult) is avoided by using the notion of an inner product kernel (see: the kernel trick, later) and optimising the weights in the input space

Page 9:

SVMs are a superclass of networks containing both MLPs and RBFNs (and both can be generated using the SV algorithm)

Strengths:

Previous slide: i.e. complexity/capacity is independent of dimensionality of the data thus avoiding curse of dimensionality

Statistically motivated => Can get bounds on the error, can use the theory of VC dimension and structural risk minimisation (theory which characterises generalisation abilities of learning machines)

Finding the weights is a quadratic programming problem guaranteed to find the global minimum of the error surface. Thus the algorithm is efficient and SVMs generate near optimal classification and are insensitive to overtraining

Obtain good generalisation performance due to high dimension of feature space

Page 10:

Most important (?): by using a suitable kernel, the SVM automatically computes all network parameters for that kernel. E.g. RBF SVM: automatically selects the number and positions of the hidden nodes (and the weights and bias)

Weaknesses:

Scale (metric) dependent

Slow training (compared to RBFNs/MLPs) due to computationally intensive solution to QP problem especially for large amounts of training data => need special algorithms

Generates complex solutions (normally > 60% of training points are used as support vectors) especially for large amounts of training data. E.g. from Haykin: increase in performance of 1.5% over MLP. However, MLP used 2 hidden nodes, SVM used 285

Difficult to incorporate prior knowledge

Page 11:

The SVM was proposed by Vapnik and colleagues in the 70’s but has only recently become popular (early 90’s). It (and other kernel techniques) is currently a very active (and trendy) topic of research

See for example:

http://www.kernel-machines.org

or (book):

An Introduction to Support Vector Machines (and other kernel-based learning methods). N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000. ISBN 0-521-78019-5

for recent developments

Page 12:

First consider a linearly separable problem where the decision boundary is given by

g(x) = wTx + b = 0

And a set of training data X={(xi,di): i=1, .., N} where di = +1 if xi is in class 1 and –1 if it’s in class 2. Let the optimal weight-bias combination be w0 and b0

[Figure: a point x decomposed as x = xp + xn, where xp is its projection onto the hyperplane and xn is the component along the normal w]

Now: x = xp + xn = xp + r w0 / ||w0|| where: r = ||xn||

Since: g(xp) = 0, g(x) = w0T(xp + r w0 / ||w0||) + b0

g(x) = r w0T w0 / ||w0|| = r ||w0||

or: r = g(x)/ ||w0||
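A small numerical sketch of this distance formula (the values of w0 and b0 here are made up purely for illustration):

import numpy as np

# Signed algebraic distance r = g(x) / ||w0|| from a point x to the
# hyperplane g(x) = w0^T x + b0 = 0.
w0 = np.array([3.0, 4.0])   # ||w0|| = 5
b0 = -5.0

def signed_distance(x):
    return (np.dot(w0, x) + b0) / np.linalg.norm(w0)

print(signed_distance(np.array([3.0, 4.0])))   # (9 + 16 - 5) / 5 = 4.0
print(signed_distance(np.array([0.0, 0.0])))   # -5 / 5 = -1.0 (other side)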

Page 13:

Thus, as g(x) gives us the algebraic distance to the hyperplane, we want:

g(xi) = w0Txi + b0 >= 1 for di = +1

and g(xi) = w0Txi + b0 <= -1 for di = -1

(remembering that w0 and b0 can be rescaled without changing the boundary) with equality for the support vectors xs. Thus, considering points on the boundary and that:

r = g(x)/ ||w0||

we have:

r = 1/ ||w0|| for dS = 1 and r = -1/ ||w0|| for dS = -1

and so the margin of separation is:

ρ = 2 / ||w0||

• Thus, the solution w0 maximises the margin of separation

• Maximising this margin is equivalent to minimising ||w||

Page 14:

We now need a computationally efficient algorithm to find w0 and b0 using the training data (xi, di). That is, we want to minimise:

F(w) = 1/2 wTw

subject to: di(wTxi + b) >= 1 for i= 1, .. N

which is known as the primal problem. Note that the cost function F is convex in w (=> a unique solution) and that the constraints are linear in w.

Thus we can solve for w using the technique of Lagrange multipliers (non-maths: technique for solving constrained optimisation problems). For a geometrical interpretation of Lagrange multipliers see Bishop, 95, Appendix C.
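As an illustration only (this is not the algorithm used in practice), the primal problem can be handed directly to a general-purpose constrained optimiser. A minimal sketch in Python using SciPy's SLSQP solver on a made-up, linearly separable toy data set:

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data in 2-D (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

# Unknowns packed as z = [w1, w2, b]; minimise F(w) = 1/2 w^T w
def objective(z):
    w = z[:2]
    return 0.5 * np.dot(w, w)

# One inequality constraint d_i (w^T x_i + b) - 1 >= 0 per training point
constraints = [{'type': 'ineq',
                'fun': (lambda z, i=i: d[i] * (np.dot(z[:2], X[i]) + z[2]) - 1.0)}
               for i in range(len(d))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w_opt, b_opt = res.x[:2], res.x[2]
print('w =', w_opt, 'b =', b_opt, 'margin =', 2.0 / np.linalg.norm(w_opt))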

Page 15:

First we construct the Lagrangian function:

L(w, a) = 1/2 wTw - Σi ai [di(wTxi + b) - 1]

where ai are the Lagrange multipliers. L must be minimised with respect to w and b and maximised with respect to ai (it can be shown that such problems have a saddle-point at the optimal solution). Note that the Karush-Kuhn-Tucker (or, intuitively, the maximisation/constraint) conditions means that at the optimum:

ai [di(wTxi + b) - 1] = 0

This means that unless the data point is a support vector, ai = 0 and the corresponding point plays no part in determining the optimal hyperplane.

We then set the partial derivatives of L with respect to b and w to zero to obtain the conditions of optimality:

w = Σi ai di xi

and Σi ai di = 0

Page 16:

Given such a constrained convex problem, we can reform the primal problem using the optimality conditions to get the equivalent dual problem:

Given the training data sample {(xi, di), i = 1, …, N}, find the Lagrange multipliers ai which maximise:

Q(a) = Σi ai - ½ Σi Σj ai aj di dj xiT xj

subject to the constraints:

Σi ai di = 0

and ai >= 0

Notice that the input vectors are only involved as an inner product

Page 17:

Once the optimal Lagrange multipliers a0,i have been found we can use them to find the optimal w:

w0 = Σi a0,i di xi

and the optimal bias from the fact that for a positive support vector:

w0Txi + b0 = 1

=> b0 = 1 - w0Txi

[However, from a numerical perspective it is better to take the mean value of b0 resulting from all such data points in the sample]

Since a0,i = 0 if xi is not a support vector, ONLY the support vectors determine the optimal hyperplane which was our intuition
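A minimal sketch of this recipe end to end (maximise the dual, then recover w0 and b0 by averaging over the support vectors). The toy data is made up and a general-purpose SciPy solver stands in for a dedicated QP/SMO routine:

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
N = len(d)
H = (d[:, None] * d[None, :]) * (X @ X.T)    # H_ij = d_i d_j x_i^T x_j

def neg_Q(a):                                # maximise Q  <=>  minimise -Q
    return -(a.sum() - 0.5 * a @ H @ a)

cons = [{'type': 'eq', 'fun': lambda a: np.dot(a, d)}]   # sum_i a_i d_i = 0
bnds = [(0.0, None)] * N                                 # a_i >= 0

res = minimize(neg_Q, np.zeros(N), method='SLSQP', bounds=bnds, constraints=cons)
a = res.x

w0 = (a * d) @ X                             # w0 = sum_i a_0,i d_i x_i
sv = a > 1e-6                                # support vectors have a_i > 0
b0 = np.mean(d[sv] - X[sv] @ w0)             # average b0 over the support vectors
print('a =', np.round(a, 4))
print('w0 =', w0, 'b0 =', b0)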

Page 18:

For a non-linearly separable problem we have to first map the data onto a feature space so that they are linearly separable:

xi → φ(xi)

with the procedure for determining w the same except that xi is replaced by φ(xi), that is:

Given the training data sample {(xi, di), i = 1, …, N}, find the optimum values of the weight vector w and bias b

w = Σi a0,i di φ(xi)

where a0,i are the optimal Lagrange multipliers determined by maximising the following objective function:

Q(a) = Σi ai - ½ Σi Σj ai aj di dj φ(xi)T φ(xj)

subject to the constraints:

Σi ai di = 0 ; ai >= 0

Page 19:

Example XOR problem revisited:

Let the nonlinear mapping be:

φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)T

And: φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)T

Therefore the feature space is in 6D with input data in 2D

x1 = (-1,-1), d1 = -1;  x2 = (-1,1), d2 = +1;  x3 = (1,-1), d3 = +1;  x4 = (1,1), d4 = -1

Page 20:

Q(a) = Σi ai - ½ Σi Σj ai aj di dj φ(xi)T φ(xj)

= a1 + a2 + a3 + a4 - ½ (1+1+2+1+2+2) a1a1 + ½ (1+1-2+1+2-2) a1a2 + …

= a1 + a2 + a3 + a4 - ½ (9a1a1 - 2a1a2 - 2a1a3 + 2a1a4 + 9a2a2 + 2a2a3 - 2a2a4 + 9a3a3 - 2a3a4 + 9a4a4)

To maximise Q, we only need to set ∂Q/∂ai = 0 (due to the optimality conditions), which gives

1 = 9 a1 - a2 - a3 + a4

1 = -a1 + 9 a2 + a3 - a4

1 = -a1 + a2 + 9 a3 - a4

1 = a1 - a2 - a3 + 9 a4

Page 21:

The solution of which gives the optimal values: a0,1 = a0,2 = a0,3 = a0,4 = 1/8
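A quick numerical check of the four optimality equations above (illustrative only):

import numpy as np

# Coefficient matrix of the four linear equations in a1..a4
A = np.array([[ 9., -1., -1.,  1.],
              [-1.,  9.,  1., -1.],
              [-1.,  1.,  9., -1.],
              [ 1., -1., -1.,  9.]])
a = np.linalg.solve(A, np.ones(4))
print(a)   # [0.125 0.125 0.125 0.125], i.e. a_0,i = 1/8 for all i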

w0 = Σi a0,i di φ(xi) = 1/8 [ -φ(x1) + φ(x2) + φ(x3) - φ(x4) ]

= [0, 0, -1/√2, 0, 0, 0]T

Where the first element of w0 gives the bias b (here b = 0)

Page 22:

From earlier we have that the optimal hyperplane is defined by:

w0T φ(x) = 0

That is:

w0T φ(x) = [0, 0, -1/√2, 0, 0, 0] [1, x1², √2 x1x2, x2², √2 x1, √2 x2]T = -x1x2 = 0

which is the optimal decision boundary for the XOR problem. Furthermore we note that the solution is unique since the optimal decision boundary is unique
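A short check that this decision function g(x) = -x1x2 does reproduce the XOR labels (sketch, not part of the original slides):

import numpy as np

X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
d = np.array([-1., 1., 1., -1.])

g = -X[:, 0] * X[:, 1]            # w0^T phi(x) = -x1 * x2
print(g)                          # [-1.  1.  1. -1.]
print(np.all(np.sign(g) == d))    # True: all four points classified correctly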

Page 23:

[Figure: SVM output for a polynomial kernel and for an RBF kernel]

Page 24:

SVM building procedure:

1. Pick a nonlinear mapping φ

2. Solve for the optimal weight vector

However: how do we pick the function φ?

• In practical applications, if it is not totally impossible to find φ, it is very hard

• In the previous example, the function φ is quite complex: how would we find it?

Answer: the Kernel Trick

Page 25:

Notice that in the dual problem the images of the input vectors are only involved as an inner product, meaning that the optimisation can be performed in the (lower dimensional) input space and that the inner product can be replaced by an inner-product kernel:

Q(a) = Σi ai - ½ Σi Σj ai aj di dj φ(xi)T φ(xj)

= Σi ai - ½ Σi Σj ai aj di dj K(xi, xj)

How do we relate the output of the SVM to the kernel K?

Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulations

Page 26:

Hyperplane is defined by: wT φ(x) + b = 0

or: wT φ(x) = 0, where φ0(x) = 1 (the bias b is absorbed as w0)

writing: φ(x) = [φ0(x), φ1(x), ..., φm1(x)]T

we get: wT φ(x) = 0

from the optimality conditions: w = Σi ai di φ(xi)

Thus: Σi ai di φT(xi) φ(x) = 0

and so the boundary is: Σi ai di K(xi, x) = 0

and Output = wT φ(x) = Σi ai di K(x, xi)

where: K(x, xi) = φT(x) φ(xi) = Σj φj(x) φj(xi)

Page 27:

In the XOR problem, we chose to use the kernel function:

K(x, xi) = (xT xi + 1)²

= 1 + x1² xi1² + 2 x1x2 xi1xi2 + x2² xi2² + 2 x1xi1 + 2 x2xi2

Which implied the form of our nonlinear functions:

φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)T

And: φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)T

However, we did not need to calculate φ at all and could simply have used the kernel to calculate:

Q(a) = Σi ai - ½ Σi Σj ai aj di dj K(xi, xj)

Maximised and solved for ai and derived the hyperplane via:

Σi ai di K(xi, x) = 0
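A quick numerical check of this (a sketch using NumPy): the kernel value (xT xi + 1)² agrees with the explicit inner product φ(x)T φ(xi) for the mapping above:

import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

def K(x, y):
    return (np.dot(x, y) + 1.0) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
print(K(x, y), np.dot(phi(x), phi(y)))   # the two values agree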

Page 28:

We therefore only need a suitable choice of kernel function, cf. Mercer’s Theorem:

Let K(x,y) be a continuous symmetric kernel that is defined in the closed interval [a,b]. The kernel K can be expanded in the form

K(x,y) = φ(x)T φ(y)

provided it is positive definite. Some of the usual choices for K are:

Polynomial SVM: (xT xi + 1)^p   (p specified by user)

RBF SVM: exp(-||x - xi||² / (2σ²))   (σ specified by user)

MLP SVM: tanh(s0 xT xi + s1)   (Mercer’s theorem not satisfied for all s0 and s1)
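The same kernels written as code, with a practical check of Mercer's condition: the Gram matrix on a sample of points should be positive semi-definite. The parameter names (p, sigma, s0, s1) follow the table above and the data is made up:

import numpy as np

def poly_kernel(x, y, p=3):
    return (np.dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def mlp_kernel(x, y, s0=1.0, s1=-1.0):
    # only a valid Mercer kernel for some choices of s0 and s1
    return np.tanh(s0 * np.dot(x, y) + s1)

# Mercer's condition in practice: eigenvalues of the Gram matrix are >= 0
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
G = np.array([[rbf_kernel(u, v) for v in X] for u in X])
print(np.linalg.eigvalsh(G).min() >= -1e-10)   # True for the RBF kernel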

Page 29:

How to recover φ from a given K? Not essential that we do…

Further development:

1. In practical applications, it is found that the support vector machine can outperform other learning machines

2. How to choose the kernel?

3. How much better is the SVM compared with traditional machines?

Feng J., and Williams P. M. (2001) The generalization error of the symmetric and scaled support vector machine. IEEE Transactions on Neural Networks, Vol. 12, No. 5, 1255-1260

Page 30:

What about regularisation?

Important that we don’t allow noise to spoil our generalisation: we want a soft margin of separation

Introduce slack variables ei >= 0 such that:

di(wTxi + b) >= 1 – ei for i= 1, .. N

Rather than: di(wTxi + b) >= 1

[Figure: three cases of slack: a point on the margin (ei = 0), a correctly classified point inside the margin (0 < ei <= 1), and a misclassified point (ei > 1)]

But all 3 are support vectors since di(wTxi + b) = 1 - ei

Page 31:

Thus the slack variables measure our deviation from the ideal pattern separability and also allow us some freedom in specifying the hyperplane

Therefore formulate new problem to minimise:

F(w, e) = 1/2 wTw + C Σi ei

subject to:

di(wTxi + b) >= 1 - ei for i = 1, .., N

And:

ei >= 0

Where C acts as an (inverse) regularisation parameter which can be determined experimentally or analytically.
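A practical sketch (not from the original slides): scikit-learn's SVC exposes this C parameter directly, so the effect of the soft margin can be seen by varying C on some made-up, slightly overlapping data. Smaller C means more regularisation, a wider margin and typically more support vectors:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=+1.0, size=(50, 2))])
d = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, d)
    print('C =', C, ' support vectors:', len(clf.support_vectors_))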

Page 32:

The solution proceeds in the same way as before (Lagrangian, formulate dual and maximise) to obtain optimal ai for:

Q(a) = Σi ai - ½ Σi Σj ai aj di dj K(xi, xj)

subject to the constraints

Σi ai di = 0 ;

0 <= ai <= C

Thus, the nonseparable problem differs from the separable one only in that the second constraint is more stringent. Again the optimal solution is:

w0 = Σi a0,i di φ(xi)

However, this time the KKT conditions imply that:

ei = 0 if ai < C

Page 33:

SVMs for non-linear regression

SVMs can also be used for non-linear regression. However, unlike MLPs and RBFs the formulation does not follow directly from the classification case

Starting point: we have input data

X = {(x1,d1), …., (xN,dN)}

Where xi is in D dimensions and di is a scalar. We want to find a robust function f(x) that has at most ε deviation from the targets d, while at the same time being as flat (in the regularisation sense of a smooth boundary) as possible.

Page 34:

Thus setting:

f(x) = wT φ(x) + b

The problem becomes, minimise:

½ wTw (for flatness)

[think of gradient between (0,0) and (1,1) if weights are (1,1) vs (1000, 1000)]

Subject to:

di - wT φ(xi) - b <= ε

wT φ(xi) + b - di <= ε

Page 35:

[Figure: the ε-insensitive loss function L(f, d)]

This formalisation is called ε-insensitive regression as it is equivalent to minimising the empirical risk (the amount you might be wrong) using an ε-insensitive loss function:

L(f, d, x) = 0 for | f(x) - d | < ε

= | f(x) - d | - ε otherwise
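The ε-insensitive loss written out as a short function (a sketch; the test values are made up):

import numpy as np

def eps_insensitive_loss(f_x, d, eps=0.1):
    # zero inside the epsilon tube, linear outside it
    return np.maximum(np.abs(f_x - d) - eps, 0.0)

errors = np.array([-0.3, -0.05, 0.0, 0.08, 0.5])
print(eps_insensitive_loss(errors, 0.0, eps=0.1))   # [0.2 0.  0.  0.  0.4]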

Page 36:

Comparing the ε-insensitive loss function to the least squares loss function (used for MLP/RBFN):

• More robust (robust to small changes in data/ model)

• Less sensitive to outliers

• Non-continuous derivative

Cost function is:

C Σi L(f, di, xi)

Where C can be viewed as a regularisation parameter

Page 37:

[Figure: regression for different ε; the function selected is the flattest. Original data marked (O)]

Page 38:

We now introduce 2 slack variables, ei and ei*, as in the case of nonlinearly separable data, and write:

di - wT φ(xi) - b <= ε + ei

wT φ(xi) + b - di <= ε + ei*

Where: ei, ei* >= 0

Thus: C Σi L(f, di, xi) = C Σi (ei + ei*)

And the problem becomes to minimise:

F(w, e, e*) = 1/2 wTw + C Σi (ei + ei*)

subject to: di - wT φ(xi) - b <= ε + ei

wT φ(xi) + b - di <= ε + ei*

And: ei, ei* >= 0

Page 39:

We now form the Lagrangian, and find the dual. Note that this time, there will be 2 sets of Lagrangian multipliers as there are 2 constraints. The dual to be maximised is:

Q(a, a*) = Σi di (ai - ai*) - ε Σi (ai + ai*) - ½ Σi Σj (ai - ai*)(aj - aj*) K(xi, xj)

subject to:

Σi (ai - ai*) = 0

and 0 <= ai <= C, 0 <= ai* <= C

Where ε and C are free parameters that control the approximating function:

f(x) = wT φ(x) = Σi (ai - ai*) K(x, xi)
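A practical sketch of ε-SVR (not from the original slides): scikit-learn's SVR exposes both free parameters, C and epsilon, together with an RBF kernel. The data and parameter values here are illustrative:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
d = np.sinc(X).ravel() + 0.05 * rng.normal(size=80)

svr = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, d)
print('support vectors used:', len(svr.support_), 'of', len(X))
# a larger epsilon gives a wider tube and hence fewer support vectors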

Page 40:

From the KKT conditions we now have:

ai (ε + ei - di + wT φ(xi) + b) = 0

ai* (ε + ei* + di - wT φ(xi) - b) = 0

This means that the Lagrange multipliers will only be non-zero for points where:

| f(xi) - di | >= ε

That is, only for points outside the tube.

Thus these points are the support vectors and we have a sparse expansion of w in terms of x

Page 41:

[Figure: data points and the selected SVs; ε controls the number of SVs selected]

Page 42:

Only non-zero a’s can contribute: the Lagrange multipliers act like forces on the regression. However, they can only be applied at points outside or touching the tube

[Figure: the points where these forces act]

Page 43:

One note of warning:

Regression is much harder than classification for 2 reasons

1. Regression is intrinsically more difficult than classification

2. ε and C must be tuned simultaneously

Page 44:

Research issues:

• Incorporation of prior knowledge e.g.

1. train a machine,

2. add in virtual support vectors, which incorporate known invariances, derived from the SVs found in 1.

3. retrain

• Speeding up training time?

Various techniques, mainly to deal with reducing the size of the data set. Chunking: use subsets of the data at a time and only keep SVs. Also, more sophisticated versions which use linear combinations of the training points as inputs

Page 45:

• Optimisation packages/techniques?

Off the shelf ones are not brilliant (including the MATLAB one). Sequential Minimal Optimisation (SMO) is widely used. For details of that and others see:

A. J. Smola and B. Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.

• Selection of ε and C