svm & image classification

8/12/2019 SVM & Image Classification.

1/22

C++ Project

Support Vector Machines& Image Classification

Authors:

Pascal-AdamSitbon

AlainSoltaniSupervisor:

Benot Patra

February 2014


2/22

Contents

Contents i

1 Support Vector Machines 1

1.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2.1 Linearly separable set . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.2 Nearly linearly separable set. . . . . . . . . . . . . . . . . . . . . . 4

1.2.3 Linearly inseparable set . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2.3.1 The kernel trick . . . . . . . . . . . . . . . . . . . . . . . 5

1.2.3.2 Classification : projection into a bigger space . . . . . . . 5

1.2.3.3 Mapping conveniently . . . . . . . . . . . . . . . . . . . . 6

1.2.3.4 Usual kernel functions . . . . . . . . . . . . . . . . . . . . 6

2 Computation under C++ 7

2.1 Librairies & datasets employed . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Project format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.3 Two-class SVM implementation . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3.1 First results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3.2 Parameter selection . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3.2.1 Optimal training on parameter grid . . . . . . . . . . . . 8

2.3.2.2 Iterating and sharpening results . . . . . . . . . . . . . . 9

2.4 A good insight : testing on a small zone . . . . . . . . . . . . . . . . . . . 10

2.5 Central results : testing on a larger zone . . . . . . . . . . . . . . . . . . . 11

2.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.5.2 Case of an unreached minimum . . . . . . . . . . . . . . . . . . . . 13

2.6 Going further : enriching our model . . . . . . . . . . . . . . . . . . . . . 142.6.1 Case (A) : limited dataset . . . . . . . . . . . . . . . . . . . . . . . 14

2.6.2 Case (B) : richer dataset. . . . . . . . . . . . . . . . . . . . . . . . 15

2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A Unbalanced data set 17

A.1 Different costs for misclassification . . . . . . . . . . . . . . . . . . . . . . 17

B Multi-class SVM 19

B.1 One-versus-all . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

B.2 One-versus-one . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Bibliography 20

i


3/22

Chapter 1

Support Vector Machines

1.1 Introduction

Support vector learning is based on simple ideas, which originated from statistical learn-ing theory[1]. Support vector machines are supervised learning models with associatedlearning algorithms that analyze data and recognize patterns.Basic SVM takes a set of input data and predicts, for each given input, which of the twopossible classes forms the output, making it a nonprobabilistic binary linear classifier.Given a set of training examples, each marked as belonging to one of two categories, anSVM training algorithm builds a model that assigns new examples into one category orthe other.

An SVM model is a representation of the examples as points in space, mapped so thatthe examples of the separate categories are divided by a clear gap that is as wide aspossible. New examples are then mapped into that same space and predicted to belongto a category based on which side of the gap they fall on.

1.2 Support Vector Machines

A data set containing points which belong to two different classes can be represented bythe following set :

D= {( xi, yi), 1 i m | i, yi {1; 1} , xi Rq} , (m, q) N2 (1.1)

where yi represents the belonging to one of the two classes, xi the training points, qthedimension of the data set.

One of the most important things we have to focus on is the shape of the data set.Our goal is to find the best way to distinguish between the two classes. Ideally, we wouldlike to have a linearly separabledata set - in which our two set of points can be fullyseparated by a line for a two-dimensional space, or a hyperplane for a n-dimensionalspace. However, this is not the case in general.

We will look in the following subsections at three possible configurations for our dataset.

1


4/22

Chapter 1. Introduction to Support Vector Machines 2

1.2.1 Linearly separable set

In the following example (Fig. 1.1), it is easy to see that the data points can be easilylinearly separated. Most of the time, with a big data set, its impossible to say just by

visualizing the data whether it can be linearly separated or not - even the data cannotbe visualized.

Figure 1.1: A simple linearly separable dataset.Blue points are labelled 1 ; red are labelled -1.

To solve the problem analytically, we have to define several new objects.

Definition. A linear separator is a functionfthat depends on two parameters w andb, given by the following formula :

fw,b(x) = w, x+ b, b R, w Rq. (1.2)

This separator can take more values than 1 and 1.Whenfw,b(x) 0, xwill belong to the class of vectors such that yi = 1 ; in the oppositecase, to the other class (i.e. such that yi = 1).The line of separationis the contour line defined by the equation fw,b(x) = 0.

Definition. The margin of an element( xi, yi), relatively to a separatorf, notedf

( xi,yi),

is the real given by :

f

( xi,yi)= f( xi) yi 0. (1.3)

Definition. The margin of a set of pointsD, relatively to a separatorf, is the minimumof the margins for all the elements inD :

fD = min

f

(xi,yi) | ( xi, yi) D

. (1.4)

Definition. The support vectors are the vectors such that :

f

(xi,yi) 1, i.e. yi( w, xi+ b) 1. (1.5)

The goal of the SVM is to maximize the margin of the data set.


5/22


Figure 1.2: Support vectors and minimal margin.The orange line represents the separation, while the pink and blue ones represents re-

spectively the hyperplans associated to the equationsfw,b(x) = 1 andfw,b(x) =1.

Lemma. The width of the band constituted by the hyperplansfw,b(x) = 1 andfw,b(x) =1 equals 2w .

Proof. Let ube a point of the contour line defined by fw,b(x) = 1.Let u be his orthogonal projection on the contour line fw,b(x) =1.Hence we have :

fw,b(u) fw,b( u) = 2

i.e. u u, w= 2.

Yet we have :i.e. u u, w=

u

u w .

Indeed, they are colinear, and have the same orientation.

Besides,u u

is equal to the width constituted by the two contour lines.

In order to find the best separator - i.e. the one providing the maximum margin - wehave to seek within the class of separators such that f

( xi,yi)>1, ( xi, yi) D and retain

the one for which w is minimal.

This leads us to solve the following constrained optimization problem :

minw,b

w2

2 (1.6)

under( xi, yi) D, f

(xi,yi)= yi( w, xi+ b)> 1.

NB. We minimize w2

2 for calculus purposes : derivations become easier ; besides, it isbetter to work with the square norm.

By introducing Lagrange multipliers i, the previous constrained problem can be ex-pressed as :

argminw,b

maxi0

w2

2

mi=1

i[yi( w, xi+ b)1] (1.7)

that is we look for a saddle point. In doing so all the points which can be separated asyi( w, xi+ b)1> 0 do not matter since we must set the corresponding i to zero.


6/22


This problem can now be solved by standard quadratic programming techniques.

1.2.2 Nearly linearly separable set

In this subsection, we will discuss the case of a nearly separable set - i.e. a dataset forwhich using a linear separator would be efficient enough. If there exists no hyperplanethat can split entirely the dataset, the following method - called soft margin method- will choose a hyperplane that splits the examples as cleanly as possible, while stillmaximizing the distance to the nearest cleanly split examples.

Let us modify the maximum margin idea to allow mislabeled examples to be treatedthe same way, by allowing points to have a margin which can be smaller than 1, evennegative.

The previous constraint in (1.6) now becomes :

( xi, yi) D, yi( w, xi+ b)> 1 i. (1.8)

where i 0 are called the slack variables, and measure the degree of misclassificationof the data xi.

The objective function we minimize has also to be changed : we increase it by a functionwhich penalizes non-zero i, and the optimization becomes a trade-off between a largemargin and a small error penalty.

If the penalty function is linear, the optimization problem becomes :

minw,b,

w2

2 + C

mi=1

i (1.9)

under( xi, yi) D, yi( w, xi+ b)> 1 i, i 0.

This constraint minimization problem above can be solved using Lagrange multipliersas done previously. We now solve the following problem :

argminw,b,

maxi,i0

w2

2 + C

mi=1

imi=1

i[yi( w, xi+ b)1 + i]mi=1

i (1.10)

with i, i 0.

1.2.3 Linearly inseparable set

We saw in the previous subsection that linear classification can read to misclassifications- this is especially true if the dataset D is not separable at all.

Let us consider the following example (Fig. 1.3).

For this set of data points, any linear classification would introduce too much misclas-sification to be considered as accurate enough.


7/22


Figure 1.3: Linearly inseparable set.Blue points are labelled 1 ; red are labelled -1.

1.2.3.1 The kernel trick

To solve our classification problem, let us introduce the kernel trick.For machine learning algorithms, the kernel trick is a way of mapping observations froma general data set S into an inner product space V, without having to compute themapping explicitly, such that the observations will have a meaningful linear structure inV.

Hence linear classifications in Vare equivalent to generic classifications in S. The trickor method used to avoid the explicit mapping is to use learning algorithms that onlyrequire dot productsbetween the vectors in V, and choose the mapping such that these

high-dimensional dot products can be computed within the original space, by means ofa certain kernel function - a function K : S2 V that can be expressed as an innerproduct.

1.2.3.2 Classification : projection into a bigger space

To understand the usefulness of the trick, lets go back to our classification problem.Let us consider a simple projection of vectors in D, our dataset, into a much richer,higher-dimension feature space.We project each point ofD in this bigger space and make a linear separation there.

Lets name p this projection :

( xi, yi) D, p( xi) =

p1( xi)...

pn( xi)

as we express the projected vector p( xi) in a base of the n-dimensional new space.

This point of view can lead to problems, because n can grow without any limit, andnothing assures us that thepi are linear in the vectors. Following the same method thanabove would imply to work on a new set D :

D =p(D) ={(p( xi), yi), 1 i m | i, yi {1; 1} , xi Rq} , (m, q) N2 (1.11)


8/22


Because it implies to calculate p for each vector ofD, this method will be never used.

1.2.3.3 Mapping conveniently

Lets first notice that its not necessary to calculate p, as the optimization problem onlyinvolves inner products between the different vectors. We can now consider the kerneltrick approach.

We construct :K :D2 V

such as K(x, z) =p(x), p(z), (x, yx), (z, yz) D (1.12)

making sure that it corresponds to a projection in the unknown space V.

We then avoid the calculus of p, and the description of the space in which we are

projecting. The optimization problem remains the same, through replacing ., . byk(., .):

minw,b,

k( w, w)

2 + C

mi=1

i (1.13)

under( xi, yi) D, yi(k( w, xi) + b)> 1 i, i 0.

1.2.3.4 Usual kernel functions

Polynomial :

K(x, z) = (xT

z+ c)d

where c 0 is a constant trading off the influence of higher-order versus lower-orderterms in the polynomial. Polynomials such that c= 0 are called homogeneous.

Gaussian radial basis (RBF) :

K(x, z) = exp(x z2), >0.

Sometimes parametrized using = 122

.

Hyperbolic tangent :

K(x, z) = tanh(xTz+ c), for >0, c


9/22

Chapter 2

Computation under C++

2.1 Librairies & datasets employed

We used for this project the computer vision and machine learning library OpenCV.All its SVM features are based on the specific library LibSVM,by Chih-Chung Changand Chih-Jen Lin.

We trained our models on the Image Classification Dataset from Andrea Vedaldi andAndrew Zissermans Oxford assignment. It includes five different image classes - aero-planes, motorbikes, people, horses and cars - of various sizes, and pre-computed featurevectors, in form of a sequence of consecutive 6-digit values. Pictures used are all colorimages in .jpg format, of various dimensions.

The dataset can be downloaded at :http://www.robots.ox.ac.uk/ ~vgg/share/practical-image-classification.htm.

2.2 Project format

The C++ project itself possesses 4 branches, for opening, saving, training & testingphases. In its original form, it allows opening two training files and a testing one, on auser-friendly, console-input base. User enters files directories, format used and labels forthe different training classes. For the testing phase, a label is asked, so results obtained

via the SVM classification can be compared with the prior label given by user ; the lattercan directly see the misclassification results - rate, number of misclassified files - in theconsole output.The user can either choose its own kernel type, parameter values, or let the computerrun the optimal one ; classes have been created in consequence. Following results havebeen obtained using this program and additional versions (especially when includingmultiple training files) that derive directly from it ; these will not be presented here.The project can be found on GitHubat :https://github.com/Parveez/CPP_Project_ENSAE_2013 .

7
http://opencv.org/about.htmlhttp://www.csie.ntu.edu.tw/~cjlin/libsvm/http://www.csie.ntu.edu.tw/~cjlin/libsvm/http://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htmhttp://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htmhttp://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htmhttp://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htmhttp://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htmhttp://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htmhttps://github.com/Parveez/CPP_Project_ENSAE_2013https://github.com/Parveez/CPP_Project_ENSAE_2013https://github.com/Parveez/CPP_Project_ENSAE_2013http://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htmhttp://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htmhttp://www.csie.ntu.edu.tw/~cjlin/libsvm/http://opencv.org/about.html


10/22

Chapter 2. Computation under C++ 8

2.3 Two-class SVM implementation

2.3.1 First results

We first trained our SVM with training sets aeroplane train.txt and horse train.txt ; the data tested was contained in aeroplane val.txt and horse val.txt.As the images included in the two training classes may vary in size, we all resized themto a unique testing zone ; same goes for the testing set. All images are stored in twomatrices - one for the training phase, one for the testing phase : each matrix row is apoint (here, an image), and all its coefficients are features (here, pixels). For example,for 251 training images, all of size 50x50 pixels, the training matrix will be of dimensions251x2500.

For a 50x50 pixel zone, with respectively 112 and 139 elements in each class, learning timeamounts to 0.458 seconds ; testing time, for 274 elements, amounts to 11.147 seconds.

But a classifier of any type produces bad results for randomly-assigned parameter values :for example, with the default value assigned to Cand, a gaussian classifier misclassifies126 elements of the aeroplane val.txt file.The following section discuss the optimal selection of the statmodel parameters.

2.3.2 Parameter selection

2.3.2.1 Optimal training on parameter grid

The effectiveness of SVM depends on the selection of kernel, the kernels parameters,

and soft margin parameter C.The best combination is here selected by a grid search with multiplicative growing se-quences of the parameter, given a certain step.Input parameters for the parameter selection are : min val,max val the extremal values tested step the step parameter.

Parameter values are tested through the following iteration sequence :

(min val,min valstep, ...,min valstepn)

with n such that min valstepn


11/22


2.3.2.2 Iterating and sharpening results

Even if results are improved by the use of a parameter grid, refinements can be added.Indeed, we sharpen our estimation by computing iterative parameter selection - each

time on smaller grids :

Data: Default inital gridResult: Optimal parameter for SVM trainingwhile iterations under threshold do

train SVM on grid through cross-validation ;return best parameter;set parameter= best parameter;re-centergrid;diminish grid size;

end

Algorithm 1: Basic iterative parameter testing.

One can initially think of :

max val(j) =max val(j) max val(j1)param(j)

2

min val(j) =min val(j) + param(j)min val(j1)

2

step(j) = step(j1)

2

to implement the grid resizing at the step j , withparam(j) best parameter value obtainedafter training the SVM model.Yet such recursion is not properly efficient : as j grows, the calculation time growsvery fastly. Indeed, as step gets smaller, the number of iterations to reach max valueincreases very easily.

As we usually initialize C and grid extremal values at different powers of ten, withstep(0) = 10, a convient way to resize the grid at step j is the following :

max val(j) =param(j)101

2j1+1012j

2

min val(j) =param(j)10 12j

step(j) =

step(j1) =... = 1012j

as we can express min val, max val using powers of ten after replacing step(j).

It only takes a couple of iterations to go through the grid, and produces equivalent orbetter results. Besides, the more precise the estimation of the parameters, the faster theiteration.


12/22


2.4 A good insight : testing on a small zone

We first sought results for a small zone of 50x50 pixels, to get a primary overview ofhow our algorithms works.

For such zone, and the following initial grid and characteristics1 :

Grid C

min val 107 103

max val 103 + 1010 107 + 103

Number of class 1 files 112

Number of class -1 files 139

Files tested 247

we obtained the following results :

No iterations nor grid usage(latest calculation time2 : 0.599 seconds):

default value 1

C default value 1

Files misclassified 126

Misclassification rate 0.459

After 1 iteration(latest calculation time : 11.691 seconds):

final value 107

C final value 1000Files misclassified 68


After 5 iterations(latest calculation time : 4.138 seconds):

final value 9.085108

C final value 90.851








1Again, we point our that RBF kernel type was not specified initially by the user, but chosen by the

program during paramater optimization.2Here represents the total calculation time - i.e. including training and testing time - for the last

iteration mentionned.


13/22


Figure 2.1: Values of C per iterations.

Figure 2.2: Values of 108 per iterations.

What can we surmise from those results ?Firstly, the number of misclassified images is improved by automatically training ourmodel on a grid.

Secondly, it is also improved by iterating the parameter selection process. Althoughdecay is slow, each iteration help our SVM classifying better the testing data.Lastly, calculation time seem to be globally lower iteration after iteration, in acceptableproportions considering the small size of our zone.

2.5 Central results : testing on a larger zone

2.5.1 Results

Let us now run training and testing on a larger zone of 300x300 zone, to gain bettercomprehension of our models behaviour.Parameter grids are initialized to the same values as in the previous subsection ; hereagain, RBF kernel is the optimal kernel type for the data.

No iterations nor grid usage(latest calculation time : 20.857 seconds):

default value 1

C default value 1

Files misclassified 126Misclassification rate 0.459


14/22



final value 107

C final value 1000












Figure 2.3: Number of misclassified images per iterations.

Figure 2.4: Values of C per iterations.Blue background, left : Normal scale. Red background, right : logarithmic scale.


15/22


Figure 2.5: Values of 1010 per iterations.Blue background, left : Normal scale. Red background, right : logarithmic scale.

2.5.2 Case of an unreached minimum

Here the most intriguing fact may be probably be that after 5 iterations, the number ofmisclassified files drops to 60 files out 274 tested, and raises to 62 the next step.This can be explained by the following fact : the point ((5), C(5)) is near the minimumvalue - i.e. the one providing the minimal misclassification rate- we are seeking, whichexact value cannot be reached through the grid at fifth step ; and as reposition ( , C)and resize the grid at ((5), C(5)), we might actually re-center the problem on a new areathat does not include the minimum at all.

Figure 2.6: Problem of the unreached minimum.Here the minimum is included in the upper-middle case of the grid at step 5. (Gamma,C) is the best approximation available over the grid, but shrinking the grid at this exact

point leaves the minimum off the new grid.

A solution to address this problem may be to have a smoother re-sizing algorithm, likethe first one we presented. But this may actually have a negative impact on calculationtime at each step.

For example, let us compare our results with those obtained with the initial, less-efficientre-sizing algorithm ; for the latter, with the same 300x300 pixel zone, the first three stepsof iteraion on parameter selection produced the following results :


16/22



final value 107

C final value 1000



final value 107

C final value 1000




final value 107

C final value 1000



At first step, the misclassification rate is the same as with the second re-sizing method; the decay is indeed much slower (the resizing is so smooth that second and third stepsstill give a rate of 0.430), but the calculation time are very poor. The third iteration

takes 1590.78 seconds to compute, compared to 161.975 with the convenient method.The conclusion of this section is that there might be an actual trade-off between com-puting performances and avoiding the unreached-minimum problem in many cases.

2.6 Going further : enriching our model

In the first two sections, we trained our model on two different subsets : aeroplane train.txt and horse train.txt, trying to make predictions for both aeroplanes and horses.Here, we will include more objects - horses, background, motorbikes, and cars - in the

class -1, and leave aeroplanes in the class 1 ; we will only try to classify files from testingtest aeroplane val.txt. Our goal is here to show how using a larger training set canimprove our predictions.

Let us compare the results between a class -1 testing test containing only horses - case(A) -, and the testing set described above - case (B). RBF kernel is the optimal kerneltype in both cases. Zone used is of size 300x300.

2.6.1 Case (A) : limited dataset

Grid C

min val 107 103

max val 103 + 1010 107 + 103


Number of class -1 files 139Files tested 126


17/22



final value 3.83108

C final value 177.8



final value 3.28108

C final value 56.51




final value 2.27108

C final value 26.25




final value 1.59108

C final value 13.98



2.6.2 Case (B) : richer dataset

Grid C

min val 107 103

max val 103

+ 1010

107

+ 103


Number of class -1 files 1717

Files tested 126


final value 106

C final value 1000



We directly see here, after only 1 iteration, that the classification accuracy is much

better ; the larger the initial training set, the better. Note that calculation time canreach quite high rates for very large datasets.


18/22


2.7 Conclusions

From all the experiments we conducted in this section, we can draw the following con-clusions :

The number of misclassified images is improved by automatically training our modelon a parameter grid.

It can also be improved by selecting the best parameter iteratively, and shriking ourgrid after each step. Choosing the right shrinking algorithm is very important, and can be very tricky.Indeed, for a very sharp resizing, calculation time can be acceptable but we might leavethe point of minimal misclassification out of the grid.Using a large training set is always a good thing, as it improves drastically classificationaccuracy.


19/22

Appendix A

Unbalanced data set

In this study, we luckily were in possession of well-balanced data sets : the numberof files for each subset were of the same order. However, in general, data sets can beunbalanced : one class may contain a lot more example than others. The principalproblem linked to these data sets is that we no longer can say that a classifier is efficient

just by looking at the accuracy.

Indeed, lets say that the ratio is 99% - for, per ex., the class 1 - against 1% - for class-1.A classifier which misclassifies every vector which belonging to class 1, but well classifesthe vectors of the other class will return a 99% accuracy. Yet if you are especiallyinterested in the other class in your study, this separator is not very useful.

There are several ways to avoid this problem, we will treat the most well known: differentcosts for misclassification.

A.1 Different costs for misclassification

Let us consider an unbalanced data set of the following form :

D= {( xi, yi), 1 i m | i, yi {1; 1} , xi Rq} , (m, q) N2 (A.1)

The optimization problem remains the same as in (1.9):

minw,b,

w2

2 + C

mi=1

i (A.2)

under( xi, yi) D, yi( w, xi+ b)> 1 i, i 0.

The solution is to replace the total misclassification penalty term Cm

i=1 i by a newone :

C+

jJ+

j+ C

jJ

j (A.3)

C+ 0, C 0,

17


20/22

Appendix A. Unbalanced data set 18

J+= {j 1, m | yj = 1} , J= {j 1, m | yj =1} .

One condition has to be satisfied, in order to give equal overall weight to each class :the total penalty term has to be the same for each class. A hypothesis commonly madeis to suppose that the number of misclassified vectors in each class is proportional to

the number of vector in each class, leading us to the following condition :

CCard(J) =C+Card(J+) (A.4)

If, for instance, Card(J) Card(J+), then C C +. A larger importance will begiven to misclassified vectors xi such that yi = 1.


21/22

Appendix B

Multi-class SVM

Several methods have been suggested to extend the previous SVM scheme to solvemultiple-class problems. [2] All the following schemes are applicable to any binaryclassifier, and are not exclusively related to SVM. The most famous methods are theone-versus-all and one-versus-one methods.

B.1 One-versus-all

In this and the following subsections, the training and testing sets can be classified inM classes C1, C2, ...CM.

Theone-versus-allmethod is based on the construction ofMbinary classifiers, each la-belling 1 a specified class, -1 the others. During the testing phase, the classifier providingthe highest margin wins the majority vote.

B.2 One-versus-one

Theone-versus-allmethod is based on the construction of M(M1)2 binary classifiers byconfronting each of the Mclasses. During the testing phase, every point is analysed byeach classifier, and a majority vote is conducted to determine its class. If we denote xt

the point to classify and hij the SVM classifier separating classes Ci, Cj , then the labelawarded to xt can be formally written :

Card({hij( xt)} {k} | i ,j,k [1; M], i < j) (B.1)

This represents the class awarded to xt most of the time, after being analysed by all theclassifiershij . Some ambiguity may exist in the counting of votes, if there is no majorityelection.

Both methods presents downsides. For theone-versus-allversion, nothing indicates thatthe classification results between the Mclassifiers are comparable. Besides, the problemisnt well-balanced anymore : for example, with M= 10, we use only 10% of positives

examples, against 90% negative ones.

19


22/22

Bibliography

[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory, 1995.

[2] Christopher M. Bishop. Pattern Recognition And Machine Learning, 2006.

20

svm & image classification

Documents