

Support Vector Machines versus Boosting

Hao Zhang, Chunhui Gu

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, U.S.A.

Email: {zhanghao, chunhui}@eecs.berkeley.edu

Abstract: Support Vector Machines (SVMs) and Adaptive Boosting (AdaBoost) are two successful classification methods. They are essentially similar in that both try to maximize the minimal margin on a training set. In this work, we present an even platform to compare these two learning algorithms in terms of their test error, margin distribution and generalization power. Two basic models, polynomials and decision stumps, are used to evaluate eight real-world binary datasets. We conclude that AdaBoost with linear SVMs as base learners and SVMs with decision stumps as kernels generalize better than the other scenarios. Although the training error of AdaBoost approaches zero as the number of weak learners increases, its test error starts to rise at a certain step. For both SVMs and AdaBoost, the cumulative margin distribution is indicative of the test error.

Key words: SVM, AdaBoost, margin

I. INTRODUCTION

SVMs and boosting are two learning techniques that have received considerable attention in recent years, and many successful applications have been described in the literature [1]. SVMs and boosting have something in common that justifies their success, namely the margin. The objective of SVMs is to maximize the separation between the classes. By using the kernel trick to map the training samples from an input space to a high dimensional feature space, an SVM finds an optimal separating hyperplane in the feature space and uses a regularization parameter to balance its model complexity and training error. While SVMs explicitly maximize the minimum margin, boosting tends to do the same thing indirectly by minimizing a cost function related to the margin. Boosting is a general technique for improving the performance of any given classifier [2]. It can effectively combine a number of weak classifiers into a strong classifier that can achieve an arbitrarily low error rate given sufficient training data, even though each weak classifier might do only a little better than random guessing. As the most popular boosting method, AdaBoost [6] creates a collection of weak learners by computing a set of weights over the training samples in each iteration and adjusting these weights based on the combined learner computed so far. The weights of the samples misclassified by the current learner are increased while those of the correctly classified samples are decreased. [2] explains the effectiveness of the boosting algorithm in improving generalization performance based on the notion of a margin, which can be interpreted as a measure of confidence in the prediction. Many studies that use decision trees or neural networks as weak learners for AdaBoost have been reported, and they show the good generalization performance of AdaBoost.

Much work has been done on both SVMs and AdaBoost in a number of applications. [5] studies AdaBoost with SVM weak classifiers. [3] aims to mathematically construct SVMs from boosting algorithms and vice versa. [7] works on scaling up SVMs using a boosting algorithm. However, little work has been done analytically comparing SVMs and AdaBoost in terms of their margin and test error using the same resource data. In this work we experimentally compare the training and test errors of SVMs and boosting when both techniques are given the same resources to work with. To do this, we focus on two basic learning models, the polynomial model and decision stumps, apply the two techniques, and compare their performance. We compare the margins produced by the two techniques and see how indicative they are of the test error. The results show that the training error of AdaBoost decreases to zero as weak classifiers are added. While the error decreases, the margin distribution under both SVMs and AdaBoost shifts to larger values and is thus indicative of the training and test error. We also compare the performance of both methods in terms of the size of the dataset, the dimension of the dataset, the stability of the algorithms, and computational complexity.

In Section II, we review the basic background of the SVM and AdaBoost algorithms. SVMs use quadratic programming to maximize the minimum distance of the data from the separating hyperplane, i.e. the margin, with a set of slack variables. AdaBoost minimizes the exponential loss function using forward stagewise additive modeling. It gives the data point with the smallest decision value the largest weight and aims to minimize the weighted error in the next iteration. Section III discusses the essence of both SVMs and AdaBoost. Both methods aim to find a linear combination in a high dimensional space that has a large margin on the instances in the sample. The norms used to define the margin are different. SVMs aim to maximize the minimal margin while boosting aims to minimize an exponential weighting of the examples as a function of their margins. In comparison, both boosting and SVMs aim to find a linear classifier in a very high dimensional space. The two algorithms also differ in terms of computation: SVMs directly solve a quadratic program to maximize the minimum margin, while AdaBoost indirectly decreases the proportion of small margins as the number of weak classifiers increases. Section IV depicts the flow of the work for SVMs and AdaBoost using the polynomial model and decision stumps, respectively. A description of the datasets is also given in this section. Section V demonstrates the results of SVMs and AdaBoost on these datasets in terms of training error, test error and margin distributions. The following sections analyze the results of the work and discuss future work.

II. SVMs and Boosting

A. Support Vector Machines

SVMs were developed from the theory of structural risk minimization. In a binary classification problem, the SVM decision function is expressed as:

$$f(x) = \operatorname{sign}\!\left(x^T\beta + \beta_0\right) \quad (1)$$

The argument $x^T\beta + \beta_0$ is proportional to the signed distance from a point $x$ to the hyperplane $x^T\beta + \beta_0 = 0$. In the nonseparable case, solving for the parameters of the decision function becomes a quadratic programming problem, i.e.:

$$\min_{\beta,\,\beta_0}\ \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^{N}\xi_i \qquad \text{subject to}\quad \xi_i \ge 0,\ \ y_i\!\left(x_i^T\beta + \beta_0\right) \ge 1 - \xi_i,\ \ \forall i \quad (2)$$

By using Lagrange multipliers, we obtain the Lagrangian dual objective function:

$$L_D = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\, y_i y_j\, x_i^T x_j \quad (3)$$

subject to:

$$\begin{cases} \alpha_i\left[y_i\!\left(x_i^T\beta + \beta_0\right) - (1 - \xi_i)\right] = 0 \\[2pt] y_i\!\left(x_i^T\beta + \beta_0\right) - (1 - \xi_i) \ge 0 \\[2pt] \beta = \sum_{i=1}^{N}\alpha_i y_i x_i,\quad \sum_{i=1}^{N}\alpha_i y_i = 0,\quad \mu_i\xi_i = 0 \\[2pt] \alpha_i = \gamma - \mu_i,\quad \alpha_i \ge 0,\ \mu_i \ge 0 \end{cases} \quad (4)$$

Together these equations uniquely characterize the solution to the primal and dual problems, with $\gamma$ the tuning parameter. From the above equations we can see that $\beta = \sum_{i=1}^{N}\alpha_i y_i x_i$, with nonzero coefficients $\alpha_i$ only for those observations $i$ for which the second constraint is exactly met. These observations are called support vectors, and $\beta$ is represented in terms of them alone.

By applying the kernel trick, the Lagrangian dual objective function can be written as:

$$L_D = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\, y_i y_j\, K\!\left(x_i, x_j\right) \quad (5)$$

where $K(x_i, x_j)$ is the kernel evaluated on the data points $x_i$ and $x_j$.

In this work, we mainly focus on two learning models: the polynomial model and decision stumps. In the polynomial model, the kernel simply becomes $K(x_i, x_j) = \left(1 + x_i\cdot x_j\right)^M$, where $M$ is the degree of the polynomial. In the decision stump model, the kernel becomes a dot product of two vectors, $K(x_i, x_j) = h(x_i)\cdot h(x_j)$, where each $h(x)$ consists of the outputs of a set of decision stumps on $x$.
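As an illustration, both kernels can be written down directly. The following Python sketch is our own (function and variable names are not from the paper); it computes the polynomial kernel matrix and the decision-stump kernel matrix given a matrix of stump outputs standing in for $h(x)$:

```python
import numpy as np

def polynomial_kernel(X1, X2, M):
    """K(x_i, x_j) = (1 + x_i . x_j)^M, computed for all pairs of rows."""
    return (1.0 + X1 @ X2.T) ** M

def stump_kernel(H1, H2):
    """K(x_i, x_j) = h(x_i) . h(x_j), where H1 and H2 hold the +/-1 stump
    outputs h(x) of each sample as rows."""
    return H1 @ H2.T

# toy usage with random data and a hypothetical stump mapping
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K_poly = polynomial_kernel(X, X, M=4)      # 5 x 5 Gram matrix
H = np.sign(rng.normal(size=(5, 7)))       # stand-in for stump outputs h(x)
K_stump = stump_kernel(H, H)               # 5 x 5 Gram matrix
```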

B. AdaBoost

The ensemble method, which finds a highly accurate classifier by combining many moderately accurate component classifiers, has recently been very successful in machine learning. One of the most commonly used techniques for constructing ensemble classifiers is AdaBoost. AdaBoost finds a combination of a number of weak classifiers in a stepwise additive manner. The weak classifier in each iteration step is trained on data resampled according to a distribution defined by a series of weights, which are in turn derived from the training error of the learner computed so far. The success of AdaBoost can be explained as enlarging the margin [4], which enhances AdaBoost's generalization capability. In general, boosting is a way of fitting an additive expansion in a set of elementary "basis" functions, which takes the form:

$$f(x) = \sum_{m=1}^{M}\beta_m\, b\!\left(x;\gamma_m\right) \quad (6)$$

where $\beta_m$, $m = 1,2,\ldots,M$, are the expansion coefficients, and $b(x;\gamma_m)\in\mathbb{R}$ are usually simple functions of the multivariate argument $x$, characterized by a set of parameters $\gamma$. Typically the learning models formed by these additive expansions are fit by minimizing a loss function averaged over the training data, i.e.:

$$\min_{\{\beta_m,\,\gamma_m\}}\ \sum_{i=1}^{N} L\!\left(y_i,\ \sum_{m=1}^{M}\beta_m\, b\!\left(x_i;\gamma_m\right)\right) \quad (7)$$

To reduce the computational burden, an alternative can often be found when it is feasible to solve the subproblem of fitting just a single basis function:

$$\min_{\beta,\,\gamma}\ \sum_{i=1}^{N} L\!\left(y_i,\ \beta\, b\!\left(x_i;\gamma\right)\right) \quad (8)$$

By using forward stagewise modeling [8], we can approximate the solution by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added; the AdaBoost method is obtained in this way. One of the principal attractions of the exponential loss function in the context of additive modeling is computational. Recall the maximum a posteriori (MAP) rule: the idea is to choose a function $L(\cdot)$ that minimizes

$$E\!\left(L\!\left(Y, h(X)\right)\,\middle|\,X\right) \quad (9)$$

where $h(X)$ is the basis function expansion. Using the exponential loss function, one has

$$L\!\left(y_i, h_m(x_i)\right) = \exp\!\left(-y_i\, h_m(x_i)\right) \quad (10)$$

where the basis functions are the individual classifiers $h_m(x)\in\{-1,1\}$ and the $y_i$ are the labels. The combined classifier is:

$$h(x) = \sum_{m=1}^{M}\alpha_m h_m(x) \quad (11)$$

Define

$$f_n(x) = \sum_{m=1}^{n}\alpha_m h_m(x) \quad (12)$$

Using forward stagewise modeling, one must solve:

$$\left(\alpha_n, h_n\right) = \arg\min_{\beta,\,G}\ \sum_{i=1}^{N}\exp\!\left[-y_i\left(f_{n-1}(x_i) + \beta\, G(x_i)\right)\right] \quad (13)$$

And this can be expressed as:

$$\left(\alpha_n, h_n\right) = \arg\min_{\beta,\,G}\ \sum_{i=1}^{N}\omega_i^{(n)}\exp\!\left(-\beta\, y_i G(x_i)\right) \quad (14)$$

Since $\omega_i^{(n)}$ depends neither on $\alpha_n$ nor on $h_n$, it can be regarded as a weight applied to each observation. This weight depends only on the combined weak classifier $f_{n-1}(x_i)$ computed so far, and the individual weight values change with each iteration $n$. For any fixed value of $\alpha_n > 0$, the solution for $h_n(x)$ is:

$$h_n = \arg\min_{G}\ \sum_{i=1}^{N}\omega_i^{(n)}\, I\!\left(y_i \ne G(x_i)\right) \quad (15)$$

By substituting the solved $h_n$ into equation (13), one gets:

$$\alpha_n = \frac{1}{2}\log\!\left[\left(1 - err_n\right)/err_n\right] \quad (16)$$

where

$$err_n = \sum_{i=1}^{N}\omega_i^{(n)}\, I\!\left(y_i \ne h_n(x_i)\right)\Bigg/\sum_{i=1}^{N}\omega_i^{(n)} \quad (17)$$

And the approximation of the basis expansions is then updated through:

$$f_n(x) = f_{n-1}(x) + \alpha_n h_n(x) \quad (18)$$

which causes the weights on the observations for the next iteration to be

$$\omega_i^{(n+1)} = \omega_i^{(n)}\exp\!\left(-\alpha_n\, y_i h_n(x_i)\right) \quad (19)$$

Note that:

$$-y_i h_n(x_i) = 2\, I\!\left(y_i \ne h_n(x_i)\right) - 1 \quad (20)$$

we then have:

$$\omega_i^{(n+1)} = \omega_i^{(n)}\, e^{2\alpha_n I\left(y_i \ne h_n(x_i)\right)}\, e^{-\alpha_n} \quad (21)$$

Thus we can summarize the AdaBoost algorithm as follows:

AdaBoost
1. Initialize the observation weights $\omega_i = 1/N$.
2. For $m = 1$ to $M$:
   (a) Fit a classifier $h_m(x)$ to the training data using weights $\omega_i$.
   (b) Compute $err_m$ and $\alpha_m = \log\!\left[\left(1 - err_m\right)/err_m\right]$.
   (c) Update $\omega_i \leftarrow \omega_i\exp\!\left[\alpha_m I\!\left(y_i \ne h_m(x_i)\right)\right]$.
3. Output $f(x) = \operatorname{sign}\!\left(\sum_{m=1}^{M}\alpha_m h_m(x)\right)$.
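The following Python sketch mirrors the boxed algorithm above; it is our own illustration, not code from the paper, and it uses depth-1 trees from scikit-learn as a stand-in for the weak learners. Labels are assumed to take values in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 tree acts as a decision stump

def adaboost_fit(X, y, M):
    """Train AdaBoost per steps 1-3 above; y must take values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                       # step 1: uniform observation weights
    learners, alphas = [], []
    for _ in range(M):                            # step 2
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # (a) weighted fit
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w) # (b) weighted error
        err = np.clip(err, 1e-12, 1 - 1e-12)      # numerical guard (our addition)
        alpha = np.log((1.0 - err) / err)
        w = w * np.exp(alpha * (pred != y))       # (c) weight update
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(X, learners, alphas):
    """Step 3: sign of the weighted vote of the weak learners."""
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)
```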

Now return to the MAP rule for the loss function $L\!\left(Y, f(X)\right) = \exp\!\left(-y f(x)\right)$:

$$f^*(x) = \arg\min_{f(x)}\ E_{Y|x}\!\left(e^{-Y f(x)}\right) = \arg\min_{f(x)}\ \left[\Pr(Y=1\,|\,x)\,e^{-f(x)} + \Pr(Y=-1\,|\,x)\,e^{f(x)}\right] \quad (22)$$

And it reaches its minimum when:

$$\Pr(Y=1\,|\,x)\,e^{-f^*(x)} = \Pr(Y=-1\,|\,x)\,e^{f^*(x)} \;\Rightarrow\; f^*(x) = \frac{1}{2}\log\frac{\Pr(Y=1\,|\,x)}{\Pr(Y=-1\,|\,x)} \quad (23)$$

Thus, the additive expansion produced by AdaBoost justifies using its sign as the classification rule. Note that the weak classifiers can be of any kind; in this work we mainly focus on linear SVMs and decision stumps.

III. Maximize the Margin

It is a common statement that SVMs and boosting algorithms have something essentially the same to justify their success, namely the margin. SVMs use the $\ell_2$ norm and boosting uses the $\ell_1$ norm on the coefficients related to the margin. While SVMs explicitly maximize the minimum margin, boosting tends to do so indirectly by minimizing a cost function related to the margin.

The use of the margins of classifiers to predict generalization error was previously studied by Vapnik [9]. One of the main ideas behind optimal margin classification is that some nonlinear classifiers on a low dimensional space can be transformed into linear classifiers over a high dimensional space. For example, consider the classifier that labels an instance $x\in\mathbb{R}$ as +1 if $2x^5 - 3x^2 + 4x > 5$ and -1 otherwise. This classifier can be seen as a linear classifier if we represent each instance of the data by the vector $h(x) = \left(1, x, x^2, x^3, x^4, x^5\right)$ and we set $\alpha = (-5, 4, -3, 0, 0, 2)$. As a result, the classification is +1 when $\alpha\cdot h(x) > 0$ and -1 otherwise. Using kernels, it is usually easy to find an efficient way of calculating the predictions of linear classifiers in the high dimensional space. In addition, the prescription suggested by Vapnik, namely to find the classifier that maximizes the minimal margin, is used to enhance the generalization ability of the possible classifiers. Consider a binary classification example. Suppose the training sample $S$ consists of pairs of the form $(x, y)$, where $x$ is the instance and $y\in\{-1,+1\}$ is its label. Assume that $h(x)$ is some fixed nonlinear mapping of instances into $\mathbb{R}^n$, where $n$ is typically very large. Then the maximal margin classifier is defined by the vector $\alpha$ which maximizes

$$\min_{(x,y)\in S}\ \frac{y\left(\alpha\cdot h(x)\right)}{\|\alpha\|_2} \quad (24)$$

where $\|\alpha\|_2$ is the $\ell_2$ (Euclidean) norm of the vector.

Recall that for an SVM the decision value is

$$h(x) = z^T\beta + \beta_0 \quad (25)$$

where $z$ is the transformation of $x$ that corresponds to the SVM generalized inner product. The goal of the SVM is to maximize the minimum of the above expression. From expression (4) we know that $\beta = \sum_{i=1}^{N}\alpha_i y_i x_i$, with non-zero coefficients only for the support vectors. Denote them $x_{s_i}$, $i = 1,2,\ldots,K$, where $K$ is the number of support vectors. Let $c_{s_i} = \alpha_{s_i} y_{s_i}$; then for a given data point $x$, its decision value can be expressed as:

$$h(x) = \sum_{i=1}^{K} c_{s_i}\, K\!\left(x_{s_i}, x\right) + \beta_0 \quad (26)$$

where $K(x_{s_i}, x)$ is the kernel value of the pair $\left(x_{s_i}, x\right)$. The normalization factor in expression (24) can be written as:

$$\|\alpha\|_2^2 = \sum_{i=1}^{K}\sum_{j=1}^{K} c_{s_i} c_{s_j}\, K\!\left(x_{s_i}, x_{s_j}\right) + \beta_0^2 \quad (27)$$

Thus the margin of a data point $(x, y)$ for the SVM is expressed as:

$$\text{margin}_{\mathrm{SVM}}(x, y) = \frac{y\left(\sum_{i=1}^{K} c_{s_i}\, K\!\left(x_{s_i}, x\right) + \beta_0\right)}{\sqrt{\sum_{i=1}^{K}\sum_{j=1}^{K} c_{s_i} c_{s_j}\, K\!\left(x_{s_i}, x_{s_j}\right) + \beta_0^2}} \quad (28)$$

The margin of a dataset is then defined as the minimum of the above expression over all data pairs.
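As a concrete illustration (our own sketch, not code from the paper), the quantities in equations (26)-(28) can be read off a fitted kernel SVM in scikit-learn: dual_coef_ holds the products $c_{s_i} = \alpha_{s_i} y_{s_i}$, support_vectors_ holds the $x_{s_i}$, and intercept_ is $\beta_0$. The parameter values and the toy data below are arbitrary choices for the example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import polynomial_kernel

# toy data with labels in {-1, +1}
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = np.where(X_train[:, 0] + 0.5 * rng.normal(size=100) > 0, 1, -1)
X_test = rng.normal(size=(30, 5))
y_test = np.where(X_test[:, 0] + 0.5 * rng.normal(size=30) > 0, 1, -1)

# polynomial-kernel SVM; gamma=1, coef0=1 gives K(u, v) = (1 + u.v)^M
M, C = 4, 0.5
clf = SVC(kernel="poly", degree=M, gamma=1.0, coef0=1.0, C=C).fit(X_train, y_train)

c = clf.dual_coef_.ravel()                     # c_{s_i} = alpha_{s_i} * y_{s_i}
sv = clf.support_vectors_                      # x_{s_i}
b0 = clf.intercept_[0]                         # beta_0

# equation (27): squared norm of the coefficient vector in feature space
K_sv = polynomial_kernel(sv, sv, degree=M, gamma=1.0, coef0=1.0)
norm_sq = c @ K_sv @ c + b0 ** 2

# equations (26) and (28): decision values and margins on the test data
K_x = polynomial_kernel(X_test, sv, degree=M, gamma=1.0, coef0=1.0)
h = K_x @ c + b0                               # equation (26)
margins = y_test * h / np.sqrt(norm_sq)        # equation (28)
dataset_margin = margins.min()                 # minimum over all data pairs
```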

Recall that for AdaBoost, one tries to find the weak classifier in the $n$-th step by minimizing the following function:

$$\sum_{i=1}^{N}\exp\!\left[-y_i\left(f_{n-1}(x_i) + \beta\, G(x_i)\right)\right] = \sum_{i=1}^{N}\omega_i^{(n)}\exp\!\left(-\beta\, y_i G(x_i)\right) \quad (29)$$

i.e., to find a weak classifier that minimizes the training error on the data with the smallest $y_i f_{n-1}(x_i)$. Thus we can define the margin of AdaBoost to be:

$$\min_{(x,y)}\ y\, h(x) = \min_{(x,y)}\ y\left(\alpha\cdot h(x)\right) = \min_{(x,y)}\ y\, f(x) \quad (30)$$

Thus the minimum of $y f_{n-1}(x)$ at step $n$ has the largest weight, and the next step is to find a classifier that minimizes the error on that data, i.e., to maximize the margin. Here we collect the coefficients of the weak classifiers into $\alpha = \left(\alpha_1, \alpha_2, \ldots, \alpha_M\right)$ and the classifiers into $h = \left(h_1(x), h_2(x), \ldots, h_M(x)\right)$, and assume the coefficient vector $\alpha$ is a unit vector.

Note that by the construction above, we only know that by adding a new weak learner the algorithm tends to minimize the cumulative error on the data that have been "mostly" misclassified by the most recent combined learner. How do we know that the overall combined classifier tends to minimize the maximum error, i.e., to maximize the minimum margin, as the number of weak classifiers increases? This has been proved in [6]. The idea is as follows. Let $D$ be some fixed but unknown distribution over $X\times\{-1,+1\}$. The training set consists of $N$ pairs $S = \left\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\right\}$ chosen according to the distribution $D$. The terms $err_m$, $m = 1,2,\ldots,M$, are the weighted training errors of the weak classifiers at each iteration, as defined in (17). It can be shown that for any value of $\theta$:

$$P_S\!\left(y f(x) \le \theta\right) \le 2^M \prod_{m=1}^{M}\sqrt{err_m^{\,1-\theta}\left(1 - err_m\right)^{1+\theta}} \quad (31)$$

which means the probability that the margin on the training set is less than any given value $\theta$ is always less than or equal to a quantity computed from the number of classifiers and the weighted errors. Since in this work we only consider binary prediction problems, a random prediction is correct exactly half of the time. Thus we can assume that for all $m$, $err_m \le 1/2 - \gamma$ for some $\gamma > 0$, i.e., the predictions obtained by the weak classifiers are at least slightly better than random guessing. Given this condition, we can simplify the upper bound in (31) to [6]:

$$P_S\!\left(y f(x) \le \theta\right) \le \left(\sqrt{\left(1 - 2\gamma\right)^{1-\theta}\left(1 + 2\gamma\right)^{1+\theta}}\right)^{M} \quad (32)$$

Since $\gamma > 0$, we can always choose $\theta < \gamma$, and then the expression inside the outer parentheses above is smaller than 1. Note that $y f(x)$ is defined as the margin of the data pair $(x, y)$. Therefore the probability that $y f(x) \le \theta$, i.e., the probability that the margin is less than $\theta$, decreases to zero exponentially fast with the number of weak classifiers. In particular, the probability of misclassification on the training set approaches 0 as the number of classifiers increases. Viewed this way, the function of AdaBoost becomes clear: it tries to minimize the error by maximizing the minimum margin.
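To see the exponential decay concretely, the right-hand side of (32) can be evaluated numerically; the values of $\gamma$, $\theta$ and $M$ below are arbitrary illustrative choices of ours, not values from the paper.

```python
import numpy as np

def margin_bound(gamma, theta, M):
    """Upper bound (32) on the fraction of training margins at most theta,
    assuming every weak learner has weighted error at most 1/2 - gamma."""
    base = np.sqrt((1 - 2 * gamma) ** (1 - theta) * (1 + 2 * gamma) ** (1 + theta))
    return base ** M

# with theta < gamma the base is below 1, so the bound decays exponentially in M
for M in (10, 50, 200, 1000):
    print(M, margin_bound(gamma=0.1, theta=0.05, M=M))
```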

From the discussion above, we can see that both SVMs and AdaBoost aim to find a linear combination in a transformed high dimensional space that has a large margin on the instances in the sample, and thus both algorithms try to maximize the margin in a high dimensional space. The norms used to define the margin are different: SVMs use the $\ell_2$ norm and AdaBoost uses the $\ell_1$ norm. In addition, the goals of the two methods are different: SVMs aim to directly maximize the margin, while boosting aims to minimize an exponential weighting of the examples as a function of their margins. In terms of computation, SVMs use kernel tricks to perform computations in the high dimensional space, while boosting relies on finding a weak classifier in each step, exploring the high dimensional space one coordinate at a time.

IV. Learning Models

In this work we compare SVMs and AdaBoost using the polynomial model and decision stumps, respectively. In the polynomial case, SVMs transform the data into a high dimensional space by using a high order polynomial kernel, while AdaBoost ensembles a number of linear SVM classifiers. In the decision stump case, we transform the data into a high dimensional vector of elements in $\{\pm 1\}$ by passing it through a set of decision stumps; we then learn an SVM with a linear kernel on the transformed data, while AdaBoost ensembles a number of decision stumps selected from a finite set of classifiers.

A. Polynomial Model

In linear SVMs, we choose the classifier to be $f(x) = \operatorname{sign}\!\left(x^T\beta + \beta_0\right)$, where $\beta$ and $\beta_0$ are model parameters. By using the kernel trick, the classifier becomes $f(x) = \operatorname{sign}\!\left(z^T\beta + \beta_0\right)$, where $z$ is the transformation of $x$ that corresponds to a polynomial kernel: $\langle z_1, z_2\rangle = K\!\left(x_1, x_2\right) = \left(1 + x_1\cdot x_2\right)^M$, where $M$ is the degree (complexity) of the polynomial model. The maximum-minimum margin is defined as $1/\|\beta\|_2$. In AdaBoost, we combine several weak classifiers $h_m(x) = \operatorname{sign}\!\left(x^T\beta_m + \beta_{m0}\right)$ into a final "strong" classifier $f(x) = \operatorname{sign}\!\left(h(x)\right)$, where $h(x) = \sum_{m=1}^{M}\alpha_m h_m(x)$.

In order to find a weak classifier using the updated weights, we resample the data according to the distribution given by $\omega_i^{(n)}$. Here we choose the resampling ratio to be 20% to increase the training speed. Due to the down-sampling, a weak classifier could yield an error greater than 0.5. Note that the original AdaBoost algorithm stops when $err_m > 1/2$. We modify this so that we invert the trained weak classifier, $h_m(x) \leftarrow -h_m(x)$, which makes $err_m < 1/2$. This works in the binary classification case because if $err_m > 1/2$, then more than half of the instances are wrongly classified; inverting the classifier, so that instances originally classified as -1 are classified as +1 and vice versa, yields an error of $1 - err_m < 1/2$. In this work we use linear SVMs as the weak classifiers, so as to compare with SVMs of higher order polynomials. In sum, AdaBoost with linear SVMs as weak classifiers proceeds as follows:

AdaBoost with linear SVM weak learners
1. Initialize the observation weights $\omega_i = 1/N$.
2. For $m = 1$ to $M$:
   (a) Resample the data according to the distribution $\omega_i$ and train a weak classifier $h_m(x) = \operatorname{sign}\!\left(x^T\beta_m + \beta_{m0}\right)$ with a linear SVM on the resampled training set.
   (b) Compute $err_m$ and $\alpha_m$. If $err_m > 1/2$, invert the weak classifier.
   (c) Update $\omega_i$.
3. Output $f(x) = \operatorname{sign}\!\left(\sum_{m=1}^{M}\alpha_m h_m(x)\right)$.
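A minimal Python sketch of this variant is given below; it is our own illustration rather than the authors' code, it assumes labels in {-1, +1}, and it uses scikit-learn's LinearSVC as the linear SVM weak learner, with a 20% weighted resample per round and the error-inversion trick described above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def adaboost_linear_svm(X, y, M, resample_ratio=0.2, seed=0):
    """AdaBoost with resampled linear SVM weak learners; y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N = len(y)
    w = np.full(N, 1.0 / N)
    learners, alphas, signs = [], [], []
    for _ in range(M):
        # (a) weighted resample of 20% of the data, then fit a linear SVM
        idx = rng.choice(N, size=int(resample_ratio * N), p=w / w.sum())
        clf = LinearSVC()
        clf.fit(X[idx], y[idx])
        pred = np.where(clf.decision_function(X) > 0, 1, -1)
        # (b) weighted error on the full set; invert the classifier if err > 1/2
        err = np.sum(w * (pred != y)) / np.sum(w)
        sign = 1.0
        if err > 0.5:
            sign, err, pred = -1.0, 1.0 - err, -pred
        alpha = np.log((1.0 - err) / max(err, 1e-12))
        # (c) weight update
        w = w * np.exp(alpha * (pred != y))
        learners.append(clf); alphas.append(alpha); signs.append(sign)

    def predict(Xq):
        votes = sum(a * s * np.where(c.decision_function(Xq) > 0, 1, -1)
                    for a, s, c in zip(alphas, signs, learners))
        return np.sign(votes)
    return predict
```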

This sets up an even platform for comparison between the SVM and AdaBoost methods with a polynomial model. We use artificial data to help visualize the performance of the algorithms. The artificial dataset is generated from a mixture of 2-dimensional Gaussians. A demonstration of classification using the polynomial model is given in the following figure.

Fig. 1. Classification of mixed 2D Gaussian data. (a) Polynomial SVM, degree = 5, cost = 1. (b) AdaBoost, linear SVM weak learner, 500 learners.

B. Decision Stump

Given an instance-label pair $(x, y)$, where $x\in\mathbb{R}^D$ and $y\in\{\pm 1\}$, a decision stump classifier can be simply expressed as:

$$h(x) = \operatorname{sign}\!\left(\pm\left(x^m - t\right)\right) \quad (33)$$

where $x^m$ is the $m$-th attribute of $x$ and $t$ is a fixed threshold. As can be seen from the equation, the decision stump can either step up or step down with $x^m$. Hence, each $h(x)$ can be characterized by three parameters $\{t, m, s\}$, with $t$ the threshold value, $m$ the dimension that the classifier applies to, and $s\in\{\pm 1\}$ indicating whether the classifier steps up or down. Assume our training set comprises $N$ instance-label pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, and each $x_i$ consists of $D$ attributes (which we also call dimensions) $\left[x_i^1, x_i^2, \ldots, x_i^D\right]$. The decision stump classifiers can be constructed from the training set as follows:

Decision Stump Classifier Construction
For $m = 1$ to $D$:
1. Collect the $m$-th attribute of each instance and form a vector $\left[x_1^m, x_2^m, \ldots, x_N^m\right]$. Sort it and remove duplicated elements, so that the remaining vector $\left[x_1^m, x_2^m, \ldots, x_{j_m}^m\right]$ is monotonically increasing.
2. Set a threshold vector $\left[t_0^m, t_1^m, t_2^m, \ldots, t_{j_m}^m\right]$ with
$$t_0^m = x_1^m - 0.5\left(x_2^m - x_1^m\right)$$
$$t_i^m = 0.5\left(x_i^m + x_{i+1}^m\right),\quad i = 1, 2, \ldots, j_m - 1$$
$$t_{j_m}^m = x_{j_m}^m + 0.5\left(x_{j_m}^m - x_{j_m - 1}^m\right)$$
3. Collect a set of decision stump classifiers for the $m$-th dimension, $H_m = \{h_i\}_{i=1}^{2 j_m}$, where $h_i \sim \{t_{i-1}, m, +1\}$ are step-up decision stumps and $h_i \sim \{t_i, m, -1\}$ are step-down ones. At most $2 j_m \le 2N$ decision stumps can be found such that any two of them produce different label patterns on the $m$-th attribute vector.

Finally, $H = \{H_m\}_{m=1}^{D}$ is the entire set of our decision stump classifiers. Hence, each attribute results in at most $2N$ decision stumps, so that the entire training set yields at most $2DN$ distinct ones. Let $K \le 2DN$ be the total number of decision stumps constructed from the training set. These classifiers, which we denote $h_1, h_2, \ldots, h_K$, help to set up an even platform for comparison between the SVM and AdaBoost algorithms. The detailed implementations of the SVM and AdaBoost algorithms using decision stumps are as follows.

The key idea behind the SVM is to find a transformation that maps the original instance vector to a higher dimensional space. Hence, we consider the transformation $Z: X \to \{\pm 1\}^K$ with $Z(x) = \left\{h_i(x)\right\}_{i=1}^{K}$, the vector of outputs of the set of decision stump classifiers constructed above. The transformation maps the data from dimension $D$ to dimension $K \gg D$. Once we obtain $z = Z(x)$, we train a linear SVM on it and return a classifier of the form:

$$f(x) = \operatorname{sign}\!\left(z^T\beta + \beta_0\right) = \operatorname{sign}\!\left(\sum_{i=1}^{K}\beta_i h_i(x) + \beta_0\right) \quad (34)$$

where $\beta$ and $\beta_0$ are the parameters optimized by the SVM. The SVM margin of each instance vector is defined as:

$$mar(x_i) = \frac{y_i\left(z_i^T\beta + \beta_0\right)}{\|\beta\|_2} \quad (35)$$

In the following we introduce the margin function of the AdaBoost algorithm and show that its value lies within the range [-1, +1]. To set up an even platform for comparison between SVM and AdaBoost, we would like to rescale the SVM margin to lie within the same range. Fortunately, since

$$\|z_i\|_2^2 = \sum_{k=1}^{K}\left(z_i^k\right)^2 = \sum_{k=1}^{K} 1 = K \quad (36)$$

is independent of $i$, we can rescale the margin function by dividing the original one by the constant $\|z_i\|_2 = \sqrt{K}$:

$$mar(x_i) = \frac{y_i\left(z_i^T\beta + \beta_0\right)}{\|\beta\|_2\,\|z_i\|_2} \quad (37)$$

It can be easily shown that the new margin value lies in the range [-1,+1].

The key idea behind AdaBoost is to find a set of weak classifiers that are used to construct the final classifier. Here we use the $K$ decision stumps constructed above as the candidate weak classifiers. Unlike the polynomial case, here we have a finite set (of size $K$) of possible weak classifiers. Therefore, in each iteration of the AdaBoost algorithm, we are able to find the weak classifier with the lowest weighted error by enumerating all the possibilities and choosing the best. The final classifier of AdaBoost has a form similar to that of the SVM:

$$f(x) = \operatorname{sign}\!\left(\sum_{j=1}^{N_C}\alpha_j h_j(x)\right) \quad (38)$$

where $N_C$ is the number of weak classifiers and $\alpha$ is the set of coefficients computed over the iterations. The AdaBoost margin of each instance vector is defined as

$$mar(x_i) = \frac{y_i\sum_{j=1}^{N_C}\alpha_j h_j(x_i)}{\|\alpha\|_1} \quad (39)$$

Since both $y_i$ and $h_j(x_i)$ take values in $\{\pm 1\}$, it is easy to show that these margins lie within the range [-1, +1].

This sets up an even platform for comparison between SVM and AdaBoost methods using decision stumps.
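The two margin definitions (37) and (39) can be computed side by side once the stump outputs are available. The sketch below is our own illustration, not the authors' code; it reuses the hypothetical build_stumps and stump_predict helpers from the earlier construction sketch, uses scikit-learn's LinearSVC, and the C value in the usage comment is an arbitrary example.

```python
import numpy as np
from sklearn.svm import LinearSVC

def stump_features(X, stumps):
    """Z(x): the K-dimensional vector of +/-1 stump outputs for each sample."""
    return np.column_stack([stump_predict(X, t, m, s) for (t, m, s) in stumps])

def svm_margins(Z, y, clf):
    """Equation (37): rescaled SVM margins, one per instance, in [-1, +1]."""
    beta, beta0 = clf.coef_.ravel(), clf.intercept_[0]
    K = Z.shape[1]
    return y * (Z @ beta + beta0) / (np.linalg.norm(beta) * np.sqrt(K))

def adaboost_margins(Z_sel, y, alphas):
    """Equation (39): y_i * sum_j alpha_j h_j(x_i) / ||alpha||_1, where Z_sel
    holds only the outputs of the selected weak stumps."""
    return y * (Z_sel @ alphas) / np.abs(alphas).sum()

# usage sketch (X, y assumed to be a training set with y in {-1, +1}):
# stumps = build_stumps(X)
# Z = stump_features(X, stumps)
# clf = LinearSVC(C=0.01).fit(Z, y)
# print(np.sort(svm_margins(Z, y, clf)))   # empirical margin distribution
```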

Fig. 2. Classification of mixed 2D Gaussian data. (a) Decision stump SVM, cost = 0.01. (b) AdaBoost, decision stump weak learner, 500 learners.

The following figures show the training and test errors of SVMs and AdaBoost for the two models.

Fig. 3. AdaBoost training and test errors on the artificial data versus the number of weak classifiers. (a) Decision stump. (b) Polynomial.

It can be seen that the training error decreases while the test error approaches a certain level as the number of weak learners increases. We will see more of this in the experiments on real-world data, where the test error can rise after a certain step.

V. Experiments

A. Dataset

We use real-world data in our performance evaluation, obtained from the UCI Machine Learning Repository and StatLog [10]. We select 8 relatively complete datasets that vary substantially in size and dimension (Table 1). They were transformed to a standard format, attributes followed by outputs, and incomplete examples were removed.

Table 1. Characteristics of the real-world datasets¹

Dataset       | Number | Dimension | Origin
Australian    | 690    | 14        | StatLog
Breast-cancer | 683    | 9         | UCI
Cleveland     | 297    | 13        | UCI
Diabetes      | 768    | 8         | UCI
German        | 1000   | 24        | StatLog
Heart         | 270    | 13        | StatLog
Sonar         | 208    | 60        | UCI
Votes84       | 435    | 16        | UCI

¹ To see a description of these datasets, please go to http://www.ics.uci.edu/~mlearn/MLRepository.html.

B. Results

We apply both the polynomial and decision stump models as basic classifiers in our experiments. In both cases, we show the training and test errors as the number of weak classifiers in AdaBoost increases, as well as the margin distributions on both the training and test data. For the SVMs we use cross-validation to select the best parameters for each model (cost and degree for the polynomial model; cost only for the decision stump model). The results are shown as follows: Figures 4.1 to 4.8 show the results for the polynomial model and Figures 5.1 to 5.8 show those for the decision stump model.
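For instance, the parameter search for the polynomial-kernel SVM can be done with a standard grid search; this is a sketch of how it could be set up with scikit-learn (the candidate grids and toy data are our own illustrative choices, not the values used in the paper):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# toy training split with labels in {-1, +1}
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, 1, -1)

# cross-validate cost C and polynomial degree M for the (1 + x.x')^M kernel
param_grid = {"C": [0.1, 0.5, 1, 5], "degree": [1, 2, 3, 4]}
search = GridSearchCV(SVC(kernel="poly", gamma=1.0, coef0=1.0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```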


Polynomial

Fig. (a) AdaBoost training and test error versus the number of weak classifiers. (b) Cumulative margin distribution function on the training and test data of AdaBoost with 4000 classifiers and of the SVMs. (c) Cumulative margin distribution function on the training data of AdaBoost with different numbers of classifiers.

Fig. 4.1. Dataset 1, australian (size 690, dimension 14); SVM parameters C = 0.5, M = 4.
Fig. 4.2. Dataset 2, breast-cancer-wisconsin (size 683, dimension 9); SVM parameters C = 1, M = 4.

Fig. 4.3. Dataset 3, cleveland (size 297, dimension 13); SVM parameters C = 1, M = 3.
Fig. 4.4. Dataset 4, german (size 1000, dimension 24); SVM parameters C = 0.1, M = 3.

Fig. 4.5. Dataset 5, heart (size 270, dimension 13); SVM parameters C = 0.5, M = 3.
Fig. 4.6. Dataset 6, house-votes-84 (size 435, dimension 16); SVM parameters C = 0.5, M = 2.

Fig. 4.7. Dataset 7, sonar (size 208, dimension 60); SVM parameters C = 0.5, M = 1.
Fig. 4.8. Dataset 8, pima-indians-diabetes; SVM parameters C = 1, M = 4.

Decision Stumps

Fig. (a) AdaBoost training and test error versus the number of weak classifiers. (b) Cumulative margin distribution on the training and test data of AdaBoost with 4000 classifiers and of the SVMs. (c) Cumulative margin distribution on the training data of AdaBoost with different numbers of classifiers.

Fig. 5.1. Dataset 1, australian (size 690, dimension 14); SVM parameter C = 0.005.
Fig. 5.2. Dataset 2, breast-cancer-wisconsin (size 683, dimension 9); SVM parameter C = 0.01.

Fig. 5.3. Dataset 3, cleveland (size 297, dimension 13); SVM parameter C = 0.005.
Fig. 5.4. Dataset 4, german (size 1000, dimension 24); SVM parameter C = 0.02.

Fig. 5.5. Dataset 5, heart (size 270, dimension 13); SVM parameter C = 0.05.
Fig. 5.6. Dataset 6, house-votes-84 (size 435, dimension 16); SVM parameter C = 0.5.

Fig. 5.7. Dataset 7, sonar (size 208, dimension 60); SVM parameter C = 0.001.
Fig. 5.8. Dataset 8, pima-indians-diabetes (size 768, dimension 8); SVM parameter C = 0.001.


VI. Discussion

We compare the training and test errors and the margins of the two methods on several real-world datasets. We set the ratio of training data to test data to 3:1.

From the margin distribution plots, we see that the minimum margin on the training set increases with the number of weak classifiers in AdaBoost. Hence, maximizing the minimum margin is a criterion common to both SVMs and AdaBoost.

Table 2 lists a set of features that we use to evaluate SVMs and AdaBoost, giving a comparison of the two algorithms.

Table 2. A comparison of SVMs and AdaBoost using the polynomial and decision stump models respectively

Feature                        | SVM, polynomial                   | SVM, decision stump       | AdaBoost, polynomial (linear SVM learners)                   | AdaBoost, decision stump
Parametric                     | Yes (cross validation)            | Yes (cross validation)    | No                                                           | No
Complexity                     | High                              | Low                       | Increases with the number of weak classifiers                | Increases with the number of weak classifiers
Training error                 | Small but nonzero                 | Nonzero                   | Approaches zero as the number of weak classifiers increases  | Approaches zero as the number of weak classifiers increases
Test error subject to training error | Non-indicative              | Indicative                | Semi-indicative                                              | Non-indicative
Margin distribution            | Minimum margin maximized          | Minimum margin maximized  | Increases even after the training error becomes zero        | Increases even after the training error becomes zero
Decision boundary              | Smooth                            | Smooth                    | Non-smooth                                                   | Non-smooth
Generalization power           | Low                               | High                      | Medium                                                       | Low
Computation scales with        | Data dimension, model complexity  | Data dimension, data size | Data size                                                    | Data size

Besides the conclusions stated in the table, we should also note that there should be a stopping time for AdaBoost, i.e., the point from which the test error starts to increase with the number of weak classifiers. The test margins are also indicative of this trend. The following figure shows the test margins at different iteration steps for dataset 5 under the decision stump model.

Fig. 6. Margin distribution on the test data of dataset 5, using decision stumps as weak learners, after 20, 50, 200 and 500 classifiers.

Note that the intercept of the cumulative margin distribution at zero increases with the number of weak classifiers, which means the proportion of negative margins increases, i.e., the test error increases. This result is consistent with Fig. 5.5(a).

The next steps of our work could include many aspects. One is to try AdaBoost using Radial Basis Function (RBF) SVMs as weak classifiers, as RBF SVMs are among the most popular classification methods. We can also apply cross-validation when determining the final AdaBoost learner; the idea is the same as in other learning algorithms, e.g. we may use k-fold cross-validation and choose the best AdaBoost learner.

References
[1] M. A. Hearst. Support vector machines. IEEE Intelligent Systems, pages 18-28, July/August 1998.
[2] Robert E. Schapire. A Brief Introduction to Boosting. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.
[3] Gunnar Ratsch, Sebastian Mika, et al. Constructing Boosting Algorithms from SVMs: An Application to One-Class Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 9, September 2002.
[4] Robert E. Schapire, Yoav Freund, et al. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. Proc. 14th International Conference on Machine Learning.
[5] Xuchun Li, Lei Wang, Eric Sung. AdaBoost with SVM-based Component Classifiers. IEEE Transactions on Systems, Man and Cybernetics, Part B.
[6] Robert E. Schapire, Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, 1997.
[6] Jiri Matas and Jan Sochman. Centre for Machine Perception, Czech Technical University, Prague. http://cmp.felk.cvut.cz.
[7] D. Pavlov, J. Mao, and B. Dom. Scaling-Up Support Vector Machines Using Boosting Algorithm. In Proceedings of ICPR'00, volume 2, pages 2219-2222, 2000.
[8] Hastie, Tibshirani and Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
[9] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[10] D.J. Newman, S. Hettich, C.L. Blake and C.J. Merz. UCI Repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Irvine, Department of Information and Computer Science.