
Journal of Machine Learning Research 8 (2007) 227-248 Submitted 11/05; Revised 10/06; Published 2/07

Noise Tolerant Variants of the Perceptron Algorithm

Roni Khardon RONI@CS.TUFTS.EDU

Gabriel Wachman GWACHM01@CS.TUFTS.EDU

Department of Computer Science, Tufts University, Medford, MA 02155, USA

Editor: Michael J. Collins

Abstract

A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants.

Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods

1. Introduction

The success of support vector machines (SVM) (Boser et al., 1992; Cristianini and Shawe-Taylor, 2000) has led to increasing interest in the perceptron algorithm. Like SVM, the perceptron algorithm has a linear threshold hypothesis and can be used with kernels, but unlike SVM, it is simple and efficient. Interestingly, despite a large number of theoretical developments, there is no result that explains why SVM performs better than perceptron, and similar convergence bounds exist for both (Graepel et al., 2000; Cesa-Bianchi et al., 2004). In practice, SVM is often observed to perform slightly better with significant cost in run time. Several on-line algorithms have been proposed which iteratively construct large margin hypotheses in the feature space, and therefore combine the advantages of large margin hypotheses with the efficiency of the perceptron algorithm. Other variants adapt the on-line algorithms to work in a batch setting, choosing a more robust hypothesis to be used instead of the last hypothesis from the on-line session. There is no clear study in the literature, however, that compares the performance of these variants or the possibility of combining them to obtain further performance improvements. We believe that this is important since these algorithms have already been used in applications with large data sets (e.g., Collins, 2002; Li et al., 2002) and a better understanding of what works and when can have a direct implication for future use. This paper provides such an experimental study where we focus on noisy data and more generally the "unrealizable case" where the data is simply not linearly separable. We chose some of the basic variants and experimented with them to explore their performance both with hindsight knowledge and in a statistically robust setting.

More concretely, we study two families of variants. The first explicitly uses the idea of hard and soft margin from SVM. The basic perceptron algorithm is mistake driven, that is, it only updates the hypothesis when it makes a mistake on the current example. The perceptron algorithm with margin (Krauth and Mezard, 1987; Li et al., 2002) forces the hypothesis to have some margin by making updates even when it does not make a mistake but where the margin is too small. Adding to this idea, one can mimic soft-margin versions of support vector machines within the perceptron algorithm that allow it to tolerate noisy data (e.g., Li et al., 2002; Kowalczyk et al., 2001). The algorithms that arise from this idea constrain the update function of the perceptron and limit the effect of any single example on the final hypothesis. A number of other variants in this family exist in the literature. Each of these performs margin based updates and has other small differences motivated by various considerations. We discuss these further in the concluding section of the paper.

The second family of variants tackles the use of on-line learning algorithms in a batch setting, where one trains the algorithm on a data set and tests its performance on a separate test set. In this case, since updates do not always improve the error rate of the hypothesis (e.g., in the noisy setting), the final hypothesis from the on-line session may not be the best to use. In particular, the longest survivor variant (Kearns et al., 1987; Gallant, 1990) picks the "best" hypothesis on the sequential training set. The voted perceptron variant (Freund and Schapire, 1999) takes a vote among hypotheses produced during training. Both of these have theoretical guarantees in the PAC learning model (Valiant, 1984). Again, other variants exist in the literature which modify the notion of "best" or the voting scheme among hypotheses and these are discussed in the concluding section of the paper.

It is clear that each member of the first family can be combined with each member of the second. In this paper we report on experiments with a large number of such variants that arise when combining some of margin, soft margin, and on-line to batch conversions. In addition to real world data, we used artificial data to check the performance in idealized situations across a spectrum of data types.

The experiments lead to the following conclusions: First, the perceptron with margin is the most successful variant. This is surprising since among the algorithms experimented with it is the only one not designed for noise tolerance. Second, the soft-margin variants on their own are weaker than the perceptron with margin, and combining soft-margin with the regular margin variant does not provide additional improvements.

The third conclusion is that in most cases the voted perceptron performs similarly to the perceptron with margin. The voted perceptron has the advantage that it does not require parameter selection (for the margin) that can be costly in terms of run time. Combining the two to get the voted perceptron with margin has the potential for further improvements but this occasionally degrades performance. Finally, both the voted perceptron and the margin variant reduce the deviation in accuracy in addition to improving the accuracy. This is an important property that adds to the stability and robustness of the algorithms.

The rest of the paper is organized as follows. The next section reviews all the algorithms and our basic settings for them. Section 3 describes the experimental evaluation. We performed two kinds of experiments. In "parameter search" we report the best results obtained with any parameter setting. This helps set the scope and evaluate the potential of different algorithms to improve performance and provides insight about their performance, but it does not give statistically reliable results. In "parameter optimization" the algorithms automatically select the parameters and the performance can be interpreted statistically. The concluding section further discusses the results and puts related work in context, leading to directions for future work.

2. Algorithms

This section provides details of all the algorithms used in the experiments and our basic settings for them.

2.1 The Perceptron Algorithm

The perceptron algorithm (Rosenblatt, 1958) takes as input a set of training examples in Rⁿ with labels in {−1, 1}. Using a weight vector, w ∈ Rⁿ, initialized to 0ⁿ, and a threshold, θ, it predicts the label of each training example x to be y = sign(⟨w, x⟩ − θ). The algorithm adjusts w and θ on each misclassified example by an additive factor. The algorithm is summarized in Figure 1 following the form by Cristianini and Shawe-Taylor (2000). Unlike Cristianini and Shawe-Taylor (2000), we allow for a non-zero value of θInit and we allow C to be initialized to any value.

Input: a set of examples and their labels Z = ((x1, y1), . . . , (xm, ym)) ∈ (Rⁿ × {−1, 1})ᵐ, and η, θInit, C

• Initialize w ← 0ⁿ and θ ← θInit
• for every training epoch:
  • for every xj ∈ X:
    – y ← sign(⟨w, xj⟩ − θ)
    – if (y ≠ yj)
      ∗ w ← w + η yj xj
      ∗ θ ← θ + η yj C

Figure 1: The Basic Perceptron Algorithm

The "learning rate" η controls the extent to which w can change on a single update. Note that if θ is initialized to 0 then η has no effect since sign(⟨w, xi⟩ − θ) = yi iff sign(η(⟨w, xi⟩ − θ)) = yi. However, the initial choice of θ is important as it can be significant in early iterations of the algorithm. Note also that there is only one "degree of freedom" between η and θInit since scaling both by the same factor does not alter the result. So one could fix η = 1 and vary only θInit. But we keep the general form for convenience.
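To make the figure concrete, the following is a minimal NumPy sketch of the primal perceptron of Figure 1. The function names and default values are ours, X is assumed to be an (m, n) array of examples with labels y in {−1, +1}, and the threshold update keeps the sign convention written in the figure.

import numpy as np

def train_perceptron(X, y, epochs=100, eta=0.1, theta_init=0.0, C=0.0):
    """Primal perceptron of Figure 1: X is an (m, n) array, y holds labels in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    w = np.zeros(n)
    theta = theta_init
    for _ in range(epochs):
        for j in range(m):
            y_hat = 1 if np.dot(w, X[j]) - theta > 0 else -1
            if y_hat != y[j]:                      # mistake-driven update
                w += eta * y[j] * X[j]
                theta += eta * y[j] * C            # threshold update, sign as written in Figure 1
    return w, theta

def predict(w, theta, X):
    return np.where(np.asarray(X) @ w - theta > 0, 1, -1)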

When the data are linearly separable via some hyperplane (w, θ), the margin is defined as γ = min_{1≤i≤m} yi(⟨w, xi⟩ − θ). When ‖w‖ = 1, γ is the minimum Euclidean distance of any point in the data set to (w, θ). If the data are linearly separable, and θ is initialized to 0, the perceptron algorithm is guaranteed to converge in at most (R/γ)² iterations (Novikoff, 1962; Cristianini and Shawe-Taylor, 2000), where R = max_{1≤i≤m} ‖xi‖.

In the case of non-separable data, the extent to which a data point fails to have margin γ via the hyperplane w can be quantified by a slack variable ξi = max(0, γ − yi(⟨w, xi⟩ − θ)). Observe that when ξi = 0, the example xi has margin at least γ via the hyperplane defined by (w, θ). The perceptron is guaranteed to make no more than (2(R + D)/γ)² mistakes on m examples, for any w, γ > 0, where D = √(∑_{i=1}^m ξi²) (Freund and Schapire, 1999; Shalev-Shwartz and Singer, 2005). Thus one can expect some robustness to noise.

It is well known that the perceptron can be re-written in "dual form" whereby it can be used with kernel functions (Cristianini and Shawe-Taylor, 2000). The dual form of the perceptron is obtained by observing that the weight vector can be written as w = ∑_{i=1}^m η αi yi xi, where αi is the number of mistakes that have been made on example i. Subsequently, a new example xj is classified according to sign(SUM − θ) where SUM = ∑_i η αi yi ⟨xi, xj⟩. Replacing the inner product with a kernel function yields the kernel perceptron. All the perceptron variants we discuss below have dual form representations. However, the focus of this paper is on noise tolerance and this is independent of the use of primal or dual forms and the use of kernels. We therefore present all the algorithms in primal form.
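Although the paper presents the algorithms in primal form, the dual form just described is easy to sketch. The illustration below is our own: the polynomial kernel is only an example (degree-4 polynomial kernels are used for the large data sets later in the paper), and the threshold is held fixed for simplicity.

import numpy as np

def poly_kernel(a, b, degree=4):
    # illustrative kernel choice
    return (1.0 + np.dot(a, b)) ** degree

def train_dual_perceptron(X, y, kernel=poly_kernel, epochs=10, eta=0.1, theta=0.0):
    """Dual perceptron: alpha[i] counts the mistakes made on example i (theta kept fixed here)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    m = len(X)
    alpha = np.zeros(m)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])  # Gram matrix
    for _ in range(epochs):
        for j in range(m):
            s = np.sum(eta * alpha * y * K[:, j])     # SUM = sum_i eta * alpha_i * y_i * K(x_i, x_j)
            if (1 if s - theta > 0 else -1) != y[j]:
                alpha[j] += 1                         # a mistake on x_j strengthens its coefficient
    return alpha

def dual_predict(alpha, X, y, x_new, kernel=poly_kernel, eta=0.1, theta=0.0):
    s = sum(eta * alpha[i] * y[i] * kernel(X[i], x_new) for i in range(len(X)))
    return 1 if s - theta > 0 else -1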

2.2 The λ-trick

The λ-trick (Kowalczyk et al., 2001; Li et al., 2002) attempts to minimize the effect of noisy examples during training, similar to the L2 soft margin technique used with support vector machines (Cristianini and Shawe-Taylor, 2000). We classify example xj according to sign(SUM − θ) where

SUM = ⟨w, xj⟩ + Ij yj λ ‖xj‖²     (1)

and where Ij is an indicator variable such that Ij = 1 if xj was previously classified incorrectly and Ij = 0 otherwise.

Thus during training, if a mistake has been made on xj then in future iterations we increase SUM artificially by a factor of λ‖xj‖² when classifying xj but not when classifying other examples. A high value of λ can make the term yjλ‖xj‖² dominate the sum and make sure that xj does not lead to an update, so it is effectively ignored. In this way λ can prevent a large number of updates on noisy (as well as other) examples, limiting their effect.

We note that this technique is typically presented using an additive λ instead of λ‖xj‖². While there is no fundamental difference, adding λ‖xj‖² is more convenient in the experiments since it allows us to use the same values of λ for all data sets (since the added term is scaled relative to the data).
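A small sketch of how Equation (1) would enter a training loop. The helper name is ours, and the surrounding loop (weight vector w, per-example mistake counts, threshold θ) is assumed to exist as in Figure 1.

import numpy as np

def lambda_trick_score(w, x_j, y_j, made_mistake_before, lam):
    """SUM of Equation (1): artificially raise the signed score of an example that was
    already misclassified, so repeated, possibly noisy, examples stop forcing updates."""
    indicator = 1.0 if made_mistake_before else 0.0
    return np.dot(w, x_j) + indicator * y_j * lam * np.dot(x_j, x_j)

# inside the training loop of Figure 1 one would use, e.g.
#   s = lambda_trick_score(w, X[j], y[j], mistakes[j] > 0, lam)
#   and update only if sign(s - theta) differs from y[j]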

2.3 The α-bound

This variant is motivated by the L1 soft margin technique used with support vector machines (Cristianini and Shawe-Taylor, 2000). The α-bound places a bound α on the number of mistakes the algorithm can make on a particular example, such that when the algorithm makes a mistake on some xi, it does not update w if more than α mistakes have already been made on xi. As in the case of the λ-trick, the idea behind this procedure is to limit the influence of any particular noisy example on the hypothesis. Intuitively, a good setting for α is some fraction of the number of training iterations. To see that, assume that the algorithm has made α mistakes on a particular noisy example xk. In all subsequent training iterations, xk never forces an update, whereas the algorithm may continue to increase the influence of other non-noisy examples. If the ratio of non-noisy examples to noisy examples is high enough the algorithm should be able to bound the effect of noisy examples in the early training iterations while leaving sufficient unbounded non-noisy examples to form a good hypothesis in subsequent iterations. While this is a natural variant, we are not aware of any experimental results using it. Note that in order for the α-bound to work effectively, the algorithm must perform a high enough number of training iterations.
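The α-bound itself amounts to a one-line guard on the update. A sketch, with hypothetical names for the per-example mistake count and the bound:

def alpha_bound_allows_update(y_hat, y_j, mistakes_j, alpha_bound):
    """Update only on a mistake, and only while fewer than alpha_bound mistakes
    have been made on this example; afterwards the example is effectively frozen."""
    return (y_hat != y_j) and (mistakes_j < alpha_bound)

# inside the training loop of Figure 1 one would write, e.g.
#   if alpha_bound_allows_update(y_hat, y[j], mistakes[j], alpha_bound):
#       w += eta * y[j] * X[j]; mistakes[j] += 1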

2.4 Perceptron Using Margins

The classical perceptron attempts to separate the data but has no guarantees on the separation margin obtained. The Perceptron Algorithm using Margins (PAM) (Krauth and Mezard, 1987) attempts to establish such a margin, τ, during the training process. Following work on support vector machines (Boser et al., 1992; Cristianini and Shawe-Taylor, 2000) one may expect that providing the perceptron with higher margin will add to the stability and accuracy of the hypothesis produced.

To establish the margin, instead of only updating on examples for which the classifier makes a mistake, PAM updates on xj if

yj(SUM − θ) < τ

where SUM is as in Equation (1). Notice that this includes the case of a mistake, where yj(SUM − θ) < 0, and the case of correct classification with low margin, when 0 ≤ yj(SUM − θ) < τ. In this way, the algorithm "establishes" some minimal separation of the data. Notice that the weight vector w is not normalized during the learning process and its norm can increase with updates. Therefore, the constraint imposed by τ becomes less restrictive as learning progresses, and the normalized margin is not τ. When the data are linearly separable and τ < γopt, PAM finds a separating hyperplane with a margin that is guaranteed to be at least γopt τ / (√8 (ηR² + τ)), where γopt is the maximum possible margin (Krauth and Mezard, 1987; Li et al., 2002).¹

As above, it is convenient to make the margin parameter data set independent so we can use the same values across data sets. To facilitate this we replace the above and update if

yj(SUM − θ) < τ θInit

so that τ can be thought of as measured in units of θInit.

A variant of PAM exists in which a different value of τ is chosen for positive examples than for negative examples (Li et al., 2002). This seems to be important when the class distribution is skewed. We do not study that variant in this paper.

1. Other variants (e.g., Gentile, 2001; Tsampouka and Shawe-Taylor, 2005; Shalev-Shwartz and Singer, 2005) do normalize the weight vector and thus have better margin guarantees.
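As a concrete illustration, the PAM update condition can be written as follows. This is a sketch with our own names; SUM here omits the λ term of Equation (1), which would simply be added to the dot product.

import numpy as np

def pam_needs_update(w, theta, x_j, y_j, tau, theta_init):
    """Perceptron with margin: update on a mistake or on a correct prediction whose
    margin is below tau, with tau measured in units of theta_init as in the text."""
    s = np.dot(w, x_j)                        # SUM, without the lambda term
    return y_j * (s - theta) < tau * theta_init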

2.5 Longest Survivor and Voted Perceptron

The classical perceptron returns the last weight vector w, that is, the one obtained after all training has been completed, but this may not always be useful, especially if the data is noisy. This is a general issue that has been studied in the context of using on-line algorithms that expect one example at a time in a batch setting, where a set of examples is given for training and one hypothesis is used at the end to classify all future instances. Several variants exist that handle this issue. In particular, Kearns et al. (1987) show that the longest survivor hypothesis, that is, the one that made the largest number of correct predictions during training in the on-line setting, is a good choice in the sense that one can provide guarantees on its performance in the PAC learning model (Valiant, 1984). Several variations of this idea were independently proposed under the name of the pocket algorithm and empirical evidence for their usefulness was provided (Gallant, 1990).

The voted perceptron (Freund and Schapire, 1999) assigns each vector a "vote" based on the number of sequential correct classifications by that weight vector. Whenever an example is misclassified, the voted perceptron records the number of correct classifications made since the previous misclassification, assigns this number to the current weight vector's vote, saves the current weight vector, and then updates as normal. After training, all the saved vectors are used to classify future examples and their classifications are combined using their votes. At first sight the voted-perceptron seems to require expensive calculation for prediction. But as pointed out by Freund and Schapire (1999), the output of the weight vector resulting from the first k mistakes can be calculated from the output of the weight vector resulting from the first k−1 mistakes in constant time. So when using the dual perceptron the prediction phase of the voted perceptron is not substantially more expensive than the prediction phase of the classical perceptron, although it is more expensive in the primal form.
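The constant-time observation can be sketched in the dual form as follows: each hypothesis differs from the previous one by a single mistake, so its score on a test point needs only one extra kernel evaluation. The data layout (lists of mistake examples, labels and votes) and the fixed threshold are our simplifying assumptions.

def voted_predict(mistake_examples, mistake_labels, votes, x_new, kernel, eta=0.1, theta=0.0):
    """Voted prediction in dual form.  mistake_examples[t] / mistake_labels[t] describe the
    (t+1)-th mistake; votes[t] is the number of correct predictions survived by the weight
    vector produced by the first t mistakes (votes[0] belongs to the zero vector)."""
    score = 0.0                                         # score of the zero weight vector
    weighted_sum = votes[0] * (1 if score - theta > 0 else -1)
    for t in range(1, len(votes)):
        # hypothesis t adds one term to the score of hypothesis t-1
        score += eta * mistake_labels[t - 1] * kernel(mistake_examples[t - 1], x_new)
        weighted_sum += votes[t] * (1 if score - theta > 0 else -1)
    return 1 if weighted_sum > 0 else -1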

When the data are linearly separable and given enough iterations, both these variants will converge to a hypothesis that is very close to the simple perceptron algorithm. The last hypothesis will predict correctly on all examples and indeed its vote will be the largest vote among all hypotheses used. When the data are not linearly separable the quality of the hypothesis may fluctuate during training as noisy examples are encountered.

2.6 Algorithms Summary

Figure 2 summarizes the various algorithms and prediction strategies in primal form. The algorithms have corresponding dual forms and kernelized versions that we address in the experimental sections.

2.7 Multi-Class Data

Some of the data sets we experiment with have more than two labels. For these we used a standard solution as follows. For each label l, we train a classifier to separate examples labeled l (the positive examples) from other examples (all of which become negative examples). This can be done on-line where we train all the classifiers "in parallel". During testing we calculate the weight given by each classifier (before the sign function is applied) and choose the one maximizing the weight.
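A sketch of this one-vs-rest scheme with our own helper names; train_binary stands for any of the binary training procedures above and is assumed to return a pair (w, theta).

import numpy as np

def train_one_vs_rest(X, labels, classes, train_binary):
    """One classifier per class: label l against the rest (Section 2.7).
    train_binary(X, y) is assumed to return (w, theta) for labels y in {-1, +1}."""
    classifiers = {}
    for l in classes:
        y = np.where(np.asarray(labels) == l, 1, -1)
        classifiers[l] = train_binary(X, y)
    return classifiers

def predict_one_vs_rest(classifiers, x):
    # choose the label whose classifier gives the largest pre-sign score <w, x> - theta
    return max(classifiers, key=lambda l: np.dot(classifiers[l][0], x) - classifiers[l][1])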

3. Experimental Evaluation

We ran two sets of experiments with the algorithms described above. In one set of experiments we searched through a pre-determined set of values of τ, α, and λ by running each of the classical, longest survivor, and voted perceptron using 10-fold cross-validation and parameterized by each value of τ, α, and λ as well as each combination of τ×α and τ×λ. This first set of experiments is called parameter search. The purpose of the parameter search experiments was to give us a comprehensive view of the effects of the various parameters. This can show whether a method has any chance of improving performance since the experiments give us hindsight knowledge. The experiments can also show general patterns and trends in the parameter landscape, again giving insight into the performance of the methods. Notice that parameter search cannot be used to indicate good values of parameters as this would be hand-tuning the algorithm based on the test data. However, it can guide in developing methods for automatic selection of parameters.

TRAINING: Input a set of examples and their labels Z = ((x1, y1), . . . , (xm, ym)) ∈ (Rⁿ × {−1, 1})ᵐ, and θInit, η, C, α-bound, λ, τ

• Initialize α ← 0ᵐ, k ← 0, w0 ← 0ⁿ, tally ← 0, best tally ← 0, θ0 ← θInit, θl.s. ← θInit
• for every training iteration
  • for every zj ∈ Z:
    – SUM ← ⟨wk, xj⟩ + I(αj ≠ 0) yj λ ‖xj‖²
    – Predict:
      ∗ if SUM < θk − τ then y = −1
      ∗ else if SUM > θk + τ then y = 1
      ∗ else y = 0    // This forces an update
    – if (y ≠ yj) AND (αj < α-bound)
      ∗ wk+1 ← wk + η yj xj
      ∗ αj ← αj + 1
      ∗ Update "mistakes" data structure
      ∗ θk+1 ← θk + η yj C
      ∗ votek+1 ← 0
      ∗ k ← k + 1
      ∗ tally ← 0
    – else if (y = yj)
      ∗ votek ← votek + 1
      ∗ tally ← tally + 1
      ∗ if (tally > best tally)
        · best tally ← tally
        · wl.s. ← wk
        · θl.s. ← θk

PREDICTION: To predict on a new example xm+1:

• CLASSICAL:
  – SUM ← ⟨wk, xm+1⟩
  – y ← sign(SUM − θk)
• LONGEST SURVIVOR:
  – SUM ← ⟨wl.s., xm+1⟩
  – y ← sign(SUM − θl.s.)
• VOTED:
  – for i ← 1 to k
    ∗ SUM ← ⟨wi, xm+1⟩
    ∗ y ← sign(SUM − θi)
    ∗ VOTES ← VOTES + votei · y
  – yfinal ← sign(VOTES)

Figure 2: Illustration of All Algorithm Variants
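For concreteness, the following NumPy sketch implements the training loop of Figure 2 together with the three prediction rules. It is our own illustration rather than the authors' code: parameter defaults follow the settings described in Section 3.2, τ is interpreted in units of θInit as in Section 2.4, the threshold update keeps the sign written in the figure, and bookkeeping (storing every intermediate weight vector) favours clarity over memory efficiency.

import numpy as np

def train_variants(X, y, epochs=100, eta=0.1, theta_init=None, C=None,
                   tau=0.0, lam=0.0, alpha_bound=float('inf')):
    """Sketch of the combined training procedure of Figure 2 (primal form)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m, n = X.shape
    sq_norms = np.einsum('ij,ij->i', X, X)            # ||x_j||^2 for every example
    if theta_init is None:
        theta_init = float(np.mean(sq_norms))         # theta_Init = avg <x_i, x_i> (Section 3.2)
    if C is None:
        C = theta_init
    w, theta = np.zeros(n), theta_init
    mistakes = np.zeros(m)                            # alpha_j: mistakes made on example j
    ws, thetas, votes = [w.copy()], [theta], [0]      # hypothesis k is (ws[k], thetas[k])
    best_tally, w_ls, theta_ls = 0, w.copy(), theta
    tally = 0
    margin = tau * theta_init                         # tau in units of theta_Init (Section 2.4)
    for _ in range(epochs):
        for j in range(m):
            s = np.dot(w, X[j]) + (mistakes[j] > 0) * y[j] * lam * sq_norms[j]
            if s < theta - margin:
                y_hat = -1
            elif s > theta + margin:
                y_hat = 1
            else:
                y_hat = 0                             # inside the margin: force an update
            if y_hat != y[j] and mistakes[j] < alpha_bound:
                w = w + eta * y[j] * X[j]
                theta = theta + eta * y[j] * C        # threshold update as written in Figure 2
                mistakes[j] += 1
                ws.append(w.copy())
                thetas.append(theta)
                votes.append(0)
                tally = 0
            elif y_hat == y[j]:
                votes[-1] += 1
                tally += 1
                if tally > best_tally:
                    best_tally, w_ls, theta_ls = tally, w.copy(), theta
    return {'last': (w, theta), 'longest': (w_ls, theta_ls), 'all': (ws, thetas, votes)}

def predict_classical(model, x):
    w, theta = model['last']
    return 1 if np.dot(w, x) - theta > 0 else -1

def predict_longest(model, x):
    w, theta = model['longest']
    return 1 if np.dot(w, x) - theta > 0 else -1

def predict_voted(model, x):
    ws, thetas, votes = model['all']
    total = sum(v * (1 if np.dot(wk, x) - tk > 0 else -1)
                for wk, tk, v in zip(ws, thetas, votes))
    return 1 if total > 0 else -1

Setting tau = 0, lam = 0 and leaving alpha_bound at infinity recovers the basic perceptron of Figure 1; the 'longest' and 'all' entries support the longest-survivor and voted predictions without retraining.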


In the second set of experiments, we used the same set of values as in the first experiment, but using a method for automatic selection of parameters. In particular we used a "double cross validation" where in each fold of the cross validation (1) one uses parameter search on the training data only, using another level of cross validation, (2) picks values of parameters based on this search, (3) trains on the complete training set for the fold using these values, and (4) evaluates on the test set. We refer to this second set of experiments as parameter optimization. This method is expensive to run as it requires running all combinations of parameter values in each internal fold of the outer cross validation. So if both validations use 10 folds then we run the algorithm 100 times per parameter setting. This sets a strong limitation on the number of parameter variations that can be tried. Nonetheless this is a rigorous method of parameter selection.
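A sketch of this double cross validation protocol. The function names are ours, and train_fn and accuracy_fn stand for any training routine and accuracy measure, which are assumed to be supplied by the caller.

import numpy as np

def double_cross_validation(X, y, param_grid, train_fn, accuracy_fn,
                            outer_k=10, inner_k=10, seed=0):
    """In each outer fold, an inner cross-validation over the training portion picks the
    parameter setting, the model is retrained on the full training portion with those
    parameters, and accuracy is measured on the held-out fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    outer_folds = np.array_split(idx, outer_k)
    outer_scores = []
    for f in range(outer_k):
        test_idx = outer_folds[f]
        train_idx = np.concatenate([outer_folds[g] for g in range(outer_k) if g != f])
        # inner search over the training portion only
        best_params, best_score = None, -np.inf
        inner_folds = np.array_split(train_idx, inner_k)
        for params in param_grid:
            scores = []
            for g in range(inner_k):
                val = inner_folds[g]
                tr = np.concatenate([inner_folds[h] for h in range(inner_k) if h != g])
                model = train_fn(X[tr], y[tr], params)
                scores.append(accuracy_fn(model, X[val], y[val]))
            if np.mean(scores) > best_score:
                best_params, best_score = params, np.mean(scores)
        # retrain on the full outer training portion with the selected parameters
        model = train_fn(X[train_idx], y[train_idx], best_params)
        outer_scores.append(accuracy_fn(model, X[test_idx], y[test_idx]))
    return np.mean(outer_scores), outer_scores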

Both sets of experiments were run on randomly-generated artificial data and real-world data from the University of California at Irvine (UCI) repository and other sources. The purpose of the artificial data was to simulate an ideal environment that would accentuate the strengths and weaknesses of each algorithm variant. The other data sets explore the extent to which this behavior is exhibited in real world problems.

For further comparison, we also ran SVM on the data sets using the L1-norm soft margin and L2-norm soft margin. We used SVM Light (Joachims, 1999) and only ran parameter search.

3.1 Data Set Selection and Generation

We have experimented with 10 real world data sets of varying size and 150 artificial data sets composed of 600 examples each.

We first motivate the nature of the artificial data used by discussing the following four idealized scenarios. The simplest, type 1, is linearly separable data with very small margin. Type 2 is linearly separable data with a larger, user-defined margin. For type 3 we first generate data as in type 1, and then add random class noise by randomly reversing the labels of a given fraction of examples. For type 4, we generate data as in type 2, and then add random class noise. One might expect that the basic perceptron algorithm will do fine on type 1 data, the perceptron with margin will do particularly well on type 2 data, that noise tolerant variants without margin will do well on type 3, and that some combination of noise tolerant variant with margin will be best for type 4 data. However, our experiments show that the picture is more complex; preliminary experiments confirmed that the expected behavior is observed, except that the perceptron with margin performed well on data of type 3 as well, that is, when the "natural margin" was small and the data was not separable due to noise. In the following, we report on experiments with artificial data where the margin and noise levels are varied so we effectively span all four different types.

Concretely the data was generated as follows. We first specify parameters for the number of features, the noise rate and the required margin size. Given these, each example xi is generated independently and each attribute in an example is generated independently using an integer value in the range [−10, 10]. We generated the weight vector, w, by flipping a coin for each example and adding it to the weight vector on a head. We chose θ as the average of (1/2)|⟨w, xi⟩| over all xi. We measured the actual margin of the examples with respect to the weight vector and then discarded any examples xi such that |⟨w, xi⟩ − θ| < M ∗ θ, where M is the margin parameter specified. For the noisy settings, for each example we randomly switched the label with probability equal to the desired noise rate. In the tables of results presented below, f stands for the number of features, M stands for the margin size, and N stands for the noise rate. We generated data sets with parameters (f, M, N) ∈ {50, 200, 500} × {0.05, 0.1, 0.25, 0.5, 0.75} × {0, 0.05, 0.1, 0.15, 0.25}, and for each parameter setting we generated two data sets, for a total of 150 data sets.
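A sketch of this generation procedure, under the assumption that one simply draws a fixed number of raw examples and keeps those that survive the margin filter (the paper does not spell out this detail); the function name and the seed argument are ours.

import numpy as np

def make_artificial_data(num_examples, num_features, margin_M, noise_N, seed=0):
    """Sketch of the artificial data generation of Section 3.1."""
    rng = np.random.default_rng(seed)
    X = rng.integers(-10, 11, size=(num_examples, num_features)).astype(float)
    # target hyperplane: sum of a random (coin-flip) subset of the examples
    coins = rng.integers(0, 2, size=num_examples).astype(bool)
    w = X[coins].sum(axis=0)
    theta = 0.5 * np.mean(np.abs(X @ w))
    # keep only examples whose margin w.r.t. (w, theta) is at least M * theta
    keep = np.abs(X @ w - theta) >= margin_M * theta
    X = X[keep]
    y = np.where(X @ w - theta > 0, 1, -1)
    # random classification noise: flip each label with probability N
    flip = rng.random(len(y)) < noise_N
    y[flip] = -y[flip]
    return X, y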

For real world data we first selected two-class data sets from the UCI Machine Learning Repository (Newman et al., 1998) that have been used in recent comparative studies or in recent papers on linear classifiers (Cohen, 1995; Dietterich et al., 1996; Dietterich, 2000; Garg and Roth, 2003). Since we require numerical attributes, any nominal attribute in these data sets was translated to a set of binary attributes, each being an indicator function for one of the values. Since all these data sets have a relatively small number of examples we added three larger data sets to strengthen statistically our conclusions: "Adult" from UCI (Newman et al., 1998), and "MNIST"² and "USPS",³ the 10-class character recognition data sets. Due to their size, however, for the "USPS," "MNIST," and "Adult" data sets, there is no outer cross-validation in parameter optimization. For ease of comparison to other published results, we use the 7291/2007 training/test split for "USPS", and a 60000/10000 split for "MNIST".

2. http://yann.lecun.com/exdb/mnist/
3. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

The data sets used and their characteristics after the nominal-to-binary feature transformation are summarized in Table 1. The column labeled "Baseline" indicates the percentage of the majority class.

Name                     # features*  # examples  Baseline
Adult                    105          32561       75.9
Breast-cancer-wisconsin  9            699         65.5
Bupa                     6            345         58
USPS                     256          9298        N/A
Wdbc                     30           569         62.7
Crx                      46           690         55.5
Ionosphere               34           351         64.4
Wpbc                     33           198         76.3
sonar.all-data           60           208         53.4
MNIST                    784          70,000      N/A
*after preprocessing

Table 1: UCI and Other Natural Data Set Characteristics

3.2 Exploratory Experiments and General Setup

The algorithms described above have several parameters that can affect their performance dramatically. In this paper we are particularly interested in studying the effect of parameters related to noise tolerance, and therefore we fix the value of θInit, η, C, and the number of training iterations. Prior to fixing these values, we ran preliminary experiments in which we varied these parameters. In addition we ran experiments where we normalized the example vectors. The results showed that while the performance of the algorithms overall was different in these settings, the relationship between the performance of individual algorithms seems to be stable across these variations.

Figure 3: Example Learning Curve for UCI Data Set (accuracy versus the number of training iterations for the Voted, Last, and Longest hypotheses on the promoters data set; plot not reproduced)

We set θInit = avg(⟨xi, xi⟩), initializing the threshold at the same scale of inner products. Combined with a choice of η = 0.1, this makes sure that a few iterations should be able to guarantee that an example is classified correctly given no other changes to the hypothesis. We also set C = avg(⟨xi, xi⟩) = θInit. This is similar to the version analyzed by Cristianini and Shawe-Taylor (2000) except that we use the average instead of the maximum norm. Using this setting we essentially maintain θ in units of θInit.

As explained above, the number of iterations must be sufficiently high to allow the α parameter to be effective, as well as to allow the weight vector to achieve some measure of stability. Typically a small number of iterations is sufficient, and preliminary experiments illustrated by the curve in Figure 3 showed that 100 iterations are more than enough to allow all the algorithms to converge to a stable error rate. Therefore, except where noted below, we report results for 100 iterations.

For the large data sets, we reduced the number of training iterations and increased η accordingly in order to run the experiments in a reasonable amount of time; for "Adult" we set η = 0.4 with 5 training iterations and for "MNIST" we set η = 1 with 1 training iteration. For "USPS," we used η = 0.1 and 100 iterations for parameter search, and η = 0.1 and 10 iterations for parameter optimization.

The parameters of interest in our experiments are τ, α, λ that control the margin and soft margin variants. Notice that we presented these so that their values can be fixed in a way that is data set independent. We used values for τ ∈ {0, 0.125, 0.25, 0.5, 1, 2, 4}, α ∈ {∞, 80, 60, 40, 20, 10, 5}, and λ ∈ {0, 0.125, 0.25, 0.5, 1, 2, 4}. Notice that the values as listed from left to right vary from no effect to a strong effect for each parameter. We ran parameter search and parameter optimization over all of these values, as well as all combinations τ×λ and τ×α. Since parameter optimization over combined values is particularly expensive we have also experimented with a variant that first searches for a good τ value and then searches for a value of α (denoted τ→α) and a variant that does likewise with τ and λ (denoted τ→λ).

We did not perform any experiments involving α on "MNIST" or "Adult" as their size required too few iterations to justify any reasonable α-bound.


               V > C  V > LS  LS > C  LS > V  C > LS  C > V  V = LS = C
Noise = 0      0      0       0       0       0       0      30
Noise = 0.05   12     10      11      3       0       2      16
Noise = 0.1    15     15      14      3       3       3      12
Noise = 0.15   16     14      13      5       6       3      10
Noise = 0.25   16     12      13      8       7       4      10

Table 2: Noise Percentage vs. Dominance: V = Voted, C = Classical, LS = Longest Survivor

Finally we performed a comparison of the classical perceptron, the longest survivor and the voted perceptron algorithms without the additional variants. Table 2 shows a comparison of the accuracies obtained by the algorithms over the artificial data. We ignore actual values but only report whether one algorithm gives higher accuracy or whether they tie (the variance in actual results is quite large). One can see that with higher noise rate the voted perceptron and longest survivor improve the performance over the base algorithm. Over all 150 artificial data sets, the voted perceptron strictly improves performance over the classical perceptron in 59 cases, and ties or improves in 138 cases. Using the distribution over our artificial data sets one can calculate a simple weak statistical test that supports the hypothesis that using the voted algorithm does not hurt accuracy, and can often improve it.

Except where noted, all the results reported below give average accuracy in 10-fold cross-validation experiments. To avoid any ordering effects of the data, the training set is randomly permuted before running the experiments.

3.3 Parameter Search

The parameter search experiments reveal several interesting aspects. We observe that in general the variants are indeed helpful on the artificial data since the performance increases substantially from the basic version. The numerical results are shown in Tables 3 and 4 and discussed below. Before showing these we discuss the effects of single parameters. Figure 4 plots the effect of single parameters for several data sets. For the artificial data set shown, the accuracy is reasonably well-behaved with respect to the parameters and good performance is obtained in a non negligible region; this is typical of the results from the artificial data. Data obtained for the real world data sets show somewhat different characteristics. In some data sets little improvement is obtained with any variant or parameter setting. In others, improvement was obtained for some parameter values but the regions were not as large, implying that automatic parameter selection may not be easy. Nonetheless it appears that when improvement is possible, τ on its own was quite effective when used with the classical perceptron. Notice that τ is consistently effective across all data sets; λ and α do not help on all data sets and λ sometimes leads to a drop in performance. As one might expect, our preliminary experiments also showed that very large values of τ harm performance significantly. These are not shown in the graphs as we have limited the range of τ in the experiments.

Figure 4: Parameter Search on Artificial and Real-world Data. Each panel plots accuracy (%) for the Voted, Last, and Longest hypotheses against a single parameter: α, λ, and τ on an artificial data set (f = 50, N = 0.05, M = 0.05), on promoters, and on USPS, and λ and τ on MNIST. (plots not reproduced)

Tables 3 and 4 summarize the results of parameter search experiments on the real world data and some of the artificial data, respectively. For each data set and algorithm the tables give the best accuracy that can be achieved with parameters in the range tested. This is useful as it can indicate whether an algorithm has a potential for improvement, in that for some parameter setting it gives good performance. In order to clarify the contribution of different parameters, each column with parameters among τ, α, λ includes all values for the parameter except the non-active value. For example, any accuracy obtained in the τ column is necessarily obtained with a value τ ≠ 0. The column labeled "Nothing" shows results when all parameters are inactive.

Several things can be observed in the tables. Consider first the margin parameters. We can see that τ is useful even in data sets with noise; this is obvious both in the noisy artificial data sets and in the real-world data sets, all of which are inseparable in the native feature space. We can also see that α and λ do improve performance in a number of cases. However, they are less effective in general than τ, and do not provide additional improvement when combined with τ.

                         Baseline  Nothing  τ     λ     α     τ×λ   τ×α
breast-cancer-wisconsin
  Last                   65.5      90.6     96.8  97.2  96.9  97.3  97.2
  Longest                          96.9     97.0  97.2  97.0  97.3  97.2
  Voted                            96.9     96.8  97.2  96.9  97.3  97.2
bupa
  Last                   58        57.5     71.8  64.1  69.2  71.5  67.8
  Longest                          64.1     58.2  64.8  68.4  60.9  61.2
  Voted                            68.9     65.9  67.7  70.7  67.4  66.0
wdbc
  Last                   62.7      92.4     93.2  92.8  93.3  93.9  92.6
  Longest                          92.4     93.2  92.6  92.4  92.8  92.8
  Voted                            92.3     92.0  93.2  92.4  93.2  92.4
crx
  Last                   55.5      62.5     68.6  65.5  66.7  69.0  66.7
  Longest                          55.1     63.6  65.5  63.5  65.5  65.2
  Voted                            64.9     65.4  65.5  66.7  65.5  66.4
promoters
  Last                   50        78.8     92.8  82.1  78.8  94.4  93.8
  Longest                          78.8     92.8  82.1  78.8  94.4  94.4
  Voted                            78.8     93.4  82.1  78.8  94.4  93.8
ionosphere
  Last                   64.4      86.6     87.5  86.9  87.7  88.6  87.2
  Longest                          87.2     87.5  87.8  88.0  88.3  87.7
  Voted                            88.0     87.7  87.5  88.6  88.9  87.5
wpbc
  Last                   76.3      77.3     76.4  80.6  76.9  79.0  76.4
  Longest                          77.2     76.4  77.4  78.5  76.9  76.4
  Voted                            78.8     76.4  80.6  77.4  78.5  76.4
sonar.all-data
  Last                   53.4      71.9     74.6  72.5  77.1  75.8  79.1
  Longest                          75.3     77.1  73.3  77.2  77.7  78.7
  Voted                            75.1     77.2  73.8  77.1  78.7  77.7
USPS
  Last                   N/A       91.5     94.1  90.4  92.8  94.3  94.1
  Longest                N/A       86.9     93.9  90.4  92    93.9  93.9
  Voted                  N/A       93.5     94.2  93.4  93.6  94.3  94.3
MNIST
  Last                   N/A       85.2     89.2  85.2  —     89.2  —
  Longest                N/A       81.7     88.6  81.7  —     88.6  —
  Voted                  N/A       90.2     90.6  90.2  —     90.6  —

Table 3: Parameter Search on UCI and Other Data Sets

Looking next at the on-line to batch conversions we see that the differences between the basic algorithm, the longest survivor and the voted perceptron are noticeable without margin based variants. For the artificial data sets this only holds for one group of data sets (f = 50), the one with the highest ratio of number of examples to number of features (12:1).⁴ The longest survivor seems less stable and degrades performance in some cases.

4. The results for data with f = {200, 500} are omitted from the table. In these cases there was no difference between the basic, longest and voted versions except when combining with other variants.

Finally compare the τ variant with voted perceptron and their combination. For the artificial data sets using τ alone (with last hypothesis) gives higher accuracy than using the voted perceptron alone. For the real-world data sets this trend is less pronounced. Concerning their combination we see that while τ always helps the last hypothesis, it only occasionally helps voted, and sometimes hurts it.

Table 5 gives detailed results for parameter search over τ on the adult data set, where we also parameterize the results by the number of examples in the training set. We see that voted perceptron and the τ variant give similar performance on this data set as well. We also see that, in contrast with the performance on the artificial data mentioned above, voted perceptron performs well even with small training set size (and small ratio of examples to features).

Noise Pctg. (N)   Nothing  τ     λ     α     τ×λ   τ×α
N = 0
  Last            95.4     97    95.6  96.1  96.8  97
  Longest         95.4     96.6  95.6  95.9  97.3  96.8
  Voted           95.4     96.6  95.4  96.1  96.8  97
N = 0.05
  Last            81.7     89.4  87    89.5  89.7  89.9
  Longest         86.6     89.7  87    89.3  89.9  89.9
  Voted           87.5     89.4  87    89.5  89.7  90.2
N = 0.1
  Last            77.8     84.6  81.9  85.1  84.9  85.8
  Longest         77.2     84.9  81.9  85.1  84.6  85.5
  Voted           81.6     84.9  81.9  85.1  84.9  85.6
N = 0.15
  Last            71.2     81.6  79.9  80.6  81.6  81.7
  Longest         72.9     80.7  79.9  79.2  80.7  81.7
  Voted           75.6     81.6  79.9  80.6  82.1  81.6
N = 0.25
  Last            59.9     70.7  68.2  70.7  71.1  71.1
  Longest         65       70.7  68.2  70.1  70.4  70.6
  Voted           68       70.4  69.4  70.7  70.6  71.1

Table 4: Parameter Search on Artificial Data Sets with f = 50 and M = 0.05

# examples        10           100          1000         5000        10000       20000
τ = 0
  Last            75.6 (3.5)   70.2 (9.2)   77.8 (3.1)   78.3 (4.1)  80.3 (2.8)  76.5 (11.1)
  Longest         75.7 (3.5)   76.5 (3.6)   79.9 (3.5)   81.8 (1.8)  82.9 (0.7)  82.3 (1.3)
  Voted           75.8 (3.4)   79.4 (1.8)   82.5 (0.5)   84 (0.7)    84.2 (0.6)  84.5 (0.5)
τ = 0.125
  Last            72.7 (4.6)   73.3 (8.9)   79.7 (3.7)   80.8 (2.7)  82.7 (1.4)  80.4 (3.9)
  Longest         73.1 (4.6)   76.6 (3.9)   81 (2.1)     81.1 (2.4)  81.8 (1.5)  82.1 (1.7)
  Voted           74.1 (4.4)   79.6 (1.4)   82.7 (0.5)   83.9 (0.7)  84.2 (0.6)  84.3 (0.4)
τ = 0.25
  Last            74.5 (1.8)   75.1 (7.4)   79.3 (5.1)   80.5 (2.8)  81.5 (2.2)  80.6 (3.7)
  Longest         73.9 (3.9)   77.2 (2.1)   80.3 (2.6)   81.8 (1.8)  81.7 (1.7)  82.3 (2.4)
  Voted           74.1 (4.7)   79.5 (1.5)   82.7 (0.5)   83.8 (0.7)  84.1 (0.7)  84.3 (0.5)
τ = 0.5
  Last            71.8 (6.3)   75.5 (1.9)   80.4 (3.4)   82.6 (1)    82.7 (1)    82.8 (1.5)
  Longest         74 (5.7)     77.2 (7.5)   80.4 (2.3)   82.1 (1.6)  82.5 (1.6)  80.8 (2.9)
  Voted           73.6 (5.6)   78.3 (0.8)   82.6 (0.6)   83.7 (0.7)  84 (0.7)    84.2 (0.5)
τ = 1
  Last            72.9 (6.6)   78.4 (2.5)   81.9 (1.2)   83 (1.1)    82.7 (1.5)  83.5 (0.8)
  Longest         75.9 (0.3)   75.9 (0.9)   78.3 (2.8)   82.5 (1.1)  83 (1)      81.9 (2.7)
  Voted           74.8 (3.4)   76.4 (0.9)   82.4 (0.6)   83.5 (0.7)  83.7 (0.7)  84.1 (0.5)
τ = 2
  Last            75.5 (1.2)   75.8 (2.6)   82.4 (0.4)   83.2 (0.9)  82.8 (1.2)  83.6 (0.8)
  Longest         70.7 (15.6)  75.9 (0.9)   77.8 (2.7)   81.4 (2.9)  82.5 (2)    81.9 (2.8)
  Voted           70.7 (15.6)  76.4 (1.8)   81.8 (1)     83.2 (0.7)  83.5 (0.6)  83.8 (0.6)
τ = 4
  Last            75.9 (0.3)   75.7 (1.4)   81.8 (1)     83.1 (2.1)  83.3 (0.6)  83.6 (0.6)
  Longest         70.7 (15.6)  75.9 (0.9)   75.9 (0.5)   81.8 (0.7)  79.4 (3.4)  80.8 (3.1)
  Voted           70.7 (15.6)  75.9 (0.9)   78.4 (2.1)   83 (0.7)    83.2 (0.6)  83.5 (0.6)

Table 5: Performance on "Adult" Data Set as a Function of Margin and Training Set Size (entries are accuracy with standard deviation in parentheses)

The table adds another important observation about stability of the algorithms. Note that since we report results for concrete values of τ we can measure the standard deviation in accuracy observed. One can see that both the τ variant and the voted perceptron significantly reduce the variance in results. The longest survivor does so most of the time but not always. The fact that the variants lead to more stable results is also consistently true across the artificial and UCI data sets discussed above and is an important feature of these algorithms.

Finally, recall that the average algorithm (Servedio, 1999), which updates exactly once on each example, is known to tolerate classification noise for data distributed uniformly on the sphere and where the threshold is zero. For comparison, we ran this algorithm on all the data reflected in the tables above and it was not competitive. Cursory experiments to investigate the differences show that, when the data has a zero threshold separator and when perceptron is run for just one iteration, then (as reported in Servedio, 1999) the average algorithm performs better than basic perceptron. However, the margin based perceptron performs better and this becomes more pronounced with non-zero threshold and multiple iterations. Thus in some sense the perceptron variants are doing something non-trivial that is beyond the scope of the simple average algorithm.

3.4 Parameter Optimization

We have run the parameter optimization on the real-world data sets and the artificial data sets. Selected results are given in Tables 6 and 7. For these experiments we report average accuracy across the outer cross-validation as well as a 95% T-confidence interval around these, as suggested in Mitchell (1997).⁵ No confidence interval is given for "USPS," "MNIST," or "Adult," as the outer cross-validation loop is not performed.

5. In more detail, we use the maximum likelihood estimate of the standard deviation σ, s = √((1/k) ∑ᵢ (yᵢ − ȳ)²), where k is the number of folds (here k = 10), yᵢ is the accuracy estimate in each fold and ȳ is the average accuracy. We then use the T confidence interval y ∈ ȳ ± t_{k−1, 0.975} √((1/(k(k−1))) ∑ᵢ (yᵢ − ȳ)²). Notice that since t_{9, 0.975} = 2.262 and k = 10 the confidence interval is 0.754 of the standard deviation.

In contrast with the tables for parameter search, columns of parameter variants in Tables 6 and 7 do include the inactive option. Hence the search in the τ column includes the value of τ = 0 as well. This makes sense in the context of parameter optimization since the algorithm can choose between different active values and the inactive value. Notice that the standard deviation in accuracies is high on these data sets. This highlights the difficulty of parameter selection and algorithm comparison and suggests that results from a single split into training and test sets that appear in the literature may not be reliable.
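The interval of footnote 5 is straightforward to compute from the per-fold accuracies; a small sketch follows. The hard-coded 2.262 is the t quantile for k = 10 folds quoted in the footnote; scipy.stats.t.ppf(0.975, k − 1) would generalize it to other fold counts.

import numpy as np

def t_confidence_interval(fold_accuracies):
    """95% T-interval of footnote 5 for k = 10 folds (t_{9, 0.975} = 2.262)."""
    y = np.asarray(fold_accuracies, dtype=float)
    k = len(y)
    assert k == 10, "the constant below is the t quantile for k = 10 folds"
    mean = y.mean()
    half_width = 2.262 * np.sqrt(np.sum((y - mean) ** 2) / (k * (k - 1)))
    return mean, half_width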


                         Nothing      τ            λ            α            τ×α          τ×λ          τ→α          τ→λ          SVM L1       SVM L2
Breast-cancer-wisconsin
  Last                   90.6 ± 2.3   95.2 ± 2.1   95.2 ± 3.1   94.7 ± 3.3   95.3 ± 2.6   96.3 ± 2.1   95.1 ± 2.3   96.9 ± 1.3
  Longest                96.9 ± 1.2   96.8 ± 1.7   97.2 ± 1.4   96.3 ± 1.9   96.8 ± 1.6   97 ± 1.6     96.8 ± 1.7   96.9 ± 1.8   96.8 ± 2.2   96.7 ± 1.7
  Voted                  96.9 ± 1.3   96.8 ± 1.4   97 ± 1.4     96.6 ± 1.6   97 ± 1.6     97.2 ± 1.4   96.8 ± 1.4   96.9 ± 1.5
Bupa
  Last                   57.5 ± 7.8   69.8 ± 6.5   61.5 ± 6.3   68.9 ± 4.1   69.9 ± 5.7   69.8 ± 5.8   69.8 ± 6.5   70.9 ± 5.9
  Longest                64.1 ± 7.1   64.1 ± 7.1   64.1 ± 6.6   65.3 ± 4.3   65.3 ± 4.3   64.1 ± 6.6   65.3 ± 4.3   64.1 ± 6.6   66.5 ± 8.6   63.2 ± 5.2
  Voted                  68.9 ± 5.8   68.9 ± 5.8   67.5 ± 6.4   68.9 ± 6.3   68.9 ± 6.3   67.5 ± 6.2   68.9 ± 6.3   67.2 ± 6.0
Wdbc
  Last                   92.4 ± 2.9   93 ± 2.7     93.2 ± 2.5   91.6 ± 2.7   92.8 ± 2.8   93.2 ± 2.8   93 ± 2.7     93.5 ± 2.2
  Longest                92.4 ± 2.0   91.7 ± 2.5   92.1 ± 2.5   92.3 ± 2.3   93.2 ± 1.7   92.5 ± 2.1   92.6 ± 2.1   91.4 ± 2.9   92.8 ± 4.4   94.7 ± 2.1
  Voted                  92.3 ± 2.3   92.3 ± 2.6   92.8 ± 2.7   91.7 ± 1.9   91.6 ± 2.2   92.4 ± 2.4   91.6 ± 2.6   92.8 ± 2.3
Crx
  Last                   62.5 ± 5.1   68.7 ± 2.9   64.5 ± 4.1   65.8 ± 4.1   68.1 ± 3.5   68.7 ± 2.2   68.7 ± 2.9   67.8 ± 2.8
  Longest                55.1 ± 7.3   60.4 ± 6.5   62.6 ± 4.8   62.6 ± 4.4   64.5 ± 4.9   60.9 ± 5.7   62.9 ± 4.6   59.4 ± 6.7   76.6 ± 13.7  65.9 ± 22.8
  Voted                  64.9 ± 4.0   64.2 ± 2.7   65.4 ± 3.2   66.2 ± 3.8   66.4 ± 3.7   64.9 ± 2.6   65.7 ± 3.3   64.3 ± 3.1
Ionosphere
  Last                   86.6 ± 3.8   87.2 ± 3.4   86.3 ± 3.5   85.7 ± 4.8   86.9 ± 3.4   86.9 ± 2.6   86.6 ± 3.0   86.6 ± 2.6
  Longest                87.2 ± 3.8   86.9 ± 2.6   87.8 ± 3.6   87.5 ± 3.5   86.9 ± 4.0   87.2 ± 3.2   86.9 ± 3.4   87.7 ± 2.9   87.1 ± 7.1   86.9 ± 4.2
  Voted                  88 ± 3.7     86.3 ± 4.3   86.6 ± 3.5   87.7 ± 3.2   87.5 ± 3.2   87.7 ± 3.1   86 ± 4.9     86 ± 3.9
Wpbc
  Last                   77.3 ± 4.7   76.7 ± 4.4   76.4 ± 4.4   75.1 ± 6.1   75.1 ± 6.1   77 ± 5.2     75.7 ± 6.2   78.3 ± 5.1
  Longest                77.2 ± 3.8   76.7 ± 5.3   75.8 ± 3.6   75.8 ± 6.0   76.9 ± 4.4   74.8 ± 4.3   75.8 ± 4.4   74.6 ± 5.4   78.6 ± 7.8   78.6 ± 8.4
  Voted                  78.8 ± 4.7   78.5 ± 4.8   79 ± 5.0     76.4 ± 4.8   76.9 ± 4.8   79 ± 5.0     76.9 ± 4.8   78.5 ± 5.1
Sonar
  Last                   71.9 ± 6.3   73.1 ± 5.4   72.4 ± 6.0   75.5 ± 6.4   72.1 ± 7.4   71.1 ± 6.5   74.1 ± 6.4   73.6 ± 5.1
  Longest                75.3 ± 5.1   77.6 ± 6.2   73.3 ± 6.0   74.3 ± 6.9   73.5 ± 5.9   74.1 ± 7.5   74 ± 5.5     78.1 ± 6.6   57.2 ± 11.9  55 ± 11.8
  Voted                  75.1 ± 6.0   75.6 ± 6.9   74.3 ± 5.6   77.5 ± 7.0   78.1 ± 7.0   77.1 ± 7.2   76.1 ± 7.9   75.6 ± 6.9
f = 50, N = 0.05, M = 0.05
  Last                   81.7 ± 6.5   87.7 ± 6.1   84.6 ± 6.6   87.2 ± 5.2   88 ± 5.2     89.4 ± 4     89 ± 4.6     89.5 ± 4.2
  Longest                86.6 ± 4.3   88.2 ± 4.2   85.6 ± 4.7   87.3 ± 6.1   89.4 ± 4.5   88.2 ± 3.9   89 ± 5.4     88.8 ± 3.6
  Voted                  87.5 ± 3.3   88.8 ± 4.8   86.3 ± 3.9   88.3 ± 4.2   88.2 ± 4.9   88.8 ± 4.8   88.7 ± 4.9   88.8 ± 4.8

Table 6: Parameter Optimization Results for UCI and Artificial Data Sets (the SVM L1 and SVM L2 columns give one result per data set, shown on the Longest row)


                             Nothing  τ     λ     τ×λ   τ→λ
USPS
  Last                       87.4     90.7  87.4  90.3  90.2
  Longest                    87.5     89.9  87.4  90.3  89.9
  Voted                      89.7     90.9  89.6  90.9  90.9
Adult
  Last                       81.9     83.9  83    83.8  83.8
  Longest                    83.4     83.4  83.4  83.6  83.4
  Voted                      84.4     84.2  84.4  84.3  84.3
MNIST
  Last                       85.6     87.1  85.5  87.1  87.1
  Longest                    79.8     86.8  79.8  86.8  86.8
  Voted                      88       88.4  88    88.4  88.4
USPS, degree 4 poly kernel
  Last                       92.9     93.6
  Longest                    91.6     92.9
  Voted                      92.7     93.2
MNIST, degree 4 poly kernel
  Last                       95.2     95.4
  Longest                    93.2     94.9
  Voted                      94.8     95

Table 7: Parameter Optimization Results for Large Data Sets

Table 6 shows improvement over the basic algorithm in all data sets where parameter search suggested a potential for improvement, and no decrease in performance in the other cases, so the parameter selection indeed picks good values. Both τ and the voted perceptron provide consistent improvement over the classical perceptron; the longest survivor provides improvement over the classical perceptron on its own, but a smaller one than voted or τ in most cases. Except for improvements over the classical perceptron, none of the differences between algorithms is significant according to the T-intervals calculated. As observed above in parameter search, the variants with α and λ offer improvements in some cases, but when they do, τ and voted almost always offer a better improvement. Ignoring the intervals we also see that neither the τ variant nor the voted perceptron dominates the other. Combining the two is sometimes better but may decrease performance in the high variance cases.

For further comparison we also ran experiments with kernel based versions of the algorithms. Typical results in the literature use higher degree polynomial kernels on the "MNIST" and "USPS" data sets. Table 7 includes results using τ with a degree 4 polynomial kernel for these data sets. We can see that for "MNIST" the variants make little difference in performance but that for "USPS" we get small improvements and the usual pattern relating the variants is observed.

Finally, Table 6 also gives results for SVM. We have used SVMlight (Joachims, 1999) and ran with several values for the constants controlling the soft margin. For L1 optimization the values used for the "-c" switch in SVMLight are {10⁻⁵, 10⁻⁴, 10⁻³, . . . , 10⁴}. For the L2 optimization, we used the following values for λ: {10⁻⁴, 10⁻³, . . . , 10³}. The results for SVM are given for the best parameters in a range of parameters tried. Thus these are parameter search values and essentially give upper bounds on the performance of SVM on these data sets. As can be seen the perceptron variants give similar accuracies and smaller variance and they are therefore an excellent alternative to SVM.

4. Discussion and Conclusions

The paper provides an experimental evaluation of several noise tolerant variants of the perceptron algorithm. The results are surprising since they suggest that the perceptron with margin is the most successful variant although it is the only one not designed for noise tolerance. The voted perceptron has similar performance in most cases, and it has the advantage that no parameter selection is required for it. The differences between voted and perceptron with margin are most noticeable in the artificial data sets, and the two are indistinguishable in their performance on the UCI data. The experiments also show that the soft-margin variants are generally weaker than the voted or margin based algorithms and they do not provide additional improvement in performance when combined with these. Both the voted perceptron and the margin variant reduced the deviation in accuracy in addition to improving the accuracy. This is an important property that adds to the stability of the algorithms. Combining voted and perceptron with margin has the potential for further improvements but can harm performance in high variance cases. In terms of run time, the voted perceptron does not require parameter selection and can therefore be faster to train. On the other hand its test time is slower, especially if one runs the primal version of the algorithm.

Overall, the results suggest that a good tradeoff is obtained by fixing a small value of τ; this gives significant improvements in performance without the training time penalty for parameter optimization or the penalty in test time for voting. In addition, since we have ruled out the use of the soft margin α variant we no longer need to run the algorithms for a large number of iterations; on the other hand, although on-line to batch conversion guarantees only hold for one iteration, experimental evidence suggests that a small number of iterations is typically helpful and sufficient.

Our results also highlight the problems involved with parameter selection. The method of double cross-validation is time intensive and our experiments for the large data sets were performed using the primal form of the algorithms since the dual form is too slow. In practice, with a large data set one can use a hold-out set for parameter selection so that run time is more manageable. In any case, such results must be accompanied by estimates of the deviation to provide a meaningful interpretation.

As mentioned in the introduction, a number of variants that perform margin based updates exist in the literature (Friess et al., 1998; Gentile, 2001; Li and Long, 2002; Crammer et al., 2005; Kivinen et al., 2004; Shalev-Shwartz and Singer, 2005; Tsampouka and Shawe-Taylor, 2005). The algorithms vary in some other details. Aggressive ROMMA (Li and Long, 2002) explicitly maximizes the margin on the new example, relative to an approximation of the constraints from previous examples. NORMA (Kivinen et al., 2004) performs gradient descent on the soft margin risk, resulting in an algorithm that rescales the old weight vector before the additive update. The Passive-Aggressive algorithm (Crammer et al., 2005) adapts η on each example to guarantee that it is immediately separable with margin (although the update size is capped to limit the effect of noisy examples). The Ballseptron (Shalev-Shwartz and Singer, 2005) establishes a normalized margin and replaces margin updates with updates on hypothetical examples on which a mistake would be made. ALMA (Gentile, 2001) renormalizes the weight vector so as to establish a normalized margin. ALMA is distinguished by tuning its parameters automatically during the on-line session, as well as having “p-norm variants” that can lead to other tradeoffs improving performance, for example, when the target is sparse. The algorithms of Tsampouka and Shawe-Taylor (2005) establish normalized margin by different normalization schemes. Following Freund and Schapire (1999), most of these algorithms come with mistake bound guarantees (or relative loss bounds) for the unrealizable case, that is, relative to the best hypothesis in a comparison class. Curiously, identical or similar bounds hold for the basic perceptron algorithm, so these results do not establish an advantage of the variants over the basic perceptron.
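
For concreteness, here is a sketch of one such margin based update, the PA-I version of the Passive-Aggressive rule cited above: the step size is chosen so that the current example becomes separable with functional margin 1, but is capped by an aggressiveness constant C to limit the influence of noisy examples. The constant C and the margin value 1 follow the usual presentation of that algorithm and are not parameters used in this paper.

```python
import numpy as np

def pa1_update(w, x, y, C=1.0):
    """One PA-I step: suffer hinge loss max(0, 1 - y <w, x>); if it is
    positive, move just far enough to satisfy the margin constraint,
    but never more than C (which bounds the damage of a noisy example)."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss > 0.0:
        step = min(C, loss / np.dot(x, x))
        w = w + step * y * x
    return w
```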

Another interesting algorithm, the Second Order Perceptron (Cesa-Bianchi et al., 2005), does not perform margin updates but uses spectral properties of the data in the updates in a way that can reduce mistakes in some cases.

Additional variants also exist for the on-line to batch conversions. Littlestone (1989) picks the best hypothesis using cross validation on a separate validation set. The Pocket algorithm with ratchet (Gallant, 1990) evaluates the hypotheses on the entire training set and picks the best, and the scheme of Cesa-Bianchi et al. (2004) evaluates each hypothesis on the remainder of the training set (after it made a mistake and is replaced) and adjusts for the different validation set sizes. The results in Cesa-Bianchi et al. (2004) show that loss bounds for the on-line setting can be translated to error bounds in the batch setting even in the unrealizable case. Finally, experiments with aggressive ROMMA (Li and Long, 2002; Li, 2000) have shown that adding a voted perceptron scheme can harm performance, just as we observed for the margin perceptron. To avoid this, Li (2000) develops a scheme that appears to work well where the voting is done on a tail of the sequence of hypotheses which is chosen adaptively (see also Dekel and Singer, 2005, for more recent work).
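
As a point of reference for these conversions, the following is a minimal sketch (under the same NumPy assumptions as the earlier sketches) of the voting scheme of Freund and Schapire (1999): each intermediate hypothesis is stored with the number of examples it survived, and prediction is a weighted vote; the averaged variant collapses the vote into a single weight vector.

```python
import numpy as np

def train_voted(X, Y, eta=1.0, epochs=1):
    """Run the perceptron and record every intermediate hypothesis together
    with its survival count (number of examples it classified correctly
    before the next mistake)."""
    w = np.zeros(X.shape[1])
    hypotheses, count = [], 0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            if y * np.dot(w, x) <= 0:
                hypotheses.append((w.copy(), count))
                w = w + eta * y * x
                count = 0
            else:
                count += 1
    hypotheses.append((w.copy(), count))
    return hypotheses

def predict_voted(hypotheses, x):
    """Weighted vote of the stored hypotheses; the averaged variant would
    instead predict with the single vector sum(c * w for w, c in hypotheses)."""
    return np.sign(sum(c * np.sign(np.dot(w, x)) for w, c in hypotheses))
```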

Consider the noise tolerance guarantees for the unrealizable case. Ideally, one would want to find a hypothesis whose error rate is only a small additive factor away from the error rate of the best hypothesis in the class of hypotheses under consideration. This is captured by the agnostic PAC learning model (Kearns et al., 1994). It may be worth pointing out here that, although we have relative loss bounds for several variants and these can be translated to some error bounds in the batch setting, the bounds are not sufficiently strong to provide significant guarantees in the agnostic PAC model. So this remains a major open problem to be solved.
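
Stated as a formula (a standard rendering of the agnostic requirement, not a bound proved in this paper): with probability at least 1 − δ over the sample, the learner's hypothesis h should satisfy

```latex
% err() denotes the error rate under the arbitrary, possibly noisy example
% distribution, and \mathcal{H} is the comparison class of hypotheses.
\[
  \mathrm{err}(h) \;\le\; \min_{h^{*} \in \mathcal{H}} \mathrm{err}(h^{*}) \;+\; \epsilon .
\]
```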

Seeing our experimental results in light of the discussion above, there are several interesting questions. The first is whether the more sophisticated versions of the margin perceptron add significant performance improvements. In particular, it would be useful to know which normalization scheme is useful in which contexts so that they can be clearly applied. We have raised parameter tuning as an important issue in terms of run time, and the self tuning capacity of ALMA and related algorithms seems promising as an effective solution. Given the failure of the simple longest survivor, it would be useful to evaluate the more robust versions of Gallant (1990) and Cesa-Bianchi et al. (2004). Notice, however, that these methods have a cost in training time. Alternatively, one could further investigate the tail variants of the voting scheme (Li, 2000) or the “averaging” version of voting (Freund and Schapire, 1999; Gentile, 2001), explaining when they work with different variants and why. Finally, as mentioned above, to our knowledge there is no theoretical explanation of the fact that the perceptron with margin performs better than the basic perceptron in the presence of noise. Resolving this is an important problem.

Acknowledgments

This work was partially supported by NSF grant IIS-0099446 and by a Research Semester Fellowship Award from Tufts University. Many of the experiments were run on the Research Linux Cluster provided by Tufts Computing and Communications Services. We are grateful to Phil Long for pointing out the results in Li (2000), and to the reviewers for pointing out Littlestone (1989) and Cesa-Bianchi et al. (2004) and for comments that helped improve the presentation of the paper.

References

B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, 1995.

M. Collins. Ranking algorithms for named entity extraction: Boosting and the voted perceptron. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 489–496, 2002.

K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2005.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

O. Dekel and Y. Singer. Data driven online to batch conversions. In Advances in Neural Information Processing Systems 18 (Proceedings of NIPS 2005), 2005.

T. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40:139–158, 2000.

T. Dietterich, M. Kearns, and Y. Mansour. Applying the weak learning framework to understand and improve C4.5. In Proceedings of the International Conference on Machine Learning, pages 96–104, 1996.

Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37:277–296, 1999.

T. Friess, N. Cristianini, and C. Campbell. The kernel adatron algorithm: A fast and simple learning procedure for support vector machines. In Proceedings of the International Conference on Machine Learning, pages 188–196, 1998.

S. Gallant. Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, 1(2):179–191, 1990.

A. Garg and D. Roth. Margin distribution and learning algorithms. In Proceedings of the International Conference on Machine Learning (ICML), pages 210–217, 2003.

C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.

T. Graepel, R. Herbrich, and R. Williamson. From margin to sparsity. In Advances in Neural Information Processing Systems 13 (Proceedings of NIPS 2000), pages 210–216, 2000.

T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.

M. Kearns, M. Li, L. Pitt, and L. G. Valiant. Recent results on Boolean concept learning. In Proceedings of the Fourth International Workshop on Machine Learning, pages 337–352, 1987.

M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2/3):115–141, 1994.

J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.

A. Kowalczyk, A. Smola, and R. Williamson. Kernel machines and Boolean functions. In Advances in Neural Information Processing Systems 14 (Proceedings of NIPS 2001), pages 439–446, 2001.

W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20(11):745–752, 1987.

Y. Li. Selective voting for perceptron-like online learning. In Proceedings of the International Conference on Machine Learning, pages 559–566, 2000.

Y. Li and P. Long. The relaxed online maximum margin algorithm. Machine Learning, 46:361–387, 2002.

Y. Li, H. Zaragoza, R. Herbrich, J. Shawe-Taylor, and J. Kandola. The perceptron algorithm with uneven margins. In Proceedings of the International Conference on Machine Learning, pages 379–386, 2002.

N. Littlestone. From on-line to batch learning. In Proceedings of the Conference on Computational Learning Theory, pages 269–284, 1989.

T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.

D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz. UCI repository of machine learning databases, 1998. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

A. B. Novikoff. On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12:615–622, 1962.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.

R.A. Servedio. On PAC learning using Winnow, Perceptron, and a Perceptron-like algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 296–307, 1999.

S. Shalev-Shwartz and Y. Singer. A new perspective on an old perceptron algorithm. In Proceedings of the Sixteenth Annual Conference on Computational Learning Theory, pages 264–278, 2005.

P. Tsampouka and J. Shawe-Taylor. Analysis of generic perceptron-like large margin classifiers. In Proceedings of the European Conference on Machine Learning, pages 750–758, 2005.

L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
