MSRI Workshop on Nonlinear Estimation and Classification, 2002.

The Boosting Approach to Machine Learning: An Overview

Robert E. Schapire
AT&T Labs - Research, Shannon Laboratory
180 Park Avenue, Room A203, Florham Park, NJ 07932 USA
www.research.att.com/~schapire

December 19, 2001
Abstract

Boosting is a general method for improving the accuracy of any given learning algorithm. Focusing primarily on the AdaBoost algorithm, this chapter overviews some of the recent work on boosting including analyses of AdaBoost's training error and generalization error; boosting's connection to game theory and linear programming; the relationship between boosting and logistic regression; extensions of AdaBoost for multiclass classification problems; methods of incorporating human knowledge into boosting; and experimental and applied work using boosting.
1 Introduction
Machine learning studies automatic techniques for learning to make accurate predictions based on past observations. For example, suppose that we would like to build an email filter that can distinguish spam (junk) email from non-spam. The machine-learning approach to this problem would be the following: Start by gathering as many examples as possible of both spam and non-spam emails. Next, feed these examples, together with labels indicating if they are spam or not, to your favorite machine-learning algorithm which will automatically produce a classification or prediction rule. Given a new, unlabeled email, such a rule attempts to predict if it is spam or not. The goal, of course, is to generate a rule that makes the most accurate predictions possible on new test examples.
Building a highly accurate prediction rule is certainly a difficult task. On the other hand, it is not hard at all to come up with very rough rules of thumb that are only moderately accurate. An example of such a rule is something like the following: "If the phrase 'buy now' occurs in the email, then predict it is spam." Such a rule will not even come close to covering all spam messages; for instance, it really says nothing about what to predict if 'buy now' does not occur in the message. On the other hand, this rule will make predictions that are significantly better than random guessing.
Boosting, the machine-learning method that is the subject of this chapter, is based on the observation that finding many rough rules of thumb can be a lot easier than finding a single, highly accurate prediction rule. To apply the boosting approach, we start with a method or algorithm for finding the rough rules of thumb. The boosting algorithm calls this "weak" or "base" learning algorithm repeatedly, each time feeding it a different subset of the training examples (or, to be more precise, a different distribution or weighting over the training examples¹). Each time it is called, the base learning algorithm generates a new weak prediction rule, and after many rounds, the boosting algorithm must combine these weak rules into a single prediction rule that, hopefully, will be much more accurate than any one of the weak rules.
To make this approach work, there are two fundamental questions that must be answered: first, how should each distribution be chosen on each round, and second, how should the weak rules be combined into a single rule? Regarding the choice of distribution, the technique that we advocate is to place the most weight on the examples most often misclassified by the preceding weak rules; this has the effect of forcing the base learner to focus its attention on the "hardest" examples. As for combining the weak rules, simply taking a (weighted) majority vote of their predictions is natural and effective.
There is also the question of what to use for the base learning algorithm, but this question we purposely leave unanswered so that we will end up with a general boosting procedure that can be combined with any base learning algorithm.
Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above. This chapter presents an overview of some of the recent work on boosting, focusing especially on the AdaBoost algorithm which has undergone intense theoretical study and empirical testing.
¹ A distribution over training examples can be used to generate a subset of the training examples simply by sampling repeatedly from the distribution.
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.

Initialize $D_1(i) = 1/m$.

For $t = 1, \ldots, T$:
- Train base learner using distribution $D_t$.
- Get base classifier $h_t : X \to \mathbb{R}$.
- Choose $\alpha_t \in \mathbb{R}$.
- Update:
  $D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i h_t(x_i)) / Z_t$
  where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).

Output the final classifier:
  $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

Figure 1: The boosting algorithm AdaBoost.
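For concreteness, the following is a minimal NumPy sketch of the AdaBoost pseudocode of Fig. 1, using an exhaustive search over decision stumps as a hypothetical base learner; the function names, the stump search, and the small-epsilon guard are our own additions, not part of the algorithm as stated:

```python
import numpy as np

def stump_learner(X, y, D):
    """Pick the threshold stump with minimum weighted error under D."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    j, thresh, sign = best
    return lambda Z: np.where(Z[:, j] > thresh, sign, -sign)

def adaboost(X, y, T):
    m = len(y)
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = stump_learner(X, y, D)           # train base learner on D_t
        eps = D[h(X) != y].sum()             # weighted error epsilon_t
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # Eq. (1)
        D *= np.exp(-alpha * y * h(X))       # reweight, then renormalize by Z_t
        D /= D.sum()
        hs.append(h)
        alphas.append(alpha)
    # final classifier H(x) = sign(sum_t alpha_t h_t(x))
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))
```

On a toy separable set such as $X = (0, 1, 2, 3)$ with labels $(-1, -1, +1, +1)$, a few rounds suffice for the combined classifier to fit the data exactly.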
2 AdaBoost
Working in Valiant's PAC (probably approximately correct) learning model [75], Kearns and Valiant [41, 42] were the first to pose the question of whether a "weak" learning algorithm that performs just slightly better than random guessing can be "boosted" into an arbitrarily accurate "strong" learning algorithm. Schapire [66] came up with the first provable polynomial-time boosting algorithm in 1989. A year later, Freund [26] developed a much more efficient boosting algorithm which, although optimal in a certain sense, nevertheless suffered like Schapire's algorithm from certain practical drawbacks. The first experiments with these early boosting algorithms were carried out by Drucker, Schapire and Simard [22] on an OCR task.
The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [32], solved many of the practical difficulties of the earlier boosting algorithms, and is the focus of this paper. Pseudocode for AdaBoost is given in Fig. 1 in the slightly generalized form given by Schapire and Singer [70]. The algorithm takes as input a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or instance space $X$, and each label $y_i$ is in some label set $Y$. For most of this paper, we assume $Y = \{-1, +1\}$; in Section 7, we discuss extensions to the multiclass case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$. One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the base learner is forced to focus on the hard examples in the training set.
The base learner's job is to find a base classifier $h_t : X \to \mathbb{R}$ appropriate for the distribution $D_t$. (Base classifiers were also called rules of thumb or weak prediction rules in Section 1.) In the simplest case, the range of each $h_t$ is binary, i.e., restricted to $\{-1, +1\}$; the base learner's job then is to minimize the error

$$\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \ne y_i\right].$$
Once the base classifier $h_t$ has been received, AdaBoost chooses a parameter $\alpha_t \in \mathbb{R}$ that intuitively measures the importance that it assigns to $h_t$. In the figure, we have deliberately left the choice of $\alpha_t$ unspecified. For binary $h_t$, we typically set

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) \qquad (1)$$

as in the original description of AdaBoost given by Freund and Schapire [32]. More on choosing $\alpha_t$ follows in Section 3. The distribution $D_t$ is then updated using the rule shown in the figure. The final or combined classifier $H$ is a weighted majority vote of the $T$ base classifiers where $\alpha_t$ is the weight assigned to $h_t$.

3 Analyzing the training error
The most basic theoretical property of AdaBoost concerns its ability to reduce the training error, i.e., the fraction of mistakes on the training set. Specifically, Schapire and Singer [70], in generalizing a theorem of Freund and Schapire [32], show that the training error of the final classifier is bounded as follows:

$$\frac{1}{m} \left|\{i : H(x_i) \ne y_i\}\right| \le \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \prod_t Z_t \qquad (2)$$

where henceforth we define

$$f(x) = \sum_t \alpha_t h_t(x) \qquad (3)$$

so that $H(x) = \mathrm{sign}(f(x))$. (For simplicity of notation, we write $\sum_i$ and $\sum_t$ as shorthand for $\sum_{i=1}^{m}$ and $\sum_{t=1}^{T}$, respectively.) The inequality follows from the fact that $e^{-y_i f(x_i)} \ge 1$ if $y_i \ne H(x_i)$. The equality can be proved straightforwardly by unraveling the recursive definition of $D_t$.
Eq. (2) suggests that the training error can be reduced most rapidly (in a greedy way) by choosing $\alpha_t$ and $h_t$ on each round to minimize

$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)). \qquad (4)$$

In the case of binary classifiers, this leads to the choice of $\alpha_t$ given in Eq. (1) and gives a bound on the training error of

$$\prod_t Z_t = \prod_t \left[2\sqrt{\epsilon_t(1 - \epsilon_t)}\right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right) \qquad (5)$$

where we define $\gamma_t = 1/2 - \epsilon_t$. This bound was first proved by Freund and Schapire [32]. Thus, if each base classifier is slightly better than random so that $\gamma_t \ge \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast in $T$ since the bound in Eq. (5) is at most $e^{-2\gamma^2 T}$. This bound, combined with the bounds on generalization error given below, proves that AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a true weak learning algorithm (that can always generate a classifier with a weak edge for any distribution) into a strong learning algorithm (that can generate a classifier with an arbitrarily low error rate, given sufficient data).
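As a quick numeric illustration (with an assumed uniform edge $\gamma_t = \gamma = 0.1$; the numbers are ours, not from the text), the product of the $Z_t$ in Eq. (5) and the exponential bound can be tabulated directly:

```python
import math

# Assumed uniform edge gamma_t = gamma on every round (illustrative value).
gamma = 0.1

for T in (10, 100, 500):
    product_Z = math.sqrt(1 - 4 * gamma ** 2) ** T   # product of Z_t in Eq. (5)
    bound = math.exp(-2 * gamma ** 2 * T)            # exp(-2 * sum_t gamma_t^2)
    print(f"T={T:4d}  prod Z_t={product_Z:.3e}  bound={bound:.3e}")
```

Both columns shrink exponentially in $T$, and the product of the $Z_t$ never exceeds the bound, as Eq. (5) requires.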
Eq. (2) points to the fact that, at heart, AdaBoost is a procedure for finding a linear combination $f$ of base classifiers which attempts to minimize

$$\sum_i \exp(-y_i f(x_i)) = \sum_i \exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right). \qquad (6)$$

Essentially, on each round, AdaBoost chooses $h_t$ (by calling the base learner) and then sets $\alpha_t$ to add one more term to the accumulating weighted sum of base classifiers in such a way that the sum of exponentials above will be maximally reduced. In other words, AdaBoost is doing a kind of steepest descent search to minimize Eq. (6) where the search is constrained at each step to follow coordinate directions (where we identify coordinates with the weights assigned to base classifiers). This view of boosting and its generalization are examined in considerable detail by Duffy and Helmbold [23], Mason et al. [51, 52] and Friedman [35]. See also Section 6.
Schapire and Singer [70] discuss the choice of $\alpha_t$ and $h_t$ in the case that $h_t$ is real-valued (rather than binary). In this case, $h_t(x)$ can be interpreted as a "confidence-rated prediction" in which the sign of $h_t(x)$ is the predicted label, while the magnitude $|h_t(x)|$ gives a measure of confidence. Here, Schapire and Singer advocate choosing $\alpha_t$ and $h_t$ so as to minimize $Z_t$ (Eq. (4)) on each round.
4 Generalization error
In studying and designing learning algorithms, we are of course interested in performance on examples not seen during training, i.e., in the generalization error, the topic of this section. Unlike Section 3 where the training examples were arbitrary, here we assume that all examples (both train and test) are generated i.i.d. from some unknown distribution on $X \times Y$. The generalization error is the probability of misclassifying a new example, while the test error is the fraction of mistakes on a newly sampled test set (thus, generalization error is expected test error). Also, for simplicity, we restrict our attention to binary base classifiers.
Freund and Schapire [32] showed how to bound the generalization error of the final classifier in terms of its training error, the size $m$ of the sample, the VC-dimension² $d$ of the base classifier space and the number of rounds $T$ of boosting. Specifically, they used techniques from Baum and Haussler [5] to show that the generalization error, with high probability, is at most³

$$\hat{\Pr}\left[H(x) \ne y\right] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$$

where $\hat{\Pr}[\cdot]$ denotes empirical probability on the training sample. This bound suggests that boosting will overfit if run for too many rounds, i.e., as $T$ becomes large. In fact, this sometimes does happen. However, in early experiments, several authors [8, 21, 59] observed empirically that boosting often does not overfit, even when run for thousands of rounds. Moreover, it was observed that AdaBoost would sometimes continue to drive down the generalization error long after the training error had reached zero, clearly contradicting the spirit of the bound above. For instance, the left side of Fig. 2 shows the training and test curves of running boosting on top of Quinlan's C4.5 decision-tree learning algorithm [60] on the "letter" dataset.
In response to these empirical findings, Schapire et al. [69], following the work of Bartlett [3], gave an alternative analysis in terms of the margins of the training examples. The margin of example $(x, y)$ is defined to be

$$\mathrm{margin}(x, y) = \frac{y f(x)}{\sum_t |\alpha_t|} = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t |\alpha_t|}.$$
² The Vapnik-Chervonenkis (VC) dimension is a standard measure of the "complexity" of a space of binary functions. See, for instance, refs. [6, 76] for its definition and relation to learning theory.

³ The "soft-Oh" notation $\tilde{O}(\cdot)$, here used rather informally, is meant to hide all logarithmic and constant factors (in the same way that standard "big-Oh" notation hides only constant factors).
Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the letter dataset as reported by Schapire et al. [69]. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting. The horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. Right: The cumulative distribution of margins of the training examples after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves, respectively.
It is a number in $[-1, +1]$ and is positive if and only if $H$ correctly classifies the example. Moreover, as before, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. Schapire et al. proved that larger margins on the training set translate into a superior upper bound on the generalization error. Specifically, the generalization error is at most

$$\hat{\Pr}\left[\mathrm{margin}(x, y) \le \theta\right] + \tilde{O}\left(\sqrt{\frac{d}{m \theta^2}}\right)$$

for any $\theta > 0$ with high probability. Note that this bound is entirely independent of $T$, the number of rounds of boosting. In addition, Schapire et al. proved that boosting is particularly aggressive at reducing the margin (in a quantifiable sense) since it concentrates on the examples with the smallest margins (whether positive or negative). Boosting's effect on the margins can be seen empirically, for instance, on the right side of Fig. 2 which shows the cumulative distribution of margins of the training examples on the "letter" dataset. In this case, even after the training error reaches zero, boosting continues to increase the margins of the training examples effecting a corresponding drop in the test error.
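Given the per-round base-classifier outputs and weights, the margins are straightforward to compute; the sketch below uses made-up predictions and weights purely for illustration:

```python
import numpy as np

def margins(preds, alphas, y):
    """Normalized margins y*f(x) / sum_t |alpha_t|, with preds[t, i] = h_t(x_i)."""
    f = alphas @ preds                      # f(x_i) = sum_t alpha_t h_t(x_i)
    return y * f / np.abs(alphas).sum()     # always lies in [-1, +1]

# Toy run: 3 rounds of +-1 base predictions on 4 examples (made-up numbers).
preds = np.array([[ 1,  1, -1, -1],
                  [ 1, -1, -1,  1],
                  [ 1,  1,  1, -1]])
alphas = np.array([0.5, 0.3, 0.2])
y = np.array([1, 1, -1, -1])
m = margins(preds, alphas, y)   # positive entries <=> correctly classified
```

Sorting the resulting values and plotting the fraction of examples at or below each value would reproduce a cumulative margin distribution like the right panel of Fig. 2.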
Although the margins theory gives a qualitative explanation of the effectiveness of boosting, quantitatively, the bounds are rather weak. Breiman [9], for instance, shows empirically that one classifier can have a margin distribution that is uniformly better than that of another classifier, and yet be inferior in test accuracy. On the other hand, Koltchinskii, Panchenko and Lozano [44, 45, 46, 58] have recently proved new margin-theoretic bounds that are tight enough to give useful quantitative predictions.
Attempts (not always successful) to use the insights gleaned from the theory of margins have been made by several authors [9, 37, 50]. In addition, the margin theory points to a strong connection between boosting and the support-vector machines of Vapnik and others [7, 14, 77] which explicitly attempt to maximize the minimum margin.
5 A connection to game theory and linear programming
The behavior of AdaBoost can also be understood in a game-theoretic setting as explored by Freund and Schapire [31, 33] (see also Grove and Schuurmans [37] and Breiman [9]). In classical game theory, it is possible to put any two-person, zero-sum game in the form of a matrix $M$. To play the game, one player chooses a row $i$ and the other player chooses a column $j$. The loss to the row player (which is the same as the payoff to the column player) is $M_{ij}$. More generally, the two sides may play randomly, choosing distributions $P$ and $Q$ over rows or columns, respectively. The expected loss then is $P^\top M Q$.
Boosting can be viewed as repeated play of a particular game matrix. Assume that the base classifiers are binary, and let $\mathcal{H} = \{h_1, \ldots, h_N\}$ be the entire base classifier space (which we assume for now to be finite). For a fixed training set $(x_1, y_1), \ldots, (x_m, y_m)$, the game matrix $M$ has $m$ rows and $N$ columns where

$$M_{ij} = \begin{cases} 1 & \text{if } h_j(x_i) = y_i \\ 0 & \text{otherwise.} \end{cases}$$

The row player now is the boosting algorithm, and the column player is the base learner. The boosting algorithm's choice of a distribution $D_t$ over training examples becomes a distribution $P$ over rows of $M$, while the base learner's choice of a base classifier $h_t$ becomes the choice of a column $j$ of $M$.
As an example of the connection between boosting and game theory, consider von Neumann's famous minmax theorem which states that

$$\min_P \max_Q P^\top M Q = \max_Q \min_P P^\top M Q$$

for any matrix $M$. When applied to the matrix just defined and reinterpreted in the boosting setting, this can be shown to have the following meaning: If, for any distribution over examples, there exists a base classifier with error at most $1/2 - \gamma$, then there exists a convex combination of base classifiers with a margin of at least $2\gamma$ on all training examples. AdaBoost seeks to find such a final classifier with high margin on all examples by combining many base classifiers; so in a sense, the minmax theorem tells us that AdaBoost at least has the potential for success since, given a "good" base learner, there must exist a good combination of base classifiers. Going much further, AdaBoost can be shown to be a special case of a more general algorithm for playing repeated games, or for approximately solving matrix games. This shows that, asymptotically, the distribution over training examples as well as the weights over base classifiers in the final classifier have game-theoretic interpretations as approximate minmax or maxmin strategies.
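The value of such a game, and an optimal mixed strategy, can be computed by linear programming (anticipating the connection discussed next). The following sketch, which assumes `scipy` is available and whose function name and toy matrix are our own, solves a 2-by-2 instance in which each of two base classifiers is correct on exactly one of two examples; at equilibrium the value is 1/2:

```python
import numpy as np
from scipy.optimize import linprog

def game_value(M):
    """Value v and optimal column mixture Q for payoff matrix M, where the
    column player maximizes the row-minimum of M @ Q, solved as an LP."""
    m, n = M.shape
    # Variables: [Q_1, ..., Q_n, v]; minimizing -v is maximizing v.
    c = np.append(np.zeros(n), -1.0)
    A_ub = np.hstack([-M, np.ones((m, 1))])            # v - (M Q)_i <= 0
    b_ub = np.zeros(m)
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)   # sum_j Q_j = 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun, res.x[:n]

# M[i, j] = 1 if base classifier h_j is correct on example x_i.
M = np.array([[1.0, 0.0],
              [0.0, 1.0]])
v, Q = game_value(M)
```

Here the base learner's best guarantee is accuracy 1/2 (zero edge), and the optimal mixture puts weight 1/2 on each classifier; with any positive edge $\gamma$ the same LP would certify a mixture with margin at least $2\gamma$ everywhere.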
The problem of solving (finding optimal strategies for) a zero-sum game is well known to be solvable using linear programming. Thus, this formulation of the boosting problem as a game also connects boosting to linear, and more generally convex, programming. This connection has led to new algorithms and insights as explored by Rätsch et al. [62], Grove and Schuurmans [37] and Demiriz, Bennett and Shawe-Taylor [17].
In another direction, Schapire [68] describes and analyzes the generalization of both AdaBoost and Freund's earlier "boost-by-majority" algorithm [26] to a broader family of repeated games called "drifting games."
6 Boosting and logistic regression

Classification generally is the problem of predicting the label $y$ of an example $x$ with the intention of minimizing the probability of an incorrect prediction. However, it is often useful to estimate the probability of a particular label. Friedman, Hastie and Tibshirani [34] suggested a method for using the output of AdaBoost to make reasonable estimates of such probabilities. Specifically, they suggested using a logistic function, and estimating

$$\Pr\left[y = +1 \mid x\right] = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}} \qquad (7)$$

where, as usual, $f(x)$ is the weighted average of base classifiers produced by AdaBoost (Eq. (3)). The rationale for this choice is the close connection between the log loss (negative log likelihood) of such a model, namely,

$$\sum_i \ln\left(1 + e^{-2 y_i f(x_i)}\right) \qquad (8)$$
and the function that, we have already noted, AdaBoost attempts to minimize:

$$\sum_i e^{-y_i f(x_i)}. \qquad (9)$$

Specifically, it can be verified that Eq. (8) is upper bounded by Eq. (9). In addition, if we add the constant $1 - \ln 2$ to Eq. (8) (which does not affect its minimization), then it can be verified that the resulting function and the one in Eq. (9) have identical Taylor expansions around zero up to second order; thus, their behavior near zero is very similar. Finally, it can be shown that, for any distribution over pairs $(x, y)$, the expectations

$$\mathbb{E}\left[\ln\left(1 + e^{-2 y f(x)}\right)\right] \quad \text{and} \quad \mathbb{E}\left[e^{-y f(x)}\right]$$

are minimized by the same (unconstrained) function $f$, namely,

$$f(x) = \frac{1}{2} \ln\left(\frac{\Pr[y = +1 \mid x]}{\Pr[y = -1 \mid x]}\right).$$

Thus, for all these reasons, minimizing Eq. (9), as is done by AdaBoost, can be viewed as a method of approximately minimizing the negative log likelihood given in Eq. (8). Therefore, we may expect Eq. (7) to give a reasonable probability estimate.
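These claims about Eqs. (8) and (9) (the pointwise upper bound, and the second-order agreement after shifting by $1 - \ln 2$) can be verified numerically; the sketch below, in our own notation, treats both losses as functions of $z = y f(x)$:

```python
import math

def log_loss(z):
    """Per-example logistic loss of Eq. (8) as a function of z = y*f(x)."""
    return math.log1p(math.exp(-2 * z))

def exp_loss(z):
    """Per-example exponential loss of Eq. (9) as a function of z = y*f(x)."""
    return math.exp(-z)

def shifted(z):
    """Log loss plus the constant 1 - ln 2, which matches exp_loss near z = 0."""
    return log_loss(z) + 1 - math.log(2)

# Eq. (9) upper-bounds Eq. (8) pointwise ...
zs = [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]
assert all(log_loss(z) <= exp_loss(z) for z in zs)

# ... and the shifted log loss agrees with the exponential loss to second
# order at z = 0: same value, and matching first and second derivatives
# (checked here by central finite differences).
h = 1e-4
d1 = lambda g, z: (g(z + h) - g(z - h)) / (2 * h)
d2 = lambda g, z: (g(z + h) - 2 * g(z) + g(z - h)) / h ** 2
assert abs(shifted(0.0) - exp_loss(0.0)) < 1e-9
assert abs(d1(shifted, 0.0) - d1(exp_loss, 0.0)) < 1e-6
assert abs(d2(shifted, 0.0) - d2(exp_loss, 0.0)) < 1e-3
```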
Of course, as Friedman, Hastie and Tibshirani point out, rather than minimizing the exponential loss in Eq. (6), we could attempt instead to directly minimize the logistic loss in Eq. (8). To this end, they propose their LogitBoost algorithm. A different, more direct modification of AdaBoost for logistic loss was proposed by Collins, Schapire and Singer [13]. Following up on work by Kivinen and Warmuth [43] and Lafferty [47], they derive this algorithm using a unification of logistic regression and boosting based on Bregman distances. This work further connects boosting to the maximum-entropy literature, particularly the iterative-scaling family of algorithms [15, 16]. They also give unified proofs of convergence to optimality for a family of new and old algorithms, including AdaBoost, for both the exponential loss used by AdaBoost and the logistic loss used for logistic regression. See also the later work of Lebanon and Lafferty [48] who showed that logistic regression and boosting are in fact solving the same constrained optimization problem, except that in boosting, certain normalization constraints have been dropped.
For logistic regression, we attempt to minimize the loss function

$$\sum_i \ln\left(1 + e^{-y_i f(x_i)}\right) \qquad (10)$$

which is the same as in Eq. (8) except for an inconsequential change of constants in the exponent. The modification of AdaBoost proposed by Collins, Schapire and Singer to handle this loss function is particularly simple. In AdaBoost, unraveling the definition of $D_t$ given in Fig. 1 shows that $D_t(i)$ is proportional (i.e., equal up to normalization) to

$$\exp(-y_i f_{t-1}(x_i))$$

where we define

$$f_t(x) = \sum_{t'=1}^{t} \alpha_{t'} h_{t'}(x).$$

To minimize the loss function in Eq. (10), the only necessary modification is to redefine $D_t(i)$ to be proportional to

$$\frac{1}{1 + \exp(y_i f_{t-1}(x_i))}.$$

A very similar algorithm is described by Duffy and Helmbold [23]. Note that in each case, the weight on the examples, viewed as a vector, is proportional to the negative gradient of the respective loss function. This is because both algorithms are doing a kind of functional gradient descent, an observation that is spelled out and exploited by Breiman [9], Duffy and Helmbold [23], Mason et al. [51, 52] and Friedman [35].
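This correspondence between example weights and loss gradients can be checked numerically; in the sketch below (our own notation, writing $z = y_i f_{t-1}(x_i)$), each weighting rule matches minus the derivative of its per-example loss:

```python
import math

def exp_weight(z):
    """AdaBoost weighting: D(i) proportional to e^(-z), z = y_i f_{t-1}(x_i)."""
    return math.exp(-z)

def logistic_weight(z):
    """Modified rule for the logistic loss: D(i) proportional to 1/(1 + e^z)."""
    return 1.0 / (1.0 + math.exp(z))

# Each weight equals minus the derivative of its per-example loss in z:
#   d/dz e^(-z)          = -e^(-z)
#   d/dz ln(1 + e^(-z))  = -1/(1 + e^z)
h = 1e-6
for z in (-2.0, 0.0, 2.0):
    g_exp = (math.exp(-(z + h)) - math.exp(-(z - h))) / (2 * h)
    g_log = (math.log1p(math.exp(-(z + h))) - math.log1p(math.exp(-(z - h)))) / (2 * h)
    assert abs(-g_exp - exp_weight(z)) < 1e-4
    assert abs(-g_log - logistic_weight(z)) < 1e-4
```

Note how the logistic weights saturate at 1 for badly misclassified examples (large positive $y_i f_{t-1}(x_i)$ deficits), whereas the exponential weights grow without bound, which is one intuition for AdaBoost's noise sensitivity discussed later.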
Besides logistic regression, there have been a number of approaches taken to apply boosting to more general regression problems in which the labels $y_i$ are real numbers and the goal is to produce real-valued predictions that are close to these labels. Some of these, such as those of Ridgeway [63] and Freund and Schapire [32], attempt to reduce the regression problem to a classification problem. Others, such as those of Friedman [35] and Duffy and Helmbold [24] use the functional gradient descent view of boosting to derive algorithms that directly minimize a loss function appropriate for regression. Another boosting-based approach to regression was proposed by Drucker [20].
7 Multiclass classification
There are several methods of extending AdaBoost to the multiclass case. The most straightforward generalization [32], called AdaBoost.M1, is adequate when the base learner is strong enough to achieve reasonably high accuracy, even on the hard distributions created by AdaBoost. However, this method fails if the base learner cannot achieve at least 50% accuracy when run on these hard distributions.
For the latter case, several more sophisticated methods have been developed. These generally work by reducing the multiclass problem to a larger binary problem. Schapire and Singer's [70] algorithm AdaBoost.MH works by creating a set of binary problems, for each example $x$ and each possible label $y$, of the form: "For example $x$, is the correct label $y$ or is it one of the other labels?" Freund and Schapire's [32] algorithm AdaBoost.M2 (which is a special case of Schapire and Singer's [70] AdaBoost.MR algorithm) instead creates binary problems, for each example $x$ with correct label $y$ and each incorrect label $y'$, of the form: "For example $x$, is the correct label $y$ or $y'$?"

These methods require additional effort in the design of the base learning algorithm. A different technique [67], which incorporates Dietterich and Bakiri's [19] method of error-correcting output codes, achieves similar provable bounds to those of AdaBoost.MH and AdaBoost.M2, but can be used with any base learner that can handle simple, binary labeled data. Schapire and Singer [70] and Allwein, Schapire and Singer [2] give yet another method of combining boosting with error-correcting output codes.
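As a sketch of the AdaBoost.MH-style reduction (our own illustrative code, covering only the expansion of the data, not the bookkeeping of weights over the resulting (example, label) pairs):

```python
def expand_mh(examples, labels):
    """AdaBoost.MH-style reduction: each (x, y) with y in `labels` becomes one
    binary example ((x, l), +1 or -1) for every possible label l, asking
    'is l the correct label for x?'."""
    binary = []
    for x, y in examples:
        for l in labels:
            binary.append(((x, l), +1 if l == y else -1))
    return binary

# Toy 3-class text-categorization data (hypothetical examples and labels).
data = [("doc1", "sports"), ("doc2", "politics")]
pairs = expand_mh(data, ["sports", "politics", "business"])
# 2 examples x 3 labels -> 6 binary examples, one positive per original example
```

A binary booster run on the expanded pairs then yields one confidence score per (example, label) pair, and prediction picks the label with the highest score.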
8 Incorporating human knowledge
Boosting, like many machine-learning methods, is entirely data-driven in the sense that the classifier it generates is derived exclusively from the evidence present in the training data itself. When data is abundant, this approach makes sense. However, in some applications, data may be severely limited, but there may be human knowledge that, in principle, might compensate for the lack of data.

In its standard form, boosting does not allow for the direct incorporation of such prior knowledge. Nevertheless, Rochery et al. [64, 65] describe a modification of boosting that combines and balances human expertise with available training data. The aim of the approach is to allow the human's rough judgments to be refined, reinforced and adjusted by the statistics of the training data, but in a manner that does not permit the data to entirely overwhelm human judgments.
The first step in this approach is for a human expert to construct by hand a rule $p$ mapping each instance $x$ to an estimated probability $p(x) \in [0, 1]$ that is interpreted as the guessed probability that instance $x$ will appear with label $+1$. There are various methods for constructing such a function $p$, and the hope is that this difficult-to-build function need not be highly accurate for the approach to be effective.
Rochery et al.'s basic idea is to replace the logistic loss function in Eq. (10) with one that incorporates prior knowledge, namely,

$$\sum_i \ln\left(1 + e^{-y_i f(x_i)}\right) + \eta \sum_i \mathrm{RE}\left(p(x_i) \,\Big\|\, \frac{1}{1 + e^{-f(x_i)}}\right)$$

where $\mathrm{RE}(p \,\|\, q) = p \ln(p/q) + (1 - p) \ln((1 - p)/(1 - q))$ is binary relative entropy. The first term is the same as that in Eq. (10). The second term gives a measure of the distance from the model built by boosting to the human's model. Thus, we balance the conditional likelihood of the data against the distance from our model to the human's model. The relative importance of the two terms is controlled by the parameter $\eta$.
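A sketch of this combined loss (the function and variable names are ours; the hand-built rule is written $p(x)$ and its values are supplied directly as numbers):

```python
import math

def bin_re(p, q):
    """Binary relative entropy RE(p || q) = p ln(p/q) + (1-p) ln((1-p)/(1-q))."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

def knowledge_loss(fs, ys, priors, eta):
    """Logistic loss of Eq. (10) plus an eta-weighted distance to the human
    model: sum_i ln(1 + e^(-y_i f(x_i)))
           + eta * sum_i RE(p(x_i) || 1/(1 + e^(-f(x_i))))."""
    data_term = sum(math.log1p(math.exp(-y * f)) for f, y in zip(fs, ys))
    model_term = sum(bin_re(p, 1.0 / (1.0 + math.exp(-f)))
                     for f, p in zip(fs, priors))
    return data_term + eta * model_term

# Toy call: two examples with scores f, labels y, and hand-built priors p(x).
loss = knowledge_loss(fs=[1.0, -0.5], ys=[1, -1], priors=[0.9, 0.3], eta=0.5)
```

Setting `eta=0` recovers plain logistic loss, while a large `eta` pins the boosted model's conditional probabilities to the expert's estimates.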
9 Experiments and applications
Practically, AdaBoost has many advantages. It is fast, simple and easy to program. It has no parameters to tune (except for the number of rounds $T$). It requires no prior knowledge about the base learner and so can be flexibly combined with any method for finding base classifiers. Finally, it comes with a set of theoretical guarantees given sufficient data and a base learner that can reliably provide only moderately accurate base classifiers. This is a shift in mind set for the learning-system designer: instead of trying to design a learning algorithm that is accurate over the entire space, we can instead focus on finding base learning algorithms that only need to be better than random.
On the other hand, some caveats are certainly in order. The actual performance of boosting on a particular problem is clearly dependent on the data and the base learner. Consistent with theory, boosting can fail to perform well given insufficient data, overly complex base classifiers or base classifiers that are too weak. Boosting seems to be especially susceptible to noise [18] (more on this below).
AdaBoost has been tested empirically by many researchers, including [4, 18, 21, 40, 49, 59, 73]. For instance, Freund and Schapire [30] tested AdaBoost on a set of UCI benchmark datasets [54] using C4.5 [60] as a base learning algorithm, as well as an algorithm that finds the best "decision stump" or single-test decision tree. Some of the results of these experiments are shown in Fig. 3. As can be seen from this figure, even boosting the weak decision stumps can usually give as good results as C4.5, while boosting C4.5 generally gives the decision-tree algorithm a significant improvement in performance.
Figure 3: Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set of 27 benchmark problems as reported by Freund and Schapire [30]. Each point in each scatterplot shows the test error rate of the two competing algorithms on a single benchmark. The x-coordinate of each point gives the test error rate (in percent) of C4.5 on the given benchmark, and the y-coordinate gives the error rate of boosting stumps (left plot) or boosting C4.5 (right plot). All error rates have been averaged over multiple runs.

In another set of experiments, Schapire and Singer [71] used boosting for text categorization tasks. For this work, base classifiers were used that test on the presence or absence of a word or phrase. Some results of these experiments comparing AdaBoost to four other methods are shown in Fig. 4. In nearly all of these experiments and for all of the performance measures tested, boosting performed as well or significantly better than the other methods tested. As shown in Fig. 5, these experiments also demonstrated the effectiveness of using confidence-rated predictions [70], mentioned in Section 3, as a means of speeding up boosting.
Boosting has also been applied to text filtering [72] and routing [39], "ranking" problems [28], learning problems arising in natural language processing [1, 12, 25, 38, 55, 78], image retrieval [74], medical diagnosis [53], and customer monitoring and segmentation [56, 57].
Rochery et al.'s [64, 65] method of incorporating human knowledge into boosting, described in Section 8, was applied to two speech categorization tasks. In this case, the prior knowledge took the form of a set of hand-built rules mapping keywords to predicted categories. The results are shown in Fig. 6.
Figure 4: Comparison of error rates for AdaBoost and four other text categorization methods (naive Bayes, probabilistic TF-IDF, Rocchio and sleeping experts) as reported by Schapire and Singer [71]. The algorithms were tested on two text corpora, Reuters newswire articles (left) and AP newswire headlines (right), with varying numbers of class labels as indicated on the x-axis of each figure.

The final classifier produced by AdaBoost when used, for instance, with a decision-tree base learning algorithm, can be extremely complex and difficult to comprehend. With greater care, a more human-understandable final classifier can be obtained using boosting. Cohen and Singer [11] showed how to design a base learning algorithm that, when combined with AdaBoost, results in a final classifier consisting of a relatively small set of rules similar to those generated by systems like RIPPER [10], IREP [36] and C4.5rules [60]. Cohen and Singer's system, called SLIPPER, is fast, accurate and produces quite compact rule sets. In other work, Freund and Mason [29] showed how to apply boosting to learn a generalization of decision trees called "alternating trees." Their algorithm produces a single alternating tree rather than an ensemble of trees as would be obtained by running AdaBoost on top of a decision-tree learning algorithm. On the other hand, their learning algorithm achieves error rates comparable to those of a whole ensemble of trees.
A nice property of AdaBoost is its ability to identify outliers, i.e., examples that are either mislabeled in the training data, or that are inherently ambiguous and hard to categorize. Because AdaBoost focuses its weight on the hardest examples, the examples with the highest weight often turn out to be outliers. An example of this phenomenon can be seen in Fig. 7 taken from an OCR experiment conducted by Freund and Schapire [30].
Whenthenumberof outliersis very large,theemphasisplacedonthehardex-amplescanbecomedetrimentalto theperformanceof AdaBoost.Thiswasdemon-stratedvery convincingly by Dietterich[18]. Friedman,HastieandTibshirani[34]suggestedavariantof AdaBoost,called“GentleAdaBoost”thatputslessemphasison outliers. Ratsch,OnodaandMuller [61] show how to regularizeAdaBoosttohandlenoisydata.Freund[27] suggestedanotheralgorithm,called“BrownBoost,”thattakesamoreradicalapproachthatde-emphasizesoutlierswhenit seemsclearthat they are“too hard” to classifycorrectly. This algorithm,which is anadaptive
[Figure: two panels plotting % error against the number of rounds (1 to 10000, log scale).]

Figure 5: Comparison of the training (left) and test (right) error using three boosting methods on a six-class text classification problem from the TREC-AP collection, as reported by Schapire and Singer [70, 71]. Discrete AdaBoost.MH and discrete AdaBoost.MR are multiclass versions of AdaBoost that require binary ({-1,+1}-valued) base classifiers, while real AdaBoost.MH is a multiclass version that uses "confidence-rated" (i.e., real-valued) base classifiers.
version of Freund's [26] "boost-by-majority" algorithm, demonstrates an intriguing connection between boosting and Brownian motion.
10 Conclusion
In this overview, we have seen that there have emerged a great many views or interpretations of AdaBoost. First and foremost, AdaBoost is a genuine boosting algorithm: given access to a true weak learning algorithm that always performs a little bit better than random guessing on every distribution over the training set, we can prove arbitrarily good bounds on the training error and generalization error of AdaBoost.
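This claim about the training error can be made quantitative by the standard AdaBoost bound (in the usual notation, where round t's weak hypothesis has weighted error ε_t = 1/2 - γ_t):

```latex
\Pr_{i}\bigl[H(x_i) \neq y_i\bigr]
  \;\le\; \prod_{t} 2\sqrt{\epsilon_t(1-\epsilon_t)}
  \;=\; \prod_{t} \sqrt{1 - 4\gamma_t^2}
  \;\le\; \exp\Bigl(-2\sum_{t} \gamma_t^2\Bigr).
```

Thus a uniform edge γ_t ≥ γ > 0 over random guessing on every round drives the training error of the combined classifier H down exponentially fast in the number of rounds.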
Besides this original view, AdaBoost has been interpreted as a procedure based on functional gradient descent, as an approximation of logistic regression and as a repeated-game playing algorithm. AdaBoost has also been shown to be related to many other topics, such as game theory and linear programming, Bregman distances, support-vector machines, Brownian motion, logistic regression and maximum-entropy methods such as iterative scaling.
All of these connections and interpretations have greatly enhanced our understanding of boosting and contributed to its extension in ever more practical directions, such as to logistic regression and other loss-minimization problems, to multiclass problems, to incorporate regularization and to allow the integration of prior background knowledge.
[Figure: both panels plot classification accuracy (%) against the number of training sentences/examples, with three curves: data, knowledge, and knowledge + data.]

Figure 6: Comparison of percent classification accuracy on two spoken-language tasks ("How may I help you" on the left and "Help desk" on the right) as a function of the number of training examples using data and knowledge separately or together, as reported by Rochery et al. [64, 65].
We also have discussed a few of the growing number of applications of AdaBoost to practical machine learning problems, such as text and speech categorization.
References
[1] Steven Abney, Robert E. Schapire, and Yoram Singer. Boosting applied to tagging and PP attachment. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

[2] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.

[3] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, March 1998.

[4] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1/2):105–139, 1999.

[5] Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Computation, 1(1):151–160, 1989.

[6] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929–965, October 1989.
Figure 7: A sample of the examples that have the largest weight on an OCR task as reported by Freund and Schapire [30]. These examples were chosen after 4 rounds of boosting (top line), 12 rounds (middle) and 25 rounds (bottom). Underneath each image is a line of the form d : ℓ1/w1, ℓ2/w2, where d is the label of the example, ℓ1 and ℓ2 are the labels that get the highest and second highest vote from the combined classifier at that point in the run of the algorithm, and w1, w2 are the corresponding normalized scores.
[7] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–152, 1992.

[8] Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.

[9] Leo Breiman. Prediction games and arcing classifiers. Neural Computation, 11(7):1493–1517, 1999.

[10] William Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, 1995.

[11] William W. Cohen and Yoram Singer. A simple, fast, and effective rule learner. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.

[12] Michael Collins. Discriminative reranking for natural language parsing. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

[13] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, to appear.

[14] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, September 1995.
[15] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.

[16] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):1–13, April 1997.

[17] Ayhan Demiriz, Kristin P. Bennett, and John Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1/2/3):225–254, 2002.

[18] Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–158, 2000.

[19] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, January 1995.

[20] Harris Drucker. Improving regressors using boosting techniques. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 107–115, 1997.

[21] Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems 8, pages 479–485, 1996.

[22] Harris Drucker, Robert Schapire, and Patrice Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):705–719, 1993.

[23] Nigel Duffy and David Helmbold. Potential boosters? In Advances in Neural Information Processing Systems 11, 1999.

[24] Nigel Duffy and David Helmbold. Boosting methods for regression. Machine Learning, 49(2/3), 2002.

[25] Gerard Escudero, Lluís Màrquez, and German Rigau. Boosting applied to word sense disambiguation. In Proceedings of the 12th European Conference on Machine Learning, pages 129–141, 2000.

[26] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.

[27] Yoav Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293–318, June 2001.

[28] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference, 1998.

[29] Yoav Freund and Llew Mason. The alternating decision tree learning algorithm. In Machine Learning: Proceedings of the Sixteenth International Conference, pages 124–133, 1999.
[30] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156, 1996.

[31] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325–332, 1996.

[32] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.

[33] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
[34] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–374, April 2000.
[35] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), October 2001.

[36] Johannes Fürnkranz and Gerhard Widmer. Incremental reduced error pruning. In Machine Learning: Proceedings of the Eleventh International Conference, pages 70–77, 1994.

[37] Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.

[38] Masahiko Haruno, Satoshi Shirai, and Yoshifumi Ooyama. Using decision trees to construct a practical parser. Machine Learning, 34:131–149, 1999.

[39] Raj D. Iyer, David D. Lewis, Robert E. Schapire, Yoram Singer, and Amit Singhal. Boosting for document routing. In Proceedings of the Ninth International Conference on Information and Knowledge Management, 2000.

[40] Jeffrey C. Jackson and Mark W. Craven. Learning sparse perceptrons. In Advances in Neural Information Processing Systems 8, pages 654–660, 1996.

[41] Michael Kearns and Leslie G. Valiant. Learning Boolean formulae or finite automata is as hard as factoring. Technical Report TR-14-88, Harvard University Aiken Computation Laboratory, August 1988.

[42] Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the Association for Computing Machinery, 41(1):67–95, January 1994.

[43] Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 134–144, 1999.
[44] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), February 2002.

[45] Vladimir Koltchinskii, Dmitriy Panchenko, and Fernando Lozano. Further explanation of the effectiveness of voting methods: The game between margins and weights. In Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory, pages 241–255, 2001.

[46] Vladimir Koltchinskii, Dmitriy Panchenko, and Fernando Lozano. Some new bounds on the generalization error of combined classifiers. In Advances in Neural Information Processing Systems 13, 2001.

[47] John Lafferty. Additive models, boosting and inference for generalized divergences. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 125–133, 1999.

[48] Guy Lebanon and John Lafferty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems 14, 2002.

[49] Richard Maclin and David Opitz. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 546–551, 1997.

[50] Llew Mason, Peter Bartlett, and Jonathan Baxter. Direct optimization of margins improves generalization in combined classifiers. In Advances in Neural Information Processing Systems 12, 2000.

[51] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Alexander J. Smola, Peter J. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.

[52] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12, 2000.

[53] Stefano Merler, Cesare Furlanello, Barbara Larcher, and Andrea Sboner. Tuning cost-sensitive boosting and its application to melanoma diagnosis. In Multiple Classifier Systems: Proceedings of the 2nd International Workshop, pages 32–42, 2001.

[54] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1999. www.ics.uci.edu/~mlearn/MLRepository.html.

[55] Pedro J. Moreno, Beth Logan, and Bhiksha Raj. A boosting approach for confidence scoring. In Proceedings of the 7th European Conference on Speech Communication and Technology, 2001.

[56] Michael C. Mozer, Richard Wolniewicz, David B. Grimes, Eric Johnson, and Howard Kaushansky. Predicting subscriber dissatisfaction and improving retention in the wireless telecommunications industry. IEEE Transactions on Neural Networks, 11:690–696, 2000.
[57] Takashi Onoda, Gunnar Rätsch, and Klaus-Robert Müller. Applying support vector machines and boosting to a non-intrusive monitoring system for household electric appliances with inverters. In Proceedings of the Second ICSC Symposium on Neural Computation, 2000.

[58] Dmitriy Panchenko. New zero-error bounds for voting algorithms. Unpublished manuscript, 2001.

[59] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725–730, 1996.

[60] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[61] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.

[62] Gunnar Rätsch, Manfred Warmuth, Sebastian Mika, Takashi Onoda, Steven Lemm, and Klaus-Robert Müller. Barrier boosting. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 170–179, 2000.

[63] Greg Ridgeway, David Madigan, and Thomas Richardson. Boosting methodology for regression problems. In Proceedings of the International Workshop on AI and Statistics, pages 152–161, 1999.

[64] M. Rochery, R. Schapire, M. Rahim, N. Gupta, G. Riccardi, S. Bangalore, H. Alshawi, and S. Douglas. Combining prior knowledge and boosting for call classification in spoken language dialogue. Unpublished manuscript, 2001.

[65] Marie Rochery, Robert Schapire, Mazin Rahim, and Narendra Gupta. BoosTexter for text categorization in spoken language dialogue. Unpublished manuscript, 2001.

[66] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[67] Robert E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 313–321, 1997.

[68] Robert E. Schapire. Drifting games. Machine Learning, 43(3):265–291, June 2001.

[69] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, October 1998.

[70] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, December 1999.

[71] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, May/June 2000.

[72] Robert E. Schapire, Yoram Singer, and Amit Singhal. Boosting and Rocchio applied to text filtering. In Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval, 1998.
[73] Holger Schwenk and Yoshua Bengio. Training methods for adaptive boosting of neural networks. In Advances in Neural Information Processing Systems 10, pages 647–653, 1998.

[74] Kinh Tieu and Paul Viola. Boosting image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.

[75] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.

[76] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264–280, 1971.

[77] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

[78] Marilyn A. Walker, Owen Rambow, and Monica Rogati. SPoT: A trainable sentence planner. In Proceedings of the 2nd Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2001.