
CS6355: Structured Prediction

From Binary to Multiclass Classification

We have seen binary classification

• We have seen linear models
• Learning algorithms
  – Perceptron
  – SVM
  – Logistic Regression
• Prediction is simple
  – Given an example $\mathbf{x}$, output = $\text{sgn}(\mathbf{w}^T \mathbf{x})$
  – The output is a single bit

What if we have more than two labels?

Reading for next lecture:

Erin L. Allwein, Robert E. Schapire, Yoram Singer, Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, ICML 2000.

Multiclass classification

• Introduction
• Combining binary classifiers
  – One-vs-all
  – All-vs-all
  – Error correcting codes
• Training a single classifier
  – Multiclass SVM
  – Constraint classification

Where are we?

• Introduction
• Combining binary classifiers
  – One-vs-all
  – All-vs-all
  – Error correcting codes
• Training a single classifier
  – Multiclass SVM
  – Constraint classification

What is multiclass classification?

• An input can belong to one of K classes
• Training data: examples associated with a class label (a number from 1 to K)
• Prediction: Given a new input, predict its class label

Each input belongs to exactly one class. Not more, not less.
• Otherwise, the problem is not multiclass classification
• If an input can be assigned multiple labels (think tags for emails rather than folders), it is called multi-label classification

Example applications: Images

• Input: a hand-written character; Output: which character is it?
  [Figure: several hand-written glyphs that all map to the letter A]
• Input: a photograph of an object; Output: which of a set of categories of objects is it?
  – E.g., the Caltech 256 dataset
  [Figure: example objects labeled car tire, car tire, duck, laptop]

Example applications: Language

• Input: a news article; Output: which section of the newspaper it should be in
• Input: an email; Output: which folder the email should be placed into
• Input: an audio command given to a car; Output: which of a set of actions should be executed

Where are we?

• Introduction
• Combining binary classifiers
  – One-vs-all
  – All-vs-all
  – Error correcting codes
• Training a single classifier
  – Multiclass SVM
  – Constraint classification

Binary to multiclass

• Can we use an algorithm for training binary classifiers to construct a multiclass classifier?
  – Answer: Decompose the prediction into multiple binary decisions
• How to decompose?
  – One-vs-all
  – All-vs-all
  – Error correcting codes

General setting

• Input: $\mathbf{x} \in \Re^n$
  – The inputs are represented by their feature vectors
• Output: $\mathbf{y} \in \{1, 2, \cdots, K\}$
  – These classes represent domain-specific labels
• Learning: Given a dataset $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}$
  – We need a learning algorithm that uses D to construct a function that can predict the label $\mathbf{y}$ for an input $\mathbf{x}$
  – Goal: find a predictor that does well on the training data and has low generalization error
• Prediction/Inference: Given an example $\mathbf{x}$ and the learned function, compute the class label for $\mathbf{x}$

1. One-vs-all classification

$\mathbf{x} \in \Re^n$, $\mathbf{y} \in \{1, 2, \cdots, K\}$

• Assumption: Each class is individually separable from all the others

• Learning: Given a dataset $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}$
  – Decompose into K binary classification tasks
  – For class k, construct a binary classification task as:
    • Positive examples: Elements of D with label k
    • Negative examples: All other elements of D
  – Train K binary classifiers $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$ using any learning algorithm we have seen

• Prediction: "Winner Takes All"
  $\arg\max_i \mathbf{w}_i^T \mathbf{x}$

Question: What is the dimensionality of each $\mathbf{w}_i$?
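To make the decomposition concrete, here is a minimal sketch (not from the lecture) of one-vs-all training and Winner-Takes-All prediction. It assumes labels are renumbered 0 to K−1 and a hypothetical `train_binary(X, signs)` routine standing in for any of the binary learners above (perceptron, SVM, logistic regression) that returns a weight vector.

```python
import numpy as np

def train_one_vs_all(X, y, K, train_binary):
    """Train K one-vs-all classifiers; classifier k separates label k from the rest.

    X: (m, n) feature matrix; y: length-m integer labels in {0, ..., K-1};
    train_binary(X, signs) -> length-n weight vector, where signs are in {+1, -1}.
    """
    W = np.zeros((K, X.shape[1]))
    for k in range(K):
        signs = np.where(y == k, 1.0, -1.0)  # elements of D with label k are positive
        W[k] = train_binary(X, signs)
    return W

def predict_one_vs_all(W, x):
    """Winner Takes All: argmax_i of w_i^T x."""
    return int(np.argmax(W @ x))
```

Note that each of the K classifiers is trained independently; nothing ties their scores to a common scale, which is exactly the calibration issue flagged in the summary below.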

Visualizing one-vs-all

From the full dataset, construct three binary classifiers, one for each class:
• $\mathbf{w}_{blue}^T \mathbf{x} > 0$ for blue inputs
• $\mathbf{w}_{red}^T \mathbf{x} > 0$ for red inputs
• $\mathbf{w}_{green}^T \mathbf{x} > 0$ for green inputs

Notation: $\mathbf{w}_{blue}^T \mathbf{x}$ is the score for the blue label.

[Figure: a three-class dataset with the three one-vs-all separators]

Winner Takes All will predict the right answer: only the correct label will have a positive score.

One-vs-all may not always work

Black points are not separable from the rest with a single binary classifier. The decomposition will not work for these cases!

• $\mathbf{w}_{blue}^T \mathbf{x} > 0$ for blue inputs
• $\mathbf{w}_{red}^T \mathbf{x} > 0$ for red inputs
• $\mathbf{w}_{green}^T \mathbf{x} > 0$ for green inputs
• For black inputs: ???

One-vs-all classification: Summary

• Easy to learn
  – Use any binary classifier learning algorithm
• Problems
  – No theoretical justification
  – Calibration issues
    • We are comparing scores produced by K classifiers trained independently. There is no reason for the scores to be in the same numerical range!
  – Might not always work
    • Yet, it works fairly well in many cases, especially if the underlying binary classifiers are tuned and regularized

2. All-vs-all classification

Sometimes called one-vs-one. $\mathbf{x} \in \Re^n$, $\mathbf{y} \in \{1, 2, \cdots, K\}$

• Assumption: Every pair of classes is separable

• Learning: Given a dataset $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}$
  – For every pair of labels (j, k), create a binary classifier with:
    • Positive examples: All examples with label j
    • Negative examples: All examples with label k
  – Train $\binom{K}{2} = \frac{K(K-1)}{2}$ classifiers to separate every pair of labels from each other

• Prediction: More complex; each label gets K − 1 votes
  – How to combine the votes? Many methods:
    • Majority: Pick the label with the maximum number of votes
    • Organize a tournament between the labels
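A minimal sketch of all-vs-all with majority voting, under the same assumptions as the one-vs-all sketch above (0-based labels and the hypothetical `train_binary` routine):

```python
import numpy as np
from itertools import combinations

def train_all_vs_all(X, y, K, train_binary):
    """Train K(K-1)/2 binary classifiers, one per unordered label pair (j, k)."""
    pairwise = {}
    for j, k in combinations(range(K), 2):
        mask = (y == j) | (y == k)                 # keep only the two labels; ignore the rest
        signs = np.where(y[mask] == j, 1.0, -1.0)  # label j positive, label k negative
        pairwise[(j, k)] = train_binary(X[mask], signs)
    return pairwise

def predict_all_vs_all(pairwise, K, x):
    """Majority vote: each pairwise classifier votes for one of its two labels."""
    votes = np.zeros(K)
    for (j, k), w in pairwise.items():
        votes[j if w @ x > 0 else k] += 1
    return int(np.argmax(votes))  # ties broken by lowest label index -- ad hoc!
```

The tie-breaking in `np.argmax` (lowest label index wins) is exactly the kind of ad hoc choice the next slide warns about.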

All-vs-all classification

• Every pair of labels is linearly separable here
  – When a pair of labels is considered, all others are ignored
• Problems
  1. O(K²) weight vectors to train and store
  2. The size of the training set for a pair of labels could be very small, leading to overfitting of the binary classifiers
  3. Prediction is often ad hoc and might be unstable
     E.g.: What if two classes get the same number of votes? For a tournament, what is the sequence in which the labels compete?

3. Error correcting output codes (ECOC)

• Each binary classifier provides one bit of information
• With K labels, we only need $\log_2 K$ bits to represent the label
  – One-vs-all uses K bits (one per classifier)
  – All-vs-all uses O(K²) bits
• Can we get by with O(log K) classifiers?
  – Yes! Encode each label as a binary string
  – Or alternatively, if we do train more than O(log K) classifiers, can we use the redundancy to improve classification accuracy?

Using $\log_2 K$ classifiers

8 classes, code length = 3:

label #   Code
0         0 0 0
1         0 0 1
2         0 1 0
3         0 1 1
4         1 0 0
5         1 0 1
6         1 1 0
7         1 1 1

• Learning:
  – Represent each label by a bit string (i.e., its code)
  – Train one binary classifier for each bit
• Prediction:
  – Use the predictions from all the classifiers to create a $\log_2 K$-bit string that uniquely decides the output
  – Example: For some example, if the three classifiers predict 0, 1 and 1, then the label is 3
• What could go wrong here?
  – Even if only one of the classifiers makes a mistake, the final prediction is wrong!

Error correcting output coding

Answer: Use redundancy.

• Assign a binary string to each label
  – Could be random
  – The length L of the code words, with L ≥ $\log_2 K$, is a parameter
• Train one binary classifier for each bit
  – Effectively, this splits the data into random dichotomies
  – We need only $\log_2 K$ bits; the additional bits act as an error correcting code

8 classes, code length = 5:

#   Code
0   0 0 0 0 0
1   0 0 1 1 0
2   0 1 0 1 1
3   0 1 1 0 1
4   1 0 0 1 1
5   1 0 1 0 0
6   1 1 0 0 0
7   1 1 1 1 1

How to predict?

• Prediction
  – Run all L binary classifiers on the example
  – This gives a predicted bit string of length L
  – Output = the label whose code word is "closest" to the prediction
  – "Closest" is defined using the Hamming distance
  – A longer code length is better: more error correction
• Example
  – Suppose the binary classifiers predict 11010 for the 8-class, length-5 code above
  – The closest label to this is 6, with code word 11000

One-vs-all is a special case of this scheme. How?
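Here is a minimal sketch of ECOC training and Hamming-distance decoding, again assuming 0-based labels and the hypothetical `train_binary` from the earlier sketches:

```python
import numpy as np

def train_ecoc(X, y, codes, train_binary):
    """codes: (K, L) 0/1 matrix, one code word per label.
    Train one binary classifier per bit (column) of the code matrix."""
    L = codes.shape[1]
    W = []
    for bit in range(L):
        signs = np.where(codes[y, bit] == 1, 1.0, -1.0)  # the dichotomy for this bit
        W.append(train_binary(X, signs))
    return np.array(W)                                    # (L, n) stacked classifiers

def predict_ecoc(W, codes, x):
    """Predict each bit, then output the label whose code word is closest to the
    predicted bit string in Hamming distance."""
    bits = (W @ x > 0).astype(int)          # predicted bit string of length L
    dists = np.sum(codes != bits, axis=1)   # Hamming distance to every code word
    return int(np.argmin(dists))
```

Running the decoder with the 8-class, length-5 code table above on the predicted string 11010 returns label 6, matching the example.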

Error correcting codes: Discussion

• Assumes that the columns are independent
  – Otherwise, the encoding is ineffective
• Strong theoretical results that depend on the code length
  – If the minimal Hamming distance between two rows is d, then the prediction can correct up to (d − 1)/2 errors in the binary predictions
• The code assignment could be random, or designed for the dataset/task
• One-vs-all and all-vs-all are special cases
  – All-vs-all needs a ternary code (not binary)
  – Exercise: Convince yourself that this is correct
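A quick sanity check of the (d − 1)/2 claim, using the standard triangle-inequality argument (not from the slides): if the predicted string $\hat{c}$ differs from the true code word $c$ in at most $\lfloor (d-1)/2 \rfloor$ bits, then for any other code word $c'$,

$$\operatorname{dist}(\hat{c}, c') \;\ge\; \operatorname{dist}(c, c') - \operatorname{dist}(\hat{c}, c) \;\ge\; d - \left\lfloor \frac{d-1}{2} \right\rfloor \;>\; \left\lfloor \frac{d-1}{2} \right\rfloor \;\ge\; \operatorname{dist}(\hat{c}, c),$$

so the true code word remains the unique nearest one under Hamming decoding.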

Decomposition methods: Summary

• General idea
  – Decompose the multiclass problem into many binary problems
  – We know how to train binary classifiers
  – Prediction depends on the decomposition
    • Constructs the multiclass label from the output of the binary classifiers
• Learning optimizes local correctness
  – Each binary classifier does not need to be globally correct
    • That is, the classifiers do not have to agree with each other
  – The learning algorithm is not even aware of the prediction procedure!
• A poor decomposition gives poor performance
  – Difficult local problems can be "unnatural"
    • E.g., for ECOC, why should the binary problems be separable?

Where are we?

• Introduction
• Combining binary classifiers
  – One-vs-all
  – All-vs-all
  – Error correcting codes
• Training a single classifier
  – Multiclass SVM
  – Constraint classification

Motivation

• Decomposition methods
  – Do not account for how the final predictor will be used
  – Do not optimize any global measure of correctness
• Goal: To train a multiclass classifier that is "global"

Recall: Margin for binary classifiers

The margin of a hyperplane for a dataset: the distance between the hyperplane and the data point nearest to it.

[Figure: positive and negative points separated by a hyperplane, with the margin measured with respect to this hyperplane]

Multiclass margin

Defined as the score difference between the highest scoring label and the second one.

[Figure: bar chart of the score $\mathbf{w}_{label}^T \mathbf{x}$ for each of the labels blue, red, green and black; the multiclass margin is the gap between the two highest bars]
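As a sketch, with W the matrix of stacked per-label weight vectors from the earlier code, the multiclass margin of a single example is:

```python
import numpy as np

def multiclass_margin(W, x):
    """Score difference between the highest and second-highest scoring labels,
    where row i of W is the weight vector for label i."""
    scores = np.sort(W @ x)
    return float(scores[-1] - scores[-2])
```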

Multiclass SVM (Intuition)

• Recall: Binary SVM
  – Maximize margin
  – Equivalently, minimize the norm of the weights such that the closest points to the hyperplane have a score of ±1
• Multiclass SVM
  – Each label has a different weight vector (like one-vs-all)
  – Maximize the multiclass margin
  – Equivalently, minimize the total norm of the weights such that the true label is scored at least 1 more than the second best one

Multiclass SVM in the separable case

Recall the hard binary SVM: minimize $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ subject to every example being on the correct side of the hyperplane with a score of at least 1. The multiclass analogue is:

$$\min_{\mathbf{w}_1, \cdots, \mathbf{w}_K} \frac{1}{2} \sum_k \mathbf{w}_k^T \mathbf{w}_k$$
$$\text{s.t.} \quad \mathbf{w}_{\mathbf{y}_i}^T \mathbf{x}_i - \mathbf{w}_k^T \mathbf{x}_i \ge 1 \quad \text{for every } (\mathbf{x}_i, \mathbf{y}_i) \in D \text{ and every label } k \ne \mathbf{y}_i$$

• The objective is the size of the weights: effectively, a regularizer
• The constraints require the score for the true label to be higher than the score for any other label by 1

Problems with this? What if there is no set of weights that achieves this separation? That is, what if the data is not linearly separable?

Multiclass SVM: General case

Introduce slack variables $\xi_i$, one per example:

$$\min_{\mathbf{w}_1, \cdots, \mathbf{w}_K, \xi} \frac{1}{2} \sum_k \mathbf{w}_k^T \mathbf{w}_k + C \sum_i \xi_i$$
$$\text{s.t.} \quad \mathbf{w}_{\mathbf{y}_i}^T \mathbf{x}_i - \mathbf{w}_k^T \mathbf{x}_i \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0 \quad \text{for every } (\mathbf{x}_i, \mathbf{y}_i) \in D \text{ and every label } k \ne \mathbf{y}_i$$

• The first term is the size of the weights: effectively, a regularizer
• The second term is the total slack: don't allow too many examples to violate the margin constraint
• The constraints require the score for the true label to be higher than the score for any other label by 1 − $\xi_i$
• The $\xi_i$ are slack variables: not all examples need to satisfy the margin constraint
• Slack variables can only be positive

Solving this constrained problem is equivalent to solving the unconstrained problem

$$\min_{\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_K} \frac{1}{2} \sum_k \mathbf{w}_k^T \mathbf{w}_k + C \sum_{(\mathbf{x}_i, \mathbf{y}_i) \in D} \max\left(0, \; \max_{k \ne \mathbf{y}_i} \mathbf{w}_k^T \mathbf{x}_i - \mathbf{w}_{\mathbf{y}_i}^T \mathbf{x}_i + 1\right)$$

Why?

In the unconstrained form:
• The first term (the size of the weights) is effectively a regularizer
• The second term is the multiclass hinge loss
• C is the tradeoff hyperparameter

Multiclass SVM

• Generalizes the binary SVM algorithm
  – If we have only two classes, this reduces to the binary SVM (up to scale)
• Comes with similar generalization guarantees as the binary SVM
• Can be trained using different optimization methods
  – Stochastic sub-gradient descent can be generalized
    • Try as an exercise; one possible sketch follows
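A minimal sketch of the multiclass hinge loss and one stochastic subgradient step on the unconstrained objective above. This is one possible answer to the exercise, not the lecture's solution; the (K, n) weight-matrix layout, learning rate `lr`, and function names are illustrative assumptions.

```python
import numpy as np

def multiclass_hinge(W, x, y):
    """max(0, max_{k != y} w_k^T x - w_y^T x + 1) for a single example (x, y)."""
    scores = W @ x
    margins = scores - scores[y] + 1.0  # violation amount for each competing label
    margins[y] = 0.0                    # the true label itself contributes no loss
    return float(np.max(margins))       # already >= 0 since margins[y] == 0

def sgd_step(W, x, y, C, lr):
    """One stochastic subgradient step on (1/2) sum_k w_k^T w_k + C * hinge."""
    scores = W @ x
    margins = scores - scores[y] + 1.0
    margins[y] = 0.0
    k = int(np.argmax(margins))         # the most violating label
    grad = W.copy()                     # gradient of the regularizer is W itself
    if margins[k] > 0:                  # hinge is active: add the loss subgradient
        grad[k] += C * x
        grad[y] -= C * x
    return W - lr * grad
```

When the hinge is active, the step raises the true label's weight vector along x and lowers the most violating label's, which is exactly the "score the true label at least 1 higher" intuition.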

Multiclass SVM: Summary

• Training:
  – Optimize the SVM objective
• Prediction:
  – Winner takes all: $\arg\max_i \mathbf{w}_i^T \mathbf{x}$
• With K labels and inputs in $\Re^n$, we have nK weights in all
  – Same as one-vs-all
  – But comes with guarantees!

Questions?

Where are we?

• Introduction
• Combining binary classifiers
  – One-vs-all
  – All-vs-all
  – Error correcting codes
• Training a single classifier
  – Multiclass SVM
  – Constraint classification

Let us examine one-vs-all again

• Training:
  – Create K binary classifiers $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K$
  – $\mathbf{w}_i$ separates class i from all others
• Prediction: $\arg\max_i \mathbf{w}_i^T \mathbf{x}$
• Observations:
  1. At training time, we require $\mathbf{w}_i^T \mathbf{x}$ to be positive for examples of class i.
  2. Really, all we need is for $\mathbf{w}_i^T \mathbf{x}$ to be larger than all the others. The requirement of being positive is more strict.

Linear separability with multiple classes

For examples with label i, we want $\mathbf{w}_i^T \mathbf{x} > \mathbf{w}_j^T \mathbf{x}$ for all j.

Rewrite the inputs and the weight vector:
• Stack all the weight vectors into one nK-dimensional vector $\mathbf{w} = (\mathbf{w}_1, \cdots, \mathbf{w}_K)$
• Define a feature vector for label i being associated with input $\mathbf{x}$: $\phi(\mathbf{x}, i)$ has $\mathbf{x}$ in the i-th block and zeros everywhere else

This is called the Kesler construction.
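A minimal sketch of the construction (a hypothetical helper, with 0-based labels):

```python
import numpy as np

def phi(x, i, K):
    """Kesler construction: an nK-dim vector with x in the i-th block, zeros elsewhere."""
    n = x.shape[0]
    out = np.zeros(n * K)
    out[i * n:(i + 1) * n] = x
    return out
```

If w stacks the per-label weight vectors in the same block order, then `w @ phi(x, i, K)` equals $\mathbf{w}_i^T \mathbf{x}$.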

Linear separability with multiple classes

For examples with label i, we want $\mathbf{w}_i^T \mathbf{x} > \mathbf{w}_j^T \mathbf{x}$ for all j. In terms of the stacked weight vector, the equivalent requirement is

$$\mathbf{w}^T \phi(\mathbf{x}, i) > \mathbf{w}^T \phi(\mathbf{x}, j), \quad \text{or} \quad \mathbf{w}^T \left( \phi(\mathbf{x}, i) - \phi(\mathbf{x}, j) \right) > 0,$$

where $\phi(\mathbf{x}, i) - \phi(\mathbf{x}, j)$ has $\mathbf{x}$ in the i-th block and $-\mathbf{x}$ in the j-th block.

For every example (x, i) in the dataset and all other labels j, this gives the following binary task in nK dimensions that should be linearly separable:
• Positive examples: $\phi(\mathbf{x}, i) - \phi(\mathbf{x}, j)$
• Negative examples: $\phi(\mathbf{x}, j) - \phi(\mathbf{x}, i)$

Constraint Classification

• Training:
  – Given a dataset {(x, y)}, create a binary classification task with:
    • Positive examples: $\phi(\mathbf{x}, y) - \phi(\mathbf{x}, y')$
    • Negative examples: $\phi(\mathbf{x}, y') - \phi(\mathbf{x}, y)$
    for every example, for every y′ ≠ y
  – Use your favorite algorithm to train a binary classifier
• Prediction: Given the nK-dimensional weight vector $\mathbf{w}$ and a new example $\mathbf{x}$:
  $\arg\max_y \mathbf{w}^T \phi(\mathbf{x}, y)$

Exercise: What does the perceptron update rule look like in terms of the $\phi$s? Interpret the update step. (One possible sketch follows.)

Note: The binary classification task only expresses preferences over label assignments. This approach extends to training a ranker, and can use partial preferences too; more on this later…
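A minimal sketch of a perceptron pass over the Kesler-expanded data, as one possible reading of the exercise (not the lecture's solution); it reuses the hypothetical `phi` helper from the sketch above.

```python
def constraint_perceptron_epoch(w, data, K):
    """One perceptron pass over the Kesler-expanded dataset: for each (x, y) and
    each wrong label yp, we want w . (phi(x, y) - phi(x, yp)) > 0."""
    for x, y in data:
        for yp in range(K):
            if yp == y:
                continue
            diff = phi(x, y, K) - phi(x, yp, K)
            if w @ diff <= 0:        # the preference "y over yp" is violated
                w = w + diff         # promote label y's block, demote label yp's
    return w
```

Unpacking the update: adding $\phi(\mathbf{x}, y) - \phi(\mathbf{x}, y')$ to $\mathbf{w}$ adds x to the block of the true label y and subtracts x from the block of the violating label y′, i.e., it raises the correct label's score and lowers the competitor's.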

A second look at the multiclass margin

Defined as the score difference between the highest scoring label and the second one. In terms of the Kesler construction, the score for a label is $\mathbf{w}^T \phi(\mathbf{x}, y)$, where y here is the label that has the highest score.

[Figure: the bar chart of per-label scores (blue, red, green, black) from before, with the margin marked between the two highest bars]

Discussion

• The number of weights for multiclass SVM and constraint classification is still nK, the same as one-vs-all and much less than the K(K − 1)/2 weight vectors of all-vs-all
• But both still account for all pairwise label preferences
  – Multiclass SVM via the definition of the learning objective
  – Constraint classification by constructing a binary classification problem
• Both come with theoretical guarantees for generalization
• This is an important idea that remains applicable when we move to arbitrary structures

Questions?

Training multiclass classifiers: Wrap-up

• The label belongs to a set that has more than two elements
• Methods
  – Decomposition into a collection of binary (local) decisions
    • One-vs-all
    • All-vs-all
    • Error correcting codes
  – Training a single (global) classifier
    • Multiclass SVM
    • Constraint classification
• Exercise: Which of these will work for this case?

Questions?

Next steps…

• Build up to structured prediction
  – Multiclass is really a simple structure
• Different aspects of structured prediction
  – Deciding the structure, training, inference
• Sequence models
