
Page 1:

ML4NLP: Learning as Optimization

Dan Goldwasser

CS 590NLP

Purdue University
[email protected]

Page 2:

Reading for next time

Using classification to understand human interactions

Slide taken from D. Jurafsky

Link to reading material in Piazza, under Resources

Page 3:

Classification

• A fundamental machine learning tool
  – Widely applicable in NLP

• Supervised learning: the learner is given a collection of labeled documents
  – Emails: spam/not spam; Reviews: positive/negative

• Build a function mapping documents to labels
  – Key property: generalization
    • The function should work well on new data

Page 4:

Sentiment Analysis

Dude, I just watched this horror flick! Selling points: nightmares scenes, torture scenes, terrible monsters that was so bad a##!

Don’t buy the popcorn it was terrible, the monsters selling it must have wanted to torture me, it was so bad it gave me nightmares!

What should your learning algorithm look at?

Page 5:

Deceptive Reviews

Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al., ACL 2011.

What should your learning algorithm look at?

Page 6:

Power Relations

Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al., WWW 2012.

"Blah. Unacceptable. Blah."

"Your honor, I agree, blah blah blah."

What should your learning algorithm look at?

Page 7:

Naive Bayes

By the chain rule:

P(x1, x2, ..., xn | vj)
  = P(x1 | x2, ..., xn, vj) P(x2, ..., xn | vj)
  = P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3, ..., xn | vj)
  = ...
  = P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) ... P(xn | vj)

V_MAP = argmax_v P(x1, x2, ..., xn | v) P(v)

Assumption: feature values are independent given the target value:

P(x1, x2, ..., xn | vj) = ∏_{i=1}^n P(xi | vj)

Page 8:

Naïve Bayes: practical stuff

• Underflow prevention
  – Multiplying many probabilities will result in floating-point underflow (rounding to zero)
  – Summing log probabilities will solve the problem:

• Feature "tweaking"
  – "Pure BoW" is often too naïve
  – Stemming/normalization (hate, hated, hating, hates)
  – Phrase extraction (e.g., negation, scientific/numeric expressions)
  – Upweighting: increase the counts of "important" words

NB decision: argmax_{c_j} log P(c_j) + Σ_i log P(x_i | c_j)
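As a concrete illustration, here is a minimal sketch of this log-space Naïve Bayes decision rule in Python; the toy corpus, vocabulary, and add-one smoothing are hypothetical choices, not part of the slide.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: (document tokens, label)
docs = [("good great fun".split(), "pos"),
        ("bad awful boring".split(), "neg"),
        ("good fun bad".split(), "pos")]

# Estimate P(c) and P(x|c) by counting
label_counts = defaultdict(int)
word_counts = defaultdict(lambda: defaultdict(int))
vocab = set()
for tokens, label in docs:
    label_counts[label] += 1
    for w in tokens:
        word_counts[label][w] += 1
        vocab.add(w)

def predict(tokens):
    best_label, best_score = None, float("-inf")
    for c, n_c in label_counts.items():
        score = math.log(n_c / len(docs))  # log P(c)
        total = sum(word_counts[c].values())
        for w in tokens:
            # log P(w|c) with add-one smoothing; summing logs avoids underflow
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label

print(predict("good fun".split()))  # expected: "pos" on this toy data
```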

Page 9:

Linear Classifiers

• Linear threshold functions
  – Associate a weight (w_i) with each feature (x_i)
  – Prediction: sign(b + w^T x) = sign(b + Σ_i w_i x_i)  (a small code sketch follows this slide)
    • If b + w^T x ≥ 0, predict y = 1
    • Otherwise, predict y = -1

• NB is a linear threshold function
  – The weight vector (w) is assigned by estimating the label probability, given the model's independence assumptions
• Linear threshold functions are a very popular representation!

• Generative vs. discriminative models
  – Generative models represent the joint probability P(x, y), while discriminative models represent P(y | x) directly
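A minimal sketch of the linear threshold prediction rule sign(b + w^T x) in Python; the weights, bias, and feature vector are made-up values for illustration.

```python
import numpy as np

def predict(w, b, x):
    # Linear threshold function: +1 if b + w^T x >= 0, else -1
    return 1 if b + np.dot(w, x) >= 0 else -1

w = np.array([0.8, -1.5, 0.3])   # hypothetical per-feature weights
b = -0.2                         # hypothetical bias
x = np.array([1.0, 0.0, 2.0])    # hypothetical feature activations
print(predict(w, b, x))          # 0.8 + 0.6 - 0.2 = 1.2 >= 0, so predicts +1
```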

Page 10:

Generative vs. Discriminative

• Language models and NB are examples of generative models

• Generative models capture the joint probability of the inputs and outputs, P(x, y)
  – Most of the early work in statistical NLP is generative: language models, NB, HMMs, Bayesian networks, PCFGs, etc.

• We think of generative models as hidden variables generating the observed data

• Super-easy training = counting!

[Figure: Naïve Bayes as a graphical model, with a hidden label h generating the observed features x1, x2, ..., xk]

Page 11:

Generative vs. Discriminative

• On the other hand...
• We don't care about the joint probability, we care about the conditional probability P(y | x)
• Conditional or discriminative models characterize the decision boundary directly (= the conditional probability)
  – Work really well in practice, easy to incorporate arbitrary features, ...
  – SVM, perceptron, logistic regression, CRF, ...
• Training is harder (we'll see...)

[Figure: discriminative model with label h and observed features x1, x2, ..., xk: logistic regression]

Page 12:

Practical Example

Source: Scaling to Very Very Large Corpora for Natural Language Disambiguation. Michele Banko, Eric Brill. Microsoft Research, Redmond, WA. 2001.

Task: context-sensitive spelling correction, e.g., {principle, principal}, {weather, whether}.

Page 13:

Linear Classifiers

Each point in this space is a document. The coordinates (e.g., x1, x2) are determined by feature activations.

[Figure: 2D scatter of positive (+) and negative (-) documents separated by the hyperplane sign(b + w^T x)]

Learning = adjust the weights (w) to find a good decision boundary

Page 14:

Expressivity

By transforming the feature space, these functions can be made linear.

Represent each point in 2D as (x, x²)

Page 15:

Expressivity

A more realistic scenario: the data is almost linearly separable, except for some noise.

[Figure: 2D scatter of + and - points with a few noisy points on the wrong side of the decision boundary sign(b + w^T x)]

Page 16:

Features

• So far we have discussed the BoW representation
  – In fact, you can use a very rich representation

• Broader definition
  – Functions mapping attributes of the input to a Boolean/categorical/numeric value

• Question: assume that you have a lexicon containing positive and negative sentiment words. How can you use it to improve over BoW?

φ_1(x) = 1 if x_1 is capitalized, 0 otherwise

φ_k(x) = 1 if x contains "good" more than twice, 0 otherwise
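A minimal sketch of such feature functions in Python; the lexicon, feature names, and example document are hypothetical, for illustration only.

```python
POSITIVE_WORDS = {"good", "great", "fun"}    # hypothetical sentiment lexicon
NEGATIVE_WORDS = {"bad", "awful", "boring"}

def phi_capitalized(tokens):
    # 1 if the first token is capitalized, 0 otherwise
    return 1 if tokens and tokens[0][0].isupper() else 0

def phi_good_more_than_twice(tokens):
    # 1 if the document contains "good" more than twice, 0 otherwise
    return 1 if tokens.count("good") > 2 else 0

def phi_lexicon_balance(tokens):
    # A lexicon-based feature that goes beyond plain BoW:
    # 1 if positive lexicon hits outnumber negative hits, 0 otherwise
    pos = sum(1 for w in tokens if w.lower() in POSITIVE_WORDS)
    neg = sum(1 for w in tokens if w.lower() in NEGATIVE_WORDS)
    return 1 if pos > neg else 0

doc = "Good movie , good plot , good cast , not boring".split()
print(phi_capitalized(doc), phi_good_more_than_twice(doc), phi_lexicon_balance(doc))
```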

Page 17:

Perceptron

• One of the earliest learning algorithms
  – Introduced by Rosenblatt (1958) to model neural learning

• Goal: directly search for a separating hyperplane
  – If one exists, the perceptron will find it
  – If not, ...

• Online algorithm
  – Considers one example at a time (NB looks at the entire dataset)

• Error-driven algorithm
  – Updates the weights only when a mistake is made

Page 18:

Perceptron

• We learn f: X → {-1, +1}, represented as f = sgn(w·x)
• where X = {0,1}^n or X = R^n, and w ∈ R^n
• Given labeled examples: {(x1, y1), (x2, y2), ..., (xm, ym)}

1. Initialize w = 0 ∈ R^n

2. Cycle through all examples
   a. Predict the label of instance x to be y' = sgn(w·x)
   b. If y' ≠ y, update the weight vector:
      w = w + r y x   (r: a constant, the learning rate)
      Otherwise, if y' = y, leave the weights unchanged.
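A minimal runnable sketch of this update rule in Python/NumPy, using a tiny hypothetical dataset; the learning rate and epoch count are arbitrary choices.

```python
import numpy as np

def perceptron(X, y, r=1.0, epochs=10):
    """Online, error-driven perceptron: update w only on mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_pred = 1 if np.dot(w, xi) >= 0 else -1
            if y_pred != yi:            # mistake: move w toward the correct side
                w = w + r * yi * xi
    return w

# Toy linearly separable data (last column acts as a bias feature)
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(w, [1 if np.dot(w, xi) >= 0 else -1 for xi in X])
```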

Page 19:

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

Page 20:

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

• The margin of a dataset (𝛾) is the maximum margin possible for that dataset using any weight vector.

Page 21:

Learning using the Perceptron Algorithm

• Perceptron guarantee: finds a linear separator (if the data is separable)

• There could be many models consistent with the training data
  – How does the perceptron algorithm deal with this problem?

• Trading some training error for better generalization
  – This problem is aggravated when we explode the feature space

Page 22:

Perceptron

• We learn f: X → {-1, +1}, represented as f = sgn(w·x)
• where X = {0,1}^n or X = R^n, and w ∈ R^n
• Given labeled examples: {(x1, y1), (x2, y2), ..., (xm, ym)}

1. Initialize w = 0 ∈ R^n

2. Cycle through all examples
   a. Predict the label of instance x to be y' = sgn(w·x)
   b. If y' ≠ y, update the weight vector:
      w = w + r y x   (r: a constant, the learning rate)
      Otherwise, if y' = y, leave the weights unchanged.

How would you tweak the perceptron to make it (1) realistic? (2) "better"?

Page 23:

Learning as Optimization

• Discriminative linear classifiers
  – So far we looked at the perceptron
  – It combines a model (linear representation) with an algorithm (update rule)
  – Let's try to abstract: we want to find a linear function that performs 'the best' on the data
    • What are good properties of this classifier?
    • We want to explicitly control for error + "simplicity"
  – How can we discuss these terms separately from a specific algorithm?
    • Search the space of all possible linear functions
    • Find a specific function that has certain properties...

Page 24:

Learning as Optimization

• Instead, we can think about learning as optimizing an objective function (e.g., minimize mistakes)

• We can incorporate other considerations by modifying the objective function (regularization)

• Sometimes referred to as structural risk minimization

Page 25:

Loss Function

• To formalize performance, let's define a loss function: loss(ŷ, y)
  – where y is the gold label and ŷ is the prediction

• The loss function measures the error on a single instance
  – The specific definition depends on the learning task

Regression:  loss(ŷ, y) = (y − ŷ)²

Binary classification:  loss(ŷ, y) = 0 if y = ŷ, 1 otherwise

Page 26:

0-1 Loss

[Figure: 0-1 loss as a step function of y(w·x): loss = 1 for y(w·x) < 0, loss = 0 for y(w·x) > 0]

loss(y, f(x)) = 0 iff y = ŷ
loss(y, f(x)) = 1 iff y ≠ ŷ

Equivalently, in terms of the product y·f(x):
loss(y·f(x)) = 0 iff y·f(x) > 0 (correct)
loss(y·f(x)) = 1 iff y·f(x) < 0 (misclassified)

Page 27:

The Empirical Risk of f(x)

• The empirical risk of a classifier on a dataset D is its average loss on the items in D

• Realistic learning objective: find an f that minimizes the empirical risk

R_D(f) = (1/|D|) Σ_{i ∈ D} loss(y_i, f(x_i))
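A minimal sketch of computing the empirical risk under the 0-1 loss in Python; the classifier and the toy dataset are hypothetical.

```python
def zero_one_loss(y_true, y_pred):
    return 0 if y_true == y_pred else 1

def empirical_risk(f, data, loss=zero_one_loss):
    # Average loss of classifier f over the dataset D = [(x, y), ...]
    return sum(loss(y, f(x)) for x, y in data) / len(data)

# Hypothetical classifier and 1D toy data
f = lambda x: 1 if x > 0 else -1
D = [(2.0, 1), (-1.0, -1), (0.5, -1), (-3.0, -1)]
print(empirical_risk(f, D))  # 0.25: one of the four points is misclassified
```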

Page 28:

Empirical Risk Minimization

• Learning: given a training dataset D, return the classifier f(x) that minimizes the empirical risk

• Given this definition, we can view learning as a minimization problem
  – The objective function to minimize (the empirical risk) is defined with respect to a specific loss function
  – Our minimization procedure (a.k.a. learning) will be influenced by the choice of loss function
    • Some are easier to minimize than others!

Page 29:

Error Surface

• Linear classifiers: the hypothesis space is parameterized by w
• Error/loss/risk are all functions of w

[Figure: (empirical) error plotted over the hypothesis space, with a local minimum and a global minimum]

Learning: find the global minimum of the error surface

Page 30:

Convex Error Surfaces

• Convex functions have a single minimum point
  – Local minimum = global minimum
  – Easier to optimize

[Figure: a convex (bowl-shaped) error surface over the hypothesis space, with a single global minimum]

Convex error functions have no local minima

Page 31:

Error Surface for Squared Loss

Err(w) = ½ Σ_{d ∈ D} (y_d − ŷ_d)²

We add the ½ for convenience.

ŷ = w^T x

Since y is a constant (for a given dataset), the error function is a quadratic function of w (a paraboloid) ⇒ the squared loss function is convex!

[Figure: paraboloid error surface Err(w)]

How can we find the global minimum?

Page 32:

Gradient Descent Intuition

[Figure: Err(w) plotted against w, with successive iterates w1, w2, w3, w4 moving toward the minimum]

What is the gradient of Err(w) at this point?

(1) The derivative of the function at w1 is the slope of the tangent line → positive slope (increasing). Which direction should we move to decrease the value of Err(w)?

(2) The gradient determines the direction of steepest increase of Err(w), so we go in the opposite direction.

(3) We also need to determine the step size (a.k.a. learning rate). What happens if we overshoot?

Page 33:

Note about the GD step size

• Setting the step size to a very small value
  – Slow convergence rate

• Setting the step size to a large value
  – May oscillate (consistently overshoot the global minimum)

• Tune experimentally
  – More sophisticated algorithms set the value automatically

[Figure: iterates w1, w2, w3, w4 oscillating around the minimum when the step size is too large]

Page 34:

The Gradient of Error(w)

∇Err(w) = [ ∂Err(w)/∂w_0 , ∂Err(w)/∂w_1 , ..., ∂Err(w)/∂w_n ]

The gradient is a vector of partial derivatives.

It indicates the direction of steepest increase in Err(w), for each one of w's coordinates.

The gradient is a generalization of the derivative.

Page 35:

Gradient Descent Updates

• Compute the gradient of the training error at each iteration
  – Batch mode: compute the gradient over all training examples

• Update w:

∇Err(w) = [ ∂Err(w)/∂w_0 , ∂Err(w)/∂w_1 , ..., ∂Err(w)/∂w_n ]

w_{i+1} = w_i − α ∇Err(w_i)

where α > 0 is the learning rate.
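A minimal sketch of batch gradient descent for linear regression with squared loss in Python/NumPy; the toy data, learning rate, and iteration count are hypothetical.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iters=200):
    """Minimize Err(w) = 1/2 * sum_d (y_d - w^T x_d)^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y_hat = X @ w
        # Gradient of the squared loss: sum_d (y_hat_d - y_d) * x_d
        grad = X.T @ (y_hat - y)
        w = w - alpha * grad          # move against the gradient
    return w

# Toy data generated from y = 2*x1 - 3*x2 (no bias term, for simplicity)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([2.0, -3.0, -1.0, 1.0])
print(batch_gradient_descent(X, y))  # should approach [2, -3]
```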

Page 36:

Computing ∇Err(w_i) for Squared Loss

Err(w) = ½ Σ_{d ∈ D} (y_d − f(x_d))²

Differentiating with respect to each weight gives:

∂Err(w)/∂w_i = Σ_{d ∈ D} (f(x_d) − y_d) · x_{d,i}

Page 37:

Batch Update Rule for Each w_i

• Implementing gradient descent: as you go through the training data, accumulate the change in each w_i of w

Page 38:

Gradient Descent for Squared Loss

Page 39:

Batch vs. Online Learning

• The gradient descent algorithm makes updates after going over the entire dataset
  – The dataset can be huge
  – Streaming mode (we cannot assume we saw all the data)
  – Online learning allows "adapting" to changes in the target function

• Stochastic Gradient Descent (a small sketch follows this slide)
  – Similar to GD, but updates after each example
  – Can we make the same convergence assumptions as in GD?

• Variations: update after a subset of examples (mini-batches)
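A minimal sketch of stochastic gradient descent for the same squared-loss setting in Python/NumPy; the shuffling scheme, learning rate, and epoch count are hypothetical choices.

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=50, seed=0):
    """Stochastic gradient descent: update w after every single example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    idx = np.arange(len(y))
    for _ in range(epochs):
        rng.shuffle(idx)                   # visit examples in random order
        for i in idx:
            error = X[i] @ w - y[i]        # per-example gradient of 1/2 * error^2
            w = w - alpha * error * X[i]
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([2.0, -3.0, -1.0, 1.0])
print(sgd(X, y))  # should also approach [2, -3], like batch GD
```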

Page 40:

Stochastic Gradient Descent

Page 41:

Polynomial Regression

• GD is a general optimization algorithm
  – Works for classification (with a different loss function)
  – Incorporate polynomial features to fit a complex model
  – Danger: overfitting!

Page 42:

Regularization

• We want a way to express preferences when searching for a classification function (i.e., learning)

• Many functions can agree with the data, which one should we choose?
  – Intuition: prefer simpler models
  – Sometimes we are even willing to trade a higher error rate for a simpler model (why?)

• Add a regularization term:
  – This is a form of inductive bias

min_w Σ_n loss(y_n, x_n; w) + λ R(w)

How do different λ values impact learning?

Page 43:

Regularization

• A very popular choice of regularization term is to minimize the norm of the weight vector
  – For example: ||w|| = sqrt(Σ_d w_d²)
  – For convenience: ½ squared norm

min_w Σ_n loss(y_n, x_n; w) + (λ/2) ||w||²

• Q: What is the gradient of this regularized loss function?
• At each iteration we subtract λ·w from the weights (the gradient of the regularizer); see the sketch after this list

• In general, we can pick other norms (p-norms)
  – Referred to as the L_p norm: ||w||_p = (Σ_d |w_d|^p)^{1/p}
  – Different p will have a different effect!
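A minimal sketch of a single SGD step with L2 regularization in Python/NumPy, following the update implied above; the learning rate, λ, and example values are hypothetical.

```python
import numpy as np

def sgd_step_l2(w, x, y, alpha=0.05, lam=0.01):
    """One SGD step on 1/2*(w.x - y)^2 + (lam/2)*||w||^2."""
    grad_loss = (w @ x - y) * x     # gradient of the squared loss term
    grad_reg = lam * w              # gradient of (lam/2)*||w||^2 is lam*w
    return w - alpha * (grad_loss + grad_reg)

w = np.zeros(2)
w = sgd_step_l2(w, np.array([1.0, 2.0]), y=3.0)
print(w)  # weights move toward fitting the example, shrunk toward zero by lam
```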

Page 44:

Classification

• So far:
  – A general optimization framework for learning
  – Minimize a regularized loss function
  – Gradient descent is an all-purpose tool
    • We computed the gradient of the squared loss function

• Moving to classification should be very easy
  – Simply replace the loss function

• ...

min_w Σ_n loss(y_n, x_n; w) + (λ/2) ||w||²

Page 45:

Classification

• Can we minimize the empirical error directly?
  – Use the 0-1 loss function

• Problem: it cannot be optimized directly
  – Non-convex and non-differentiable

• Solution: define a smooth loss function that is an upper bound to the 0-1 loss function

Page 46:

Square Loss is an Upper Bound to the 0/1 Loss

Is the square loss a good candidate for a surrogate loss function for the 0-1 loss?

Page 47:

Surrogate Loss Functions

• Surrogate loss function: a smooth approximation to the 0-1 loss
  – An upper bound to the 0-1 loss

Page 48:

Logistic Function

• A smooth version of the threshold function

• Known as the sigmoid/logistic function
  – Smooth transition between 0 and 1

• Can be interpreted as a conditional probability
• Decision boundary
  – y = 1: h(x) > 0.5 ⇔ w^T x ≥ 0
  – y = 0: h(x) < 0.5 ⇔ w^T x < 0

h_w(x) = g(w^T x),   z = w^T x

g(z) = 1 / (1 + e^{-z})   (the sigmoid/logistic function)
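A minimal sketch of the sigmoid and the resulting decision rule in Python/NumPy; the weight vector and input are made-up examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    # h_w(x) = g(w^T x), interpreted as P(y = 1 | x, w)
    return sigmoid(w @ x)

def predict(w, x):
    # Decision boundary: predict 1 iff h(x) > 0.5, i.e. iff w^T x >= 0
    return 1 if predict_proba(w, x) > 0.5 else 0

w = np.array([1.0, -2.0])          # hypothetical weights
x = np.array([3.0, 1.0])           # hypothetical input
print(predict_proba(w, x), predict(w, x))  # ~0.73, predicts 1
```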

Page 49:

Logistic Regression

• Learning: optimize the likelihood of the data
  – Likelihood: the probability of our data under the current parameters
  – For easier optimization, we look at the (negative) log-likelihood

• Cost function
  – If y = 1: -log P(y=1 | x, w) = -log(g(w^T x))
  – If the model gives very high probability to y = 1, what is the cost?
  – If y = 0: -log P(y=0 | x, w) = -log(1 − P(y=1 | x, w)) = -log(1 − g(w^T x))

• Or, more succinctly (with L2 regularization):

Err(w) = − Σ_i [ y_i log h(x_i) + (1 − y_i) log(1 − h(x_i)) ] + (λ/2) ||w||²,   where h(x_i) = g(w^T x_i)

• This function is convex and differentiable
  – Compute the gradient, minimize using gradient descent

Page 50:

Logistic Regression

• Let's consider the stochastic gradient descent rule for LR:

w_j = w_j − α (h(x_i) − y_i) x_{i,j}

• What is the difference compared to:
  – Linear regression?
  – The perceptron?
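A minimal sketch of this per-example update for logistic regression in Python/NumPy, omitting regularization for brevity; the data, learning rate, and epoch count are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sgd(X, y, alpha=0.5, epochs=200):
    """SGD on the logistic (negative log-likelihood) loss, one example at a time."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            h = sigmoid(w @ xi)            # predicted P(y=1 | x)
            w = w - alpha * (h - yi) * xi  # w_j -= alpha * (h(x_i) - y_i) * x_ij
    return w

# Toy data with labels in {0, 1}; the last feature acts as a bias term
X = np.array([[2.0, 1.0], [1.5, 1.0], [-1.0, 1.0], [-2.0, 1.0]])
y = np.array([1, 1, 0, 0])
w = logistic_regression_sgd(X, y)
print(w, sigmoid(X @ w))   # probabilities should be high for the first two rows
```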

Page 51:

Next up: Hinge Loss

• Another popular choice of loss function is the hinge loss
  – We will discuss it in the context of support vector machines (SVMs)

L(y, f(x)) = max(0, 1 − y·f(x))

It's easy to observe that:
(1) The hinge loss is an upper bound to the 0-1 loss
(2) The hinge loss is a good approximation to the 0-1 loss
(3) BUT ... it is not differentiable at y(w^T x) = 1
    Solution: sub-gradient descent
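A minimal sketch of the hinge loss and one sub-gradient step in Python/NumPy; the sub-gradient chosen at the kink (y·f(x) = 1) is one valid option among many, and the learning rate and example values are hypothetical.

```python
import numpy as np

def hinge_loss(w, x, y):
    # y in {-1, +1}; loss = max(0, 1 - y * w.x)
    return max(0.0, 1.0 - y * (w @ x))

def hinge_subgradient_step(w, x, y, alpha=0.1):
    """One sub-gradient descent step on the hinge loss for a single example."""
    if y * (w @ x) < 1.0:
        g = -y * x              # sub-gradient when the margin constraint is violated
    else:
        g = np.zeros_like(w)    # 0 is a valid sub-gradient elsewhere (including the kink)
    return w - alpha * g

w = np.zeros(2)
w = hinge_subgradient_step(w, np.array([1.0, 2.0]), y=1)
print(w, hinge_loss(w, np.array([1.0, 2.0]), 1))  # loss decreases from 1.0 to 0.5
```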

Page 52:

Maximal Margin Classification

Motivation for the notion of maximal margin

Defined w.r.t. a dataset S:

Page 53:

Some Definitions

• Margin: the distance of the closest point from a hyperplane

γ = min_{(x_i, y_i)}  y_i (w^T x_i + b) / ||w||

This is known as the geometric margin; the numerator is known as the functional margin.

• Our new objective function
  – Learning is essentially solving:  max_w γ

Page 54:

Maximal Margin

• We want to find max_w γ

• Observation: we don't care about the magnitude of w!
  (e.g., x1 + x2 = 0, 10x1 + 10x2 = 0, and 100x1 + 100x2 = 0 all define the same hyperplane)
• Set the functional margin to 1, and minimize ||w||
• max_w γ is equivalent to min_w ||w|| in this setting
• To make life easy, let's minimize ||w||²

Page 55:

Hard SVM Optimization

• This leads to a well-defined constrained optimization problem, known as hard SVM:

Minimize: ½ ||w||²
Subject to: ∀ (x, y) ∈ S:  y w^T x ≥ 1

• This is an optimization problem in (n+1) variables, with |S| = m inequality constraints.

Page 56:

Hard vs. Soft SVM

Hard SVM:
Minimize: ½ ||w||²
Subject to: ∀ (x, y) ∈ S:  y w^T x ≥ 1

Soft SVM:
Minimize: ½ ||w||² + c Σ_i ξ_i
Subject to: ∀ (x_i, y_i) ∈ S:  y_i w^T x_i ≥ 1 − ξ_i;  ξ_i ≥ 0, i = 1...m

Page 57:

Objective Function for Soft SVM

• The relaxation of the constraint y_i w^T x_i ≥ 1 can be done by introducing a slack variable ξ_i (per example) and requiring: y_i w^T x_i ≥ 1 − ξ_i, with ξ_i ≥ 0

• Now, we want to solve:
  Min ½ ||w||² + c Σ_i ξ_i   (subject to ξ_i ≥ 0)

• Which can be written as:
  Min ½ ||w||² + c Σ_i max(0, 1 − y_i w^T x_i)

• What is the interpretation of this?
• This is the hinge loss function
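A minimal sketch of evaluating this soft-SVM objective as a regularized hinge loss in Python/NumPy; the weight vector, data, and value of c are hypothetical.

```python
import numpy as np

def soft_svm_objective(w, X, y, c=1.0):
    """1/2 * ||w||^2 + c * sum_i max(0, 1 - y_i * w^T x_i)."""
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + c * hinge.sum()

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])        # hypothetical weight vector
print(soft_svm_objective(w, X, y))  # all margins >= 1 here, so only the norm term remains
```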

Page 58:

Sub-Gradient

Standard 0/1 loss: penalizes all incorrectly classified examples by the same amount. Non-convex.

Hinge loss: penalizes incorrectly classified examples and correctly classified examples that lie within the margin. Convex, but not differentiable at y(w^T x) = 1.

Solution: sub-gradient descent

[Figure: 0/1 loss and hinge loss plotted against y(w^T x); the region between 0 and 1 corresponds to examples that are correctly classified but fall within the margin]

The sub-gradient of a function c at x0 is any vector v such that c(x) ≥ c(x0) + v·(x − x0) for all x.
At differentiable points this set contains only the gradient at x0.
Intuition: the set of all tangent lines (lines under c, touching c at x0).

Image from CIML

Page 59:

Summary

• Support Vector Machine
  – Find the max-margin separator
  – Hard SVM and soft SVM
  – Can also be solved in the dual
    • Allows adding kernels

• Many ways to optimize!
  – Current: stochastic methods in the primal, dual coordinate descent

• Key ideas to remember:
  – Learning: regularization + empirical loss minimization
  – Surprise: similarities to the perceptron (with small changes)

Page 60:

Summary

• Introduced an optimization framework for learning:
  – Minimization problem
  – Objective: data loss term and regularization cost term
  – Separates the learning objective from the learning algorithm

• Many algorithms for minimizing a function

• Can be used for regression and classification
  – Different loss functions
  – GD and SGD algorithms

• Classification: use a surrogate loss function
  – A smooth approximation to the 0-1 loss

Page 61:

Questions?