Lecture 4 [Autosaved] – Purdue University: Linear Classifiers
Reading for next time
Using classification to understand human interactions
Slide taken from D. Jurafsky
Link to the reading material in Piazza, under Resources
Classification
• A fundamental machine learning tool
  – Widely applicable in NLP
• Supervised learning: the learner is given a collection of labeled documents
  – Emails: spam/not spam; Reviews: pos/neg
• Build a function mapping documents to labels
  – Key property: generalization
    • The function should work well on new data
Sentiment Analysis
Dude, I just watched this horror flick! Selling points: nightmares scenes, torture scenes, terrible monsters that was so bad a##!
Don’t buy the popcorn it was terrible, the monsters selling it must have wanted to torture me, it was so bad it gave me nightmares!
What should your learning algorithm look at?
Deceptive Reviews
Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.
What should your learning algorithm look at?
Power Relations
Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.
"Blah unacceptable blah"
"Your honor, I agree blah blah blah"
What should your learning algorithm look at?
Naive Bayes
P(x1, x2, ..., xn | vj)
= P(x1 | x2, ..., xn, vj) P(x2, ..., xn | vj)
= P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3, ..., xn | vj)
= ...
= P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3 | x4, ..., xn, vj) ... P(xn | vj)

VMAP = argmax_v P(x1, x2, ..., xn | v) P(v)

Assumption: feature values are independent given the target value:
P(x1, x2, ..., xn | vj) = ∏_{i=1..n} P(xi | vj)
Naïve Bayes: practical stuff
• Underflow prevention
  – Multiplying probabilities will result in floating point underflow (round to zero)
  – Summing log probabilities will solve the problem:
• Feature "tweaking"
  – "Pure BoW" is often too naïve
  – Stemming/normalization (hate, hated, hating, hates)
  – Phrase extraction (e.g., negation, scientific/numeric expressions)
  – Upweighting – increase the counts of "important" words
NB decision: argmax_cj [ log P(cj) + Σi log P(xi | cj) ]
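A minimal sketch of this decision rule in code; the count-based training with add-one smoothing and the helper names are assumptions for illustration, not the slide's exact recipe:

import math
from collections import Counter

def train_nb(docs, labels):
    # docs: list of token lists; labels: list of class labels
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    vocab = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(tokens, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        n_words = sum(word_counts[c].values())
        # log P(c) + sum_i log P(x_i | c); add-one smoothing avoids log(0)
        score = math.log(n_c / total_docs)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class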
Linear Classifiers
• Linear threshold functions
  – Associate a weight (wi) with each feature (xi)
  – Prediction: sign(b + wᵀx) = sign(b + Σ wi xi)
    • b + wᵀx ≥ 0: predict y = 1
    • Otherwise: predict y = -1
    (a short code sketch of this rule follows this slide)
• NB is a linear threshold function
  – The weight vector (w) is assigned by estimating the label probabilities, given the model's independence assumptions
• Linear threshold functions are a very popular representation!
• Generative vs. Discriminative models
  – Generative models represent the joint probability P(x, y), while discriminative models represent P(y|x) directly
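As referenced above, a minimal sketch of the linear threshold rule sign(b + wᵀx); the weights and feature values are made up for illustration:

def predict(w, x, b):
    # sign(b + w·x): predict +1 if the score is non-negative, -1 otherwise
    score = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1

print(predict(w=[0.5, -1.0, 2.0], x=[1, 1, 0], b=-0.2))  # -> -1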
Generative vs. Discriminative
• Language models and NB are examples of generative models
• Generative models capture the joint probability of the inputs and outputs, P(x, y)
  – Most of the early work in statistical NLP is generative: language models, NB, HMMs, Bayesian networks, PCFGs, etc.
• We think about generative models as the hidden variables generating the observed data
• Super-easy training = counting!
[Figure: Naïve Bayes as a graphical model – a hidden label h generating the observed features x1, x2, ..., xk]
Generative vs. Discriminative
• On the other hand...
• We don't care about the joint probability; we care about the conditional probability P(y|x)
• Conditional or discriminative models characterize the decision boundary directly (= the conditional probability)
  – Work really well in practice, easy to incorporate arbitrary features, ...
  – SVM, perceptron, logistic regression, CRF, ...
• Training is harder (we'll see...)
[Figure: a discriminative model – observed features x1, x2, ..., xk predicting the label h (e.g., logistic regression)]
Practical Example
Source: Scaling to Very Very Large Corpora for Natural Language Disambiguation. Michele Banko, Eric Brill. Microsoft Research, Redmond, WA. 2001.
Task: context-sensitive spelling – {principle, principal}, {weather, whether}.
Linear Classifiers
Each point in this space is a document.
The coordinates (e.g., x1, x2) are determined by the feature activations.
[Figure: 2-D scatter of positive (+) and negative (−) documents separated by the decision boundary sign(b + wᵀx)]
Learning = adjust the weights (w) to find a good decision boundary
Expressivity
By transforming the feature space, these functions can be made linear.
Represent each point in 2D as (x, x²).
Expressivity
More realistic scenario: the data is almost linearly separable, except for some noise.
[Figure: almost linearly separable data – the boundary sign(b + wᵀx) separates most + and − points, with a few noisy examples on the wrong side]
Features
• So far we have discussed the BoW representation
  – In fact, you can use a very rich representation
• Broader definition
  – Functions mapping attributes of the input to a Boolean/categorical/numeric value
• Question: assume that you have a lexicon containing positive and negative sentiment words. How can you use it to improve over BoW?
φ1(x) = 1 if x1 is capitalized, 0 otherwise
φk(x) = 1 if x contains "good" more than twice, 0 otherwise
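These feature functions translate directly into code; a minimal sketch (the lexicon feature is one possible answer to the question above, not the slide's own):

def phi_1(tokens):
    # 1 if the first token is capitalized, 0 otherwise
    return 1 if tokens and tokens[0][:1].isupper() else 0

def phi_k(tokens):
    # 1 if the document contains "good" more than twice, 0 otherwise
    return 1 if [t.lower() for t in tokens].count("good") > 2 else 0

def phi_lexicon(tokens, pos_words, neg_words):
    # lexicon feature: count of positive words minus count of negative words
    return sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)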
Perceptron
• One of the earliest learning algorithms
  – Introduced by Rosenblatt in 1958 to model neural learning
• Goal: directly search for a separating hyperplane
  – If one exists, the perceptron will find it
  – If not, ...
• Online algorithm
  – Considers one example at a time (NB looks at the entire dataset)
• Error-driven algorithm
  – Updates the weights only when a mistake is made
Perceptron
• We learn f: X → {-1, +1}, represented as f = sgn(w·x)
• Where X = {0,1}^n or X = Rⁿ, and w ∈ Rⁿ
• Given labeled examples: {(x1, y1), (x2, y2), ..., (xm, ym)}
1. Initialize w = 0
2. Cycle through all examples
   a. Predict the label of instance x to be y' = sgn(w·x)
   b. If y' ≠ y, update the weight vector:
      w = w + r·y·x   (r – a constant, the learning rate)
      Otherwise, if y' = y, leave the weights unchanged.
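A minimal sketch of this algorithm, assuming numpy arrays X (one example per row) and y (labels in {-1, +1}); the fixed epoch count is an added assumption:

import numpy as np

def perceptron(X, y, r=1.0, epochs=10):
    w = np.zeros(X.shape[1])                    # 1. initialize w = 0
    for _ in range(epochs):                     # 2. cycle through all examples
        for xi, yi in zip(X, y):
            y_pred = 1 if w @ xi >= 0 else -1   # a. predict y' = sgn(w·x)
            if y_pred != yi:                    # b. on a mistake,
                w = w + r * yi * xi             #    update w = w + r*y*x
    return w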
Margin
• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
• The margin of a dataset (γ) is the maximum margin possible for that dataset using any weight vector.
Learning using the Perceptron Algorithm
• Perceptron guarantee: finds a linear separator (if the data is separable)
• There could be many models consistent with the training data
  – How did the perceptron algorithm deal with this problem?
• Trading some training error for better generalization
  – This problem is aggravated when we explode the feature space
How would you tweak the Perceptron to make it (1) realistic? (2) "better"?
Learning as Optimization
• Discriminative linear classifiers
  – So far we looked at the perceptron
  – It combines a model (linear representation) with an algorithm (update rule)
  – Let's try to abstract – we want to find a linear function that performs "the best" on the data
    • What are good properties of this classifier?
    • We want to explicitly control for error + "simplicity"
  – How can we discuss these terms separately from a specific algorithm?
    • Search the space of all possible linear functions
    • Find a specific function that has certain properties...
Learning as Optimization
• Instead, we can think about learning as optimizing an objective function (e.g., minimize mistakes)
• We can incorporate other considerations by modifying the objective function (regularization)
• Sometimes referred to as structural risk minimization
Loss Function
• To formalize performance, let's define a loss function loss(ŷ, y) – where y is the gold label
• The loss function measures the error on a single instance
  – The specific definition depends on the learning task
Regression: loss(ŷ, y) = (y − ŷ)²
Binary classification: loss(ŷ, y) = 0 if ŷ = y, 1 otherwise
0-1 Loss
[Figure: the 0-1 loss plotted against y(wᵀx) – the loss is 1 when y(wᵀx) < 0 and 0 when y(wᵀx) > 0]
loss(ŷ, y) = 0 iff ŷ = y; loss(ŷ, y) = 1 iff ŷ ≠ y
Equivalently, as a function of the margin: loss(y·f(x)) = 0 iff y·f(x) > 0 (correct); loss(y·f(x)) = 1 iff y·f(x) < 0 (misclassified)
The Empirical Risk of f(x)
• The empirical risk of a classifier on a dataset D is its average loss on the items in D
• Realistic learning objective: find an f that minimizes the empirical risk
R_D(f) = (1/|D|) Σ_{(xi, yi) ∈ D} loss(yi, f(xi))
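A minimal sketch of this definition in code, assuming a classifier f, a per-instance loss, and a dataset of (x, y) pairs:

def empirical_risk(f, loss, data):
    # average loss of f over the dataset D
    return sum(loss(y, f(x)) for x, y in data) / len(data)

zero_one_loss = lambda y, y_hat: 0 if y == y_hat else 1
squared_loss = lambda y, y_hat: (y - y_hat) ** 2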
Empirical Risk Minimization
• Learning: given a training dataset D, return the classifier f(x) that minimizes the empirical risk
• Given this definition, we can view learning as a minimization problem
  – The objective function to minimize (the empirical risk) is defined with respect to a specific loss function
  – Our minimization procedure (aka learning) will be influenced by the choice of loss function
    • Some are easier to minimize than others!
Error Surface
• Linear classifiers: hypothesis space parameterized by w
• Error/Loss/Risk are all functions of w
[Figure: (empirical) error plotted over the hypothesis space, showing a local minimum and the global minimum]
Learning: find the global minimum of the error surface
Convex Error Surfaces
• Convex functions have a single minimum point
  – Local minimum = global minimum
  – Easier to optimize
[Figure: a convex error surface – (empirical) error over the hypothesis space with a single global minimum]
Convex error functions have no local minima.
Error Surface for Squared Loss
Err(w) = ½ Σ_{d ∈ D} (yd − ŷd)², where ŷ = wᵀx
(We add the ½ for convenience.)
Since yd is a constant (for a given dataset), the error function is a quadratic function of w (a paraboloid) → the squared-loss function is convex!
How can we find the global minimum?
Gradient Descent Intuition
[Figure: Err(w) plotted against w, with iterates w1, w2, w3, w4 – what is the gradient of Err(w) at w1?]
(1) The derivative of the function at w1 is the slope of the tangent line → positive slope (increasing). Which direction should we move to decrease the value of Err(w)?
(2) The gradient determines the direction of steepest increase of Err(w), so go in the opposite direction.
(3) We also need to determine the step size (aka learning rate). What happens if we overshoot?
Note about the GD step size
• Setting the step size to a very small value
  – Slow convergence rate
• Setting the step size to a large value
  – May oscillate (consistently overshoot the global minimum)
• Tune experimentally
  – More sophisticated algorithms set the value automatically
The Gradient of Error(w)
∇Err(w) = [ ∂Err(w)/∂w0, ∂Err(w)/∂w1, ..., ∂Err(w)/∂wn ]
The gradient is a vector of partial derivatives.
It indicates the direction of steepest increase in Err(w), for each one of w's coordinates.
The gradient is a generalization of the derivative.
Gradient Descent Updates
• Compute the gradient of the training error at each iteration
  – Batch mode: compute the gradient over all training examples
• Update w:
∇Err(w) = [ ∂Err(w)/∂w0, ∂Err(w)/∂w1, ..., ∂Err(w)/∂wn ]
w_{i+1} = w_i − α ∇Err(w_i), where α is the learning rate (> 0)
Computing ∇Err(wi) for Squared Loss
Err(w) = ½ Σ_{d ∈ D} (yd − f(xd))²
For a linear model f(x) = wᵀx, the partial derivative with respect to each weight is ∂Err(w)/∂wi = − Σ_{d ∈ D} (yd − f(xd)) xd,i
Batch Update Rule for Each wi
• Implementing gradient descent: as you go through the training data, accumulate the change in each wi of w
Gradient Descent for Squared Loss
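The algorithm figure on this slide did not survive extraction; as an illustration only, a minimal sketch of batch gradient descent for the squared loss with a linear model, assuming numpy arrays X (one example per row) and y (targets):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    # minimize Err(w) = 1/2 * sum_d (y_d - w·x_d)^2
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)   # gradient accumulated over all training examples
        w = w - alpha * grad       # w_{i+1} = w_i - alpha * grad
    return w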
Batch vs. Online Learning
• The gradient descent algorithm makes updates after going over the entire dataset
  – The dataset can be huge
  – Streaming mode (we cannot assume we saw all the data)
  – Online learning allows "adapting" to changes in the target function
• Stochastic Gradient Descent
  – Similar to GD, but updates after each example
  – Can we make the same convergence assumptions as in GD?
• Variations: update after a subset of examples
Stochastic Gradient Descent
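The SGD pseudocode itself is not in the extracted text; a minimal sketch of the per-example variant of the batch sketch above (squared loss, numpy assumed):

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w = w - alpha * (w @ xi - yi) * xi   # update after each example
    return w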
Polynomial Regression
• GD: a general optimization algorithm
  – Works for classification (with a different loss function)
  – Incorporate polynomial features to fit a complex model
  – Danger – overfitting!
Regularization
• We want a way to express preferences when searching for a classification function (i.e., learning)
• Many functions can agree with the data; which one should we choose?
  – Intuition: prefer simpler models
  – Sometimes we are even willing to trade a higher error rate for a simpler model (why?)
• Add a regularization term:
  – This is a form of inductive bias
min_w Σ_n loss(yn, f(xn; w)) + λ R(w)
How do different values of λ impact learning?
min_w Σ_n loss(yn, f(xn; w)) + λ R(w)
Regularization
• A very popular choice of regularization term is to minimize the norm of the weight vector
  – For example:
  – For convenience: ½ squared norm
• Q: What is the gradient of the regularized loss function?
• At each iteration we also subtract λ·w from the weights
• In general, we can pick other norms (p-norms)
  – Referenced as the L-p norm
  – Different p will have a different effect!
||w|| = √( Σ_d wd² )
min_w Σ_n loss(yn, f(xn; w)) + (λ/2) ||w||²
||w||_p = ( Σ_d |wd|^p )^(1/p)
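With the (λ/2)||w||² term, the gradient picks up an extra λ·w component, so each update also shrinks the weights; a minimal sketch of one regularized step (the function and argument names are illustrative):

def regularized_step(w, grad_loss, alpha, lam):
    # gradient of loss + (lambda/2)*||w||^2 is grad_loss + lambda * w
    return w - alpha * (grad_loss + lam * w)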
Classification
• So far:
  – A general optimization framework for learning
  – Minimize a regularized loss function
  – Gradient descent is an all-purpose tool
    • We computed the gradient of the squared loss function
• Moving to classification should be very easy
  – Simply replace the loss function
• ...
min_w Σ_n loss(yn, f(xn; w)) + (λ/2) ||w||²
Classification
• Can we minimize the empirical error directly?
  – Use the 0-1 loss function
• Problem: it cannot be optimized directly
  – Non-convex and non-differentiable
• Solution: define a smooth loss function that is an upper bound to the 0-1 loss function
Square Loss is an Upper Bound to 0/1 Loss
Is the square loss a good candidate to be a surrogate loss function for the 0-1 loss?
Surrogate Loss Functions
• Surrogate loss function: a smooth approximation to the 0-1 loss
  – Upper bound to the 0-1 loss
Logistic Function
• A smooth version of the threshold function
• Known as the sigmoid/logistic function
  – Smooth transition between 0 and 1
• Can be interpreted as the conditional probability
• Decision boundary
  – y = 1: h(x) > 0.5 ⇔ wᵀx ≥ 0
  – y = 0: h(x) < 0.5 ⇔ wᵀx < 0
h_w(x) = g(wᵀx), where z = wᵀx and g(z) = 1 / (1 + e^(−z)) is the sigmoid (logistic) function
Logistic Regression
• Learning: optimize the likelihood of the data
  – Likelihood: the probability of our data under the current parameters
  – For easier optimization, we look at the (negative) log likelihood
• Cost function
  – If y = 1: −log P(y=1 | x, w) = −log g(wᵀx)
  – If the model gives very high probability to y = 1, what is the cost?
  – If y = 0: −log P(y=0 | x, w) = −log (1 − P(y=1 | x, w)) = −log (1 − g(wᵀx))
• Or more succinctly (with L2 regularization):
• This function is convex and differentiable
  – Compute the gradient, minimize using gradient descent
Err(w) = − Σ_i [ yi log h(xi) + (1 − yi) log(1 − h(xi)) ]
With L2 regularization:
Err(w) = − Σ_i [ yi log g(wᵀxi) + (1 − yi) log(1 − g(wᵀxi)) ] + (λ/2) ||w||²
Logistic Regression
• Let’sconsiderthestochasticgradientdescentruleforLR:
• Whatisthedifferencecomparedto:– LinearRegression?– Perceptron?
wj = wj + α (yi − h(xi)) xi,j
(a gradient descent step on the negative log likelihood)
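A minimal sketch of this stochastic update, assuming labels in {0, 1} and numpy arrays; the optional L2 term follows the regularized objective above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sgd(X, y, alpha=0.1, lam=0.0, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            h = sigmoid(w @ xi)                        # h(x) = g(w·x)
            w = w - alpha * ((h - yi) * xi + lam * w)  # step on the regularized negative log likelihood
    return w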
Next up: Hinge Loss
• Another popular choice for the loss function is the hinge loss
  – We will discuss it in the context of support vector machines (SVM)
L(y, f(x)) = max(0, 1 − y·f(x))
It's easy to observe that:
(1) The hinge loss is an upper bound to the 0-1 loss
(2) The hinge loss is a good approximation to the 0-1 loss
(3) BUT it is not differentiable at y(wᵀx) = 1
Solution: sub-gradient descent
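For reference, the hinge loss written as code (y is in {-1, +1} and score = wᵀx):

def hinge_loss(y, score):
    # zero once the example is classified correctly with margin at least 1
    return max(0.0, 1.0 - y * score)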
Maximal Margin Classification
Motivation for the notion of maximal margin
Defined w.r.t. a dataset S:
Some Definitions
• Margin: the distance of the closest point from a hyperplane
• Our new objective function
  – Learning is essentially solving:
γ = min_{(xi, yi)} yi (wᵀxi + b) / ||w||
(This is known as the geometric margin; the numerator is known as the functional margin.)
max_w γ
Maximal Margin
• We want to find: max_w γ
• Observation: we don't care about the magnitude of w!
  – x1 + x2 = 0, 10x1 + 10x2 = 0, and 100x1 + 100x2 = 0 all define the same hyperplane
• Set the functional margin to 1, and minimize w
  – max_w γ is equivalent to min_w ||w|| in this setting
• To make life easy: let's minimize ||w||²
Hard SVM Optimization
• This leads to a well-defined constrained optimization problem, known as Hard SVM:
• This is an optimization problem in (n+1) variables, with |S| = m inequality constraints.
Minimize: ½ ||w||²
Subject to: ∀ (x, y) ∈ S: y wᵀx ≥ 1
Hard vs. Soft SVM
Hard SVM – Minimize: ½ ||w||², subject to: ∀ (x, y) ∈ S: y wᵀx ≥ 1
Soft SVM – Minimize: ½ ||w||² + C Σ_i ξi, subject to: ∀ (xi, yi) ∈ S: yi wᵀxi ≥ 1 − ξi; ξi ≥ 0, i = 1…m
Objective Function for Soft SVM
• The relaxation of the constraint yi wᵀxi ≥ 1 can be done by introducing a slack variable ξi (per example) and requiring: yi wᵀxi ≥ 1 − ξi; ξi ≥ 0
• Now, we want to solve: Min ½ ||w||² + C Σ_i ξi (subject to ξi ≥ 0)
• Which can be written as: Min ½ ||w||² + C Σ_i max(0, 1 − yi wᵀxi)
• What is the interpretation of this? This is the hinge loss function.
Sub-Gradient
[Figure: the 0/1 loss and the hinge loss plotted against y(wᵀx)]
Standard 0/1 loss: penalizes all incorrectly classified examples by the same amount. Non-convex.
Hinge loss: penalizes incorrectly classified examples, and correctly classified examples that lie within the margin. Convex, but not differentiable at y(wᵀx) = 1. Solution: sub-gradient.
The sub-gradient of a function c at x0 is any vector v such that c(x) ≥ c(x0) + v·(x − x0) for all x. At differentiable points this set contains only the gradient at x0. Intuition: the set of all tangent lines (lines under c, touching c at x0).
Image from CIML
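A minimal sketch of sub-gradient descent on the soft-SVM objective ½||w||² + C Σ max(0, 1 − y·wᵀx), assuming numpy arrays and labels in {-1, +1}; this is an illustration, not the lecture's exact algorithm:

import numpy as np

def svm_subgradient_descent(X, y, C=1.0, alpha=0.01, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        violating = margins < 1                        # inside the margin or misclassified
        # sub-gradient: w from the regularizer, -y*x summed over violating examples
        grad = w - C * (X[violating].T @ y[violating])
        w = w - alpha * grad
    return w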
Summary
• Support Vector Machine
  – Find the max-margin separator
  – Hard SVM and Soft SVM
  – Can also be solved in the dual
    • Allows adding kernels
• Many ways to optimize!
  – Current: stochastic methods in the primal, dual coordinate descent
• Key ideas to remember:
  – Learning: regularization + empirical loss minimization
  – Surprise: similarities to the Perceptron (with small changes)
Summary
• Introduced an optimization framework for learning:
  – A minimization problem
  – Objective: a data loss term and a regularization cost term
  – Separate the learning objective from the learning algorithm
• Many algorithms for minimizing a function
• Can be used for regression and classification
  – Different loss functions
  – GD and SGD algorithms
• Classification: use a surrogate loss function
  – A smooth approximation to the 0-1 loss
Questions?