
ML4NLP: Learning as Optimization

Dan Goldwasser

CS 590NLP

Purdue University · dgoldwas@purdue.edu

Reading for next time

Using classification to understand human interactions

Slide taken from D. Jurafsky

Link to reading material in Piazza, under Resources

Classification

• A fundamental machine learning tool – Widely applicable in NLP

• Supervised learning: the learner is given a collection of labeled documents – Emails: spam/not spam; Reviews: pos/neg

• Build a function mapping documents to labels – Key property: generalization • the function should work well on new data

Sentiment Analysis

Dude, I just watched this horror flick! Selling points: nightmares scenes, torture scenes, terrible monsters that was so bad a##!

Don’t buy the popcorn it was terrible, the monsters selling it must have wanted to torture me, it was so bad it gave me nightmares!

What should your learning algorithm look at?

Deceptive Reviews

Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011. What should your learning algorithm look at?

Power Relations

Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.

Blah unacceptable blah

Your honor, I agree blah blah blah

What should your learning algorithm look at?

Naive Bayes

P(x1, x2, ..., xn | vj)
= P(x1 | x2, ..., xn, vj) P(x2, ..., xn | vj)
= P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3, ..., xn | vj)
= ...
= P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3 | x4, ..., xn, vj) ... P(xn | vj)

v_MAP = argmax_v P(x1, x2, ..., xn | v) P(v)

Assumption: feature values are independent given the target value, so

P(x1, x2, ..., xn | vj) = ∏_{i=1}^{n} P(xi | vj)

Naïve Bayes: practical stuff
• Underflow prevention – Multiplying probabilities will result in floating-point underflow (round to zero)

– Summing log probabilities will solve the problem:

• Feature "tweaking" – "Pure BoW" is often too naïve – Stemming/normalization (hate, hated, hating, hates) – Phrase extraction (e.g., negation, scientific/numeric expr.) – Upweighting: increase the counts of "important" words

NB decision: argmax_{cj} [ log P(cj) + Σ_i log P(xi | cj) ]
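A minimal Python sketch of this decision rule (assuming the class priors and per-feature likelihoods were already estimated elsewhere; the variable names and the smoothing floor for unseen features are illustrative):

```python
import math

def nb_predict(features, log_prior, log_likelihood):
    """Pick argmax_c [ log P(c) + sum_i log P(x_i | c) ].

    log_prior: dict mapping class -> log P(c)
    log_likelihood: dict mapping (feature, class) -> log P(x_i | c)
    features: list of active features (e.g., bag-of-words tokens)
    """
    best_class, best_score = None, -math.inf
    for c, lp in log_prior.items():
        # sum log probabilities instead of multiplying, to avoid underflow
        score = lp + sum(log_likelihood.get((f, c), math.log(1e-10))  # assumed smoothing floor
                         for f in features)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```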

Linear Classifiers
• Linear threshold functions

– Associate a weight (wi) with each feature (xi)
– Prediction: sign(b + wᵀx) = sign(b + Σ wi xi)  (a small sketch follows at the end of this slide)

• b + wᵀx ≥ 0: predict y = 1
• Otherwise: predict y = -1

• NB is a linear threshold function – The weight vector (w) is assigned by estimating the label probability, given the model's independence assumptions
• Linear threshold functions are a very popular representation!

• Generative vs. discriminative models – Generative models represent the joint probability P(x, y), while discriminative models represent P(y | x) directly
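As a concrete illustration (an assumed example, not from the slides), scoring a bag-of-words document with a linear threshold function might look like this:

```python
def predict(doc_tokens, weights, bias=0.0):
    """sign(b + w.x) for a sparse bag-of-words document.

    weights: dict mapping token -> weight w_i; doc_tokens: list of tokens.
    Each occurrence of a token contributes its weight to the score.
    """
    score = bias + sum(weights.get(tok, 0.0) for tok in doc_tokens)
    return 1 if score >= 0 else -1

# Toy sentiment weights (illustrative values only)
w = {"terrible": -1.5, "nightmares": -0.5, "selling": 0.2, "good": 1.0}
print(predict("the popcorn was terrible".split(), w))  # -> -1
```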

Generative vs. Discriminative

• Language models and NB are examples of generative models

• Generative models capture the joint probability of the inputs and outputs, P(x, y) – Most of the early work in statistical NLP is generative: language models, NB, HMM, Bayesian networks, PCFG, etc.

• We think about generative models as the hidden variables generating the observed data

• Super-easy training = counting!

[Figure: Naïve Bayes as a generative model: the label h generates the observed features x1, x2, ..., xk]

Generative vs. Discriminative

• On the other hand..
• We don't care about the joint probability, we care about the conditional probability – P(y | x)
• Conditional or discriminative models characterize the decision boundary directly (= the conditional probability). – Work really well in practice, easy to incorporate arbitrary features, ..
– SVM, perceptron, logistic regression, CRF, …
• Training is harder (we'll see..)

[Figure: the same graphical structure, with label h and observed features x1, x2, ..., xk]

Logistic regression

Practical Example

Source: Scaling to very very large corpora for natural language disambiguation. Michele Banko, Eric Brill. Microsoft Research, Redmond, WA. 2001.

Task: context-sensitive spelling: {principle, principal}, {weather, whether}.

Linear Classifiers

Each point in this space is a document.

The coordinates (e.g., x1, x2) are determined by feature activations.

[Figure: positive (+) and negative (-) documents in feature space, separated by the hyperplane sign(b + wᵀx)]

Learning = adjust the weights (w) to find a good decision boundary

Expressivity

By transforming the feature space these functions can be made linear

Represent each point in 2D as (x, x²)

Expressivity

More realistic scenario: the data is almost linearly separable, except for some noise.

[Figure: mostly separable positive (+) and negative (-) points, with a few noisy examples on the wrong side of the hyperplane sign(b + wᵀx)]

Features
• So far we have discussed the BoW representation – In fact, you can use a very rich representation

• Broader definition – Functions mapping attributes of the input to a Boolean/categorical/numeric value

• Question: assume that you have a lexicon containing positive and negative sentiment words. How can you use it to improve over BoW?

φ1(x) = 1 if x1 is capitalized, 0 otherwise
φk(x) = 1 if x contains "good" more than twice, 0 otherwise
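A small sketch of such feature functions in Python; the lexicon-based feature is an assumed illustration of one way to answer the question above:

```python
def phi_capitalized(tokens):
    # phi_1: 1 if the first token is capitalized, 0 otherwise
    return 1 if tokens and tokens[0][0].isupper() else 0

def phi_good_count(tokens):
    # phi_k: 1 if the document contains "good" more than twice, 0 otherwise
    return 1 if sum(t.lower() == "good" for t in tokens) > 2 else 0

def phi_lexicon(tokens, pos_words, neg_words):
    # Lexicon feature (illustrative): #positive minus #negative sentiment words
    return (sum(t.lower() in pos_words for t in tokens)
            - sum(t.lower() in neg_words for t in tokens))

tokens = "Dude , I just watched this horror flick !".split()
x = [phi_capitalized(tokens), phi_good_count(tokens),
     phi_lexicon(tokens, {"good", "great"}, {"terrible", "bad"})]
```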

Perceptron

• One of the earliest learning algorithms – Introduced by Rosenblatt (1958) to model neural learning

• Goal: directly search for a separating hyperplane – If one exists, the perceptron will find it – If not, …

• Online algorithm – Considers one example at a time (NB looks at the entire data)

• Error-driven algorithm – Updates the weights only when a mistake is made

Perceptron

• We learn f: X → {-1, +1}, represented as f = sgn(w·x)
• Where X = {0,1}ⁿ or X = Rⁿ, and w ∈ Rⁿ

• Given labeled examples: {(x1, y1), (x2, y2), …, (xm, ym)}

1. Initialize w = 0

2. Cycle through all examples

a. Predict the label of instance x to be y' = sgn(w·x)

b. If y' ≠ y, update the weight vector:

w = w + r y x   (r is a constant, the learning rate)
Otherwise, if y' = y, leave the weights unchanged.
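A minimal sketch of this training loop in Python (assuming dense feature vectors as lists of floats; the epoch count and learning rate r are illustrative):

```python
def perceptron_train(examples, n_features, r=1.0, epochs=10):
    """examples: list of (x, y) pairs, x a list of floats, y in {-1, +1}."""
    w = [0.0] * n_features                      # 1. initialize w = 0
    for _ in range(epochs):                     # 2. cycle through all examples
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            y_pred = 1 if score >= 0 else -1    # a. predict y' = sgn(w.x)
            if y_pred != y:                     # b. on a mistake, update
                for i in range(n_features):
                    w[i] += r * y * x[i]        #    w = w + r y x
    return w
```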

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

• The margin of a dataset (𝛾) is the maximum margin possible for that dataset using any weight vector.

Learning using the Perceptron Algorithm

• Perceptron guarantee: find a linear separator (if the data is separable)

• There could be many models consistent with the training data – How did the perceptron algorithm deal with this problem?

• Trading some training error for better generalization – This problem is aggravated when we explode the feature space

Perceptron (recap)

• Same setup and algorithm as above: cycle through the examples, predict y' = sgn(w·x), and on a mistake update w = w + r y x.

How would you tweak the Perceptron to make it (1) realistic? (2) "better"?

Learning as Optimization

• Discriminative linear classifiers – So far we looked at the perceptron – It combines a model (linear representation) with an algorithm (update rule)

– Let's try to abstract: we want to find a linear function that performs 'the best' on the data • What are good properties of this classifier? • We want to explicitly control for error + "simplicity"

– How can we discuss these terms separately from a specific algorithm? • Search the space of all possible linear functions • Find a specific function that has certain properties..

Learning as Optimization

• Instead we can think about learning as optimizing an objective function (e.g., minimize mistakes)

• We can incorporate other considerations by modifying the objective function (regularization)

• Sometimes referred to as structural risk minimization

Loss Function
• To formalize performance let's define a loss function loss(ŷ, y) – where y is the gold label and ŷ the prediction

• The loss function measures the error on a single instance – The specific definition depends on the learning task

Regression:  loss(ŷ, y) = (ŷ − y)²

Binary classification:  loss(ŷ, y) = 0 if ŷ = y, 1 otherwise

0-1 Loss

[Figure: the 0-1 loss as a function of y(wᵀx): loss 1 for y(wᵀx) < 0, loss 0 for y(wᵀx) > 0]

loss(y, f(x)) = 0 iff f(x) = y
loss(y, f(x)) = 1 iff f(x) ≠ y

Equivalently, in terms of the margin y·f(x):
loss(y·f(x)) = 0 iff y·f(x) > 0 (correct)
loss(y·f(x)) = 1 iff y·f(x) < 0 (misclassified)

The Empirical Risk of f(x)

• The empirical risk of a classifier on a dataset D is its average loss on the items in D

• Realistic learning objective: find an f that minimizes the empirical risk

R_D(f) = (1/|D|) Σ_{(xi, yi) ∈ D} loss(yi, f(xi))
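A small Python sketch of the two loss functions above and the empirical risk (the classifier f is assumed to return a label for each input):

```python
def squared_loss(y_hat, y):
    return (y_hat - y) ** 2          # regression loss

def zero_one_loss(y_hat, y):
    return 0 if y_hat == y else 1    # binary classification loss

def empirical_risk(f, data, loss):
    # data: list of (x, y) pairs; average loss of f over the dataset
    return sum(loss(f(x), y) for x, y in data) / len(data)

# Example: 0-1 risk of a trivial classifier that always predicts +1
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
print(empirical_risk(lambda x: 1, data, zero_one_loss))  # -> 0.5
```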

Empirical Risk Minimization

• Learning: given a training dataset D, return the classifier f(x) that minimizes the empirical risk

• Given this definition we can view learning as a minimization problem – The objective function to minimize (the empirical risk) is defined with respect to a specific loss function

– Our minimization procedure (aka learning) will be influenced by the choice of loss function • Some are easier to minimize than others!

Error Surface
• Linear classifiers: hypothesis space parameterized by w
• Error/Loss/Risk are all functions of w

[Figure: (empirical) error plotted over the hypothesis space, with a local minimum and the global minimum marked]

Learning: find the global minimum of the error surface

Convex Error Surfaces

• Convex functions have a single minimum point – Local minimum = global minimum – Easier to optimize

[Figure: a convex (empirical) error curve over the hypothesis space with a single global minimum]

Convex error functions have no local minima

Error Surface for Squared Loss

Err(w) = ½ Σ_{d∈D} (y_d − ŷ_d)²

We add the ½ for convenience

ŷ = wᵀx

Since y_d is a constant (for a given dataset), the error function is a quadratic function of w (a paraboloid) ⇒ the squared loss function is convex!

How can we find the global minimum?

Gradient Descent Intuition

[Figure: Err(w) plotted against w, with successive iterates w1, w2, w3, w4 descending toward the minimum]

What is the gradient of Error(w) at this point?

(1) The derivative of the function at w1 is the slope of the tangent line → positive slope (increasing). Which direction should we move to decrease the value of Err(w)?

(2) The gradient determines the direction of steepest increase of Err(w) (go in the opposite direction)

(3) We also need to determine the step size (aka learning rate). What happens if we overshoot?

Note about the GD step size

• Setting the step size to a very small value – Slow convergence rate

• Setting the step size to a large value – May oscillate (consistently overshoot the global minimum)

• Tune experimentally – More sophisticated algorithms set the value automatically

The Gradient of Error(w)

∇Err(w) = [ ∂Err(w)/∂w0, ∂Err(w)/∂w1, ..., ∂Err(w)/∂wn ]

The gradient is a vector of partial derivatives.

It indicates the direction of steepest increase in Err(w), for each one of w's coordinates.

The gradient is a generalization of the derivative.

Gradient Descent Updates

• Compute the gradient of the training error at each iteration – Batch mode: compute the gradient over all training examples

• Update w:

∇Err(w) = [ ∂Err(w)/∂w0, ∂Err(w)/∂w1, ..., ∂Err(w)/∂wn ]

w_{i+1} = w_i − α ∇Err(w_i)

where α is the learning rate (> 0)

Computing ∇Err(w_i) for Squared Loss

Err(w) = ½ Σ_{d∈D} (y_d − f(x_d))²

With f(x_d) = wᵀx_d, the partial derivative with respect to each weight is

∂Err(w)/∂w_i = − Σ_{d∈D} (y_d − f(x_d)) x_{d,i}

Batch Update Rule for Each w_i

• Implementing gradient descent: as you go through the training data, accumulate the change in each w_i of w

Gradient Descent for Squared Loss
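The transcript carries no extracted content for this slide; as an illustrative sketch (not the lecture's exact pseudocode), here is one plausible batch implementation of the update rule above in Python, with an assumed learning rate alpha and epoch count:

```python
def batch_gd_squared_loss(data, n_features, alpha=0.01, epochs=100):
    """data: list of (x, y) pairs with x a list of floats, y a float target."""
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for x, y in data:                          # accumulate the gradient over all examples
            pred = sum(wi * xi for wi, xi in zip(w, x))   # f(x) = w.x
            for i in range(n_features):
                grad[i] += -(y - pred) * x[i]      # d Err / d w_i for this example
        for i in range(n_features):
            w[i] -= alpha * grad[i]                # w <- w - alpha * grad Err(w)
    return w
```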

Batch vs. Online Learning

• The Gradient Descent algorithm makes updates after going over the entire dataset – The dataset can be huge – Streaming mode (we cannot assume we saw all the data) – Online learning allows "adapting" to changes in the target function

• Stochastic Gradient Descent – Similar to GD, but updates after each example – Can we make the same convergence assumptions as in GD?

• Variations: update after a subset of examples

Stochastic Gradient Descent
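This slide's content was not extracted either; a hedged Python sketch of the stochastic variant described on the previous slide, which updates w immediately after each example instead of accumulating a full-batch gradient:

```python
import random

def sgd_squared_loss(data, n_features, alpha=0.01, epochs=100):
    """Stochastic gradient descent for squared loss: one update per example."""
    w = [0.0] * n_features
    for _ in range(epochs):
        random.shuffle(data)                    # visit examples in random order
        for x, y in data:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n_features):
                # step on this single example's gradient only
                w[i] -= alpha * (-(y - pred) * x[i])
    return w
```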

Polynomial Regression

• GD: a general optimization algorithm – Works for classification (different loss function) – Incorporate polynomial features to fit a more complex model

– Danger: overfitting!

Regularization

• We want a way to express preferences when searching for a classification function (i.e., learning)

• Many functions can agree with the data; which one should we choose? – Intuition: prefer simpler models – Sometimes we are even willing to trade a higher error rate for a simpler model (why?)

• Add a regularization term: – This is a form of inductive bias

min_w Σ_n loss(y_n, f(x_n; w)) + λ R(w)

How do different λ values impact learning?

Regularization
• A very popular choice of regularization term is to minimize the norm of the weight vector – For example: ||w|| = sqrt( Σ_d w_d² ) – For convenience: the ½ squared norm

min_w Σ_n loss(y_n, f(x_n; w)) + (λ/2) ||w||²

• Q: What is the gradient update for this loss function? • At each iteration the regularizer subtracts λw from the weights

• In general, we can pick other norms (p-norm) – Referred to as the L-p norm – Different p will have a different effect!

||w||_p = ( Σ_d |w_d|^p )^{1/p}
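A brief sketch of how the L2 term changes a single gradient step (weight decay); the data-loss gradient g is assumed to come from whichever loss is being minimized:

```python
def l2_regularized_step(w, g, alpha, lam):
    """One gradient step on loss(w) + (lam/2)*||w||^2.

    w: current weights, g: gradient of the data loss at w,
    alpha: learning rate, lam: regularization strength.
    """
    # gradient of (lam/2)*||w||^2 is lam*w, so each step also shrinks the weights
    return [wi - alpha * (gi + lam * wi) for wi, gi in zip(w, g)]
```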

Classification
• So far: – A general optimization framework for learning – Minimize a regularized loss function

– Gradient descent is an all-purpose tool • Computed the gradient of the squared loss function

• Moving to classification should be very easy – Simply replace the loss function

• …

min_w Σ_n loss(y_n, f(x_n; w)) + (λ/2) ||w||²

Classification
• Can we minimize the empirical error directly? – Use the 0-1 loss function

• Problem: it cannot be optimized directly – Non-convex and non-differentiable

• Solution: define a smooth loss function that is an upper bound to the 0-1 loss function

Squared Loss is an Upper Bound to the 0/1 Loss

Is the squared loss a good candidate to be a surrogate loss function for the 0-1 loss?

Surrogate Loss Functions

• Surrogate loss function: a smooth approximation to the 0-1 loss – An upper bound to the 0-1 loss

Logistic Function
• A smooth version of the threshold function

• Known as the sigmoid/logistic function – Smooth transition between 0 and 1

• Can be interpreted as the conditional probability
• Decision boundary – y = 1: h(x) > 0.5 ⇔ wᵀx ≥ 0 – y = 0: h(x) < 0.5 ⇔ wᵀx < 0

h_w(x) = g(wᵀx),  z = wᵀx

g(z) = 1 / (1 + e^{−z})   (the sigmoid/logistic function)

Logistic Regression
• Learning: optimize the likelihood of the data

– Likelihood: the probability of our data under the current parameters – For easier optimization, we look at the (negative) log likelihood

• Cost function – If y = 1: -log P(y=1 | x, w) = -log g(wᵀx) – If the model gives very high probability to y = 1, what is the cost? – If y = 0: -log P(y=0 | x, w) = -log (1 − P(y=1 | x, w)) = -log (1 − g(wᵀx))

• Or more succinctly (with L2 regularization)

Err(w) = − Σ_i [ y_i log h_w(x_i) + (1 − y_i) log(1 − h_w(x_i)) ] + (λ/2) ||w||²

• This function is convex and differentiable – Compute the gradient, minimize using gradient descent

Logistic Regression

• Let's consider the stochastic gradient descent rule for LR:

w_j = w_j − α (h_w(x_i) − y_i) x_{i,j}

• What is the difference compared to: – Linear regression? – The perceptron?
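A compact Python sketch of this per-example update (sigmoid plus the SGD step above); the learning rate and epoch count are illustrative, and the L2 term is omitted for brevity:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_sgd(data, n_features, alpha=0.1, epochs=50):
    """data: list of (x, y) pairs, x a list of floats, y in {0, 1}."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in data:
            h = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # h_w(x) = g(w.x)
            for j in range(n_features):
                w[j] -= alpha * (h - y) * x[j]   # w_j <- w_j - alpha*(h_w(x) - y)*x_j
    return w
```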

Next up: Hinge Loss

• Another popular choice of loss function is the hinge loss – We will discuss it in the context of support vector machines (SVM)

L(y, f(x)) = max(0, 1 − y f(x))

It's easy to observe that:
(1) The hinge loss is an upper bound to the 0-1 loss
(2) The hinge loss is a good approximation of the 0-1 loss
(3) BUT …

It is not differentiable at y(wᵀx) = 1. Solution: sub-gradient descent.

Maximal Margin Classification
Motivation for the notion of maximal margin

Defined w.r.t. a dataset S:

Some Definitions

• Margin: the distance of the closest point from a hyperplane

γ = min_{(xi, yi)} yi (wᵀxi + b) / ||w||

This is known as the geometric margin; the numerator is known as the functional margin.

• Our new objective function – Learning is essentially solving: max_w γ
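A tiny sketch computing the geometric margin of a dataset for a given hyperplane (w, b), following the definition above:

```python
import math

def geometric_margin(data, w, b):
    """data: list of (x, y) pairs with y in {-1, +1}; returns min_i y_i(w.x_i + b)/||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in data)
```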

Maximal Margin

• We want to find: max_w γ

• Observation: we don't care about the magnitude of w!
  (x1 + x2 = 0, 10x1 + 10x2 = 0, and 100x1 + 100x2 = 0 all describe the same hyperplane)
• Set the functional margin to 1, and minimize w
• max_w γ is equivalent to min_w ||w|| in this setting
• To make life easy: let's minimize ½ ||w||² instead

Hard SVM Optimization

• This leads to a well-defined constrained optimization problem, known as Hard SVM:

Minimize: ½ ||w||²
Subject to: ∀ (x, y) ∈ S: y wᵀx ≥ 1

• This is an optimization problem in (n+1) variables, with |S| = m inequality constraints.

Hard vs. Soft SVM

Hard SVM:
Minimize: ½ ||w||²
Subject to: ∀ (x, y) ∈ S: y wᵀx ≥ 1

Soft SVM:
Minimize: ½ ||w||² + c Σ_i ξ_i
Subject to: ∀ (x_i, y_i) ∈ S: y_i wᵀx_i ≥ 1 − ξ_i ; ξ_i ≥ 0, i = 1…m

Objective Function for Soft SVM

• The relaxation of the constraint yi wᵀxi ≥ 1 can be done by introducing a slack variable ξ_i (per example) and requiring: yi wᵀxi ≥ 1 − ξ_i (ξ_i ≥ 0)
• Now, we want to solve:

Min ½ ||w||² + c Σ_i ξ_i   (subject to ξ_i ≥ 0)

• Which can be written as:

Min ½ ||w||² + c Σ_i max(0, 1 − yi wᵀxi)

• What is the interpretation of this? • This is the hinge loss function
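A hedged sketch of minimizing an equivalent per-example form of this objective with stochastic sub-gradient steps (in the spirit of Pegasos-style updates; the regularization constant, learning rate, and epoch count are illustrative, not from the slides):

```python
def svm_subgradient_sgd(data, n_features, lam=0.01, alpha=0.1, epochs=50):
    """Minimize (lam/2)*||w||^2 + mean_i max(0, 1 - y_i * w.x_i), with y in {-1, +1}."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n_features):
                # sub-gradient of the hinge term: -y*x_i if margin < 1, else 0
                g = (-y * x[i] if margin < 1 else 0.0) + lam * w[i]
                w[i] -= alpha * g
    return w
```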

Sub-Gradient

[Figure: the standard 0/1 loss and the hinge loss plotted against y(wᵀx)]

Standard 0/1 loss: penalizes all incorrectly classified examples by the same amount. Non-convex.

Hinge loss: penalizes incorrectly classified examples, and also correctly classified examples that lie within the margin. Convex, but not differentiable at y(wᵀx) = 1.

Solution: sub-gradient

The sub-gradient of a function c at x0 is any vector v such that c(x) ≥ c(x0) + v·(x − x0) for all x. At differentiable points this set only contains the gradient at x0. Intuition: the set of all tangent lines (lines under c, touching c at x0).

Image from CIML

Summary
• Support Vector Machine – Find the max-margin separator – Hard SVM and Soft SVM – Can also be solved in the dual

• Allows adding kernels

• Many ways to optimize! – Current: stochastic methods in the primal, dual coordinate descent

• Key ideas to remember: – Learning: regularization + empirical loss minimization – Surprise: similarities to the Perceptron (with small changes)

Summary

• Introduced an optimization framework for learning: – Minimization problem – Objective: a data loss term and a regularization cost term – Separate the learning objective from the learning algorithm

• Many algorithms for minimizing a function

• Can be used for regression and classification – Different loss function – GD and SGD algorithms

• Classification: use a surrogate loss function – Smooth approximation to the 0-1 loss

Questions?
