Lecture 4 [Autosaved] – Purdue University: Linear Classifiers
Reading for next time
Using classification to understand human interactions
Slide taken from D. Jurafsky
Link to the reading material in Piazza, under Resources
Classification
• A fundamental machine learning tool
  – Widely applicable in NLP
• Supervised learning: the learner is given a collection of labeled documents
  – Emails: spam/not spam; Reviews: pos/neg
• Build a function mapping documents to labels
  – Key property: generalization
    • The function should work well on new data
Sentiment Analysis
Dude, I just watched this horror flick! Selling points: nightmares scenes, torture scenes, terrible monsters that was so bad a##!
Don’t buy the popcorn it was terrible, the monsters selling it must have wanted to torture me, it was so bad it gave me nightmares!
What should your learning algorithm look at?
Deceptive Reviews
Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.
What should your learning algorithm look at?
Power Relations
Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.
"Blah unacceptable blah"
"Your honor, I agree blah blah blah"
What should your learning algorithm look at?
Naive Bayes
P(x1, x2, ..., xn | vj)
= P(x1 | x2, ..., xn, vj) P(x2, ..., xn | vj)
= P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3, ..., xn | vj)
= ...
= P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3 | x4, ..., xn, vj) ... P(xn | vj)

VMAP = argmax_v P(x1, x2, ..., xn | v) P(v)

Assumption: feature values are independent given the target value:
P(x1, x2, ..., xn | vj) = ∏_{i=1..n} P(xi | vj)
Naïve Bayes: practical stuff
• Underflow prevention
  – Multiplying probabilities will result in floating point underflow (round to zero)
  – Summing log probabilities will solve the problem:
• Feature "tweaking"
  – "Pure BoW" is often too naïve
  – Stemming/normalization (hate, hated, hating, hates)
  – Phrase extraction (e.g., negation, scientific/numeric expressions)
  – Upweighting – increase the counts of "important" words
NB decision: argmax_cj [ log P(cj) + Σi log P(xi | cj) ]
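A minimal sketch of this decision rule in code; the count-based training with add-one smoothing and the helper names are assumptions for illustration, not the slide's exact recipe:

import math
from collections import Counter

def train_nb(docs, labels):
    # docs: list of token lists; labels: list of class labels
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    vocab = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(tokens, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        n_words = sum(word_counts[c].values())
        # log P(c) + sum_i log P(x_i | c); add-one smoothing avoids log(0)
        score = math.log(n_c / total_docs)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class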
Linear Classifiers
• Linear threshold functions
  – Associate a weight (wi) with each feature (xi)
  – Prediction: sign(b + wᵀx) = sign(b + Σ wi xi)
    • b + wᵀx ≥ 0: predict y = 1
    • Otherwise: predict y = -1
    (a short code sketch of this rule follows this slide)
• NB is a linear threshold function
  – The weight vector (w) is assigned by estimating the label probabilities, given the model's independence assumptions
• Linear threshold functions are a very popular representation!
• Generative vs. Discriminative models
  – Generative models represent the joint probability P(x, y), while discriminative models represent P(y|x) directly
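As referenced above, a minimal sketch of the linear threshold rule sign(b + wᵀx); the weights and feature values are made up for illustration:

def predict(w, x, b):
    # sign(b + w·x): predict +1 if the score is non-negative, -1 otherwise
    score = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1

print(predict(w=[0.5, -1.0, 2.0], x=[1, 1, 0], b=-0.2))  # -> -1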
Generative vs. Discriminative
• Language models and NB are examples of generative models
• Generative models capture the joint probability of the inputs and outputs, P(x, y)
  – Most of the early work in statistical NLP is generative: language models, NB, HMMs, Bayesian networks, PCFGs, etc.
• We think about generative models as the hidden variables generating the observed data
• Super-easy training = counting!
[Figure: Naïve Bayes as a graphical model – a hidden label h generating the observed features x1, x2, ..., xk]
Generative vs. Discriminative
• On the other hand...
• We don't care about the joint probability; we care about the conditional probability P(y|x)
• Conditional or discriminative models characterize the decision boundary directly (= the conditional probability)
  – Work really well in practice, easy to incorporate arbitrary features, ...
  – SVM, perceptron, logistic regression, CRF, ...
• Training is harder (we'll see...)
[Figure: a discriminative model – observed features x1, x2, ..., xk predicting the label h (e.g., logistic regression)]
Practical Example
Source: Scaling to Very Very Large Corpora for Natural Language Disambiguation. Michele Banko, Eric Brill. Microsoft Research, Redmond, WA. 2001.
Task: context-sensitive spelling – {principle, principal}, {weather, whether}.
Linear Classifiers
Each point in this space is a document.
The coordinates (e.g., x1, x2) are determined by the feature activations.
[Figure: 2-D scatter of positive (+) and negative (−) documents separated by the decision boundary sign(b + wᵀx)]
Learning = adjust the weights (w) to find a good decision boundary
Expressivity
By transforming the feature space, these functions can be made linear.
Represent each point in 2D as (x, x²).
Expressivity
More realistic scenario: the data is almost linearly separable, except for some noise.
[Figure: almost linearly separable data – the boundary sign(b + wᵀx) separates most + and − points, with a few noisy examples on the wrong side]
Features
• So far we have discussed the BoW representation
  – In fact, you can use a very rich representation
• Broader definition
  – Functions mapping attributes of the input to a Boolean/categorical/numeric value
• Question: assume that you have a lexicon containing positive and negative sentiment words. How can you use it to improve over BoW?
φ1(x) = 1 if x1 is capitalized, 0 otherwise
φk(x) = 1 if x contains "good" more than twice, 0 otherwise
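These feature functions translate directly into code; a minimal sketch (the lexicon feature is one possible answer to the question above, not the slide's own):

def phi_1(tokens):
    # 1 if the first token is capitalized, 0 otherwise
    return 1 if tokens and tokens[0][:1].isupper() else 0

def phi_k(tokens):
    # 1 if the document contains "good" more than twice, 0 otherwise
    return 1 if [t.lower() for t in tokens].count("good") > 2 else 0

def phi_lexicon(tokens, pos_words, neg_words):
    # lexicon feature: count of positive words minus count of negative words
    return sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)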
Perceptron
• One of the earliest learning algorithms
  – Introduced by Rosenblatt in 1958 to model neural learning
• Goal: directly search for a separating hyperplane
  – If one exists, the perceptron will find it
  – If not, ...
• Online algorithm
  – Considers one example at a time (NB looks at the entire dataset)
• Error-driven algorithm
  – Updates the weights only when a mistake is made
Perceptron
• We learn f: X → {-1, +1}, represented as f = sgn(w·x)
• Where X = {0,1}^n or X = Rⁿ, and w ∈ Rⁿ
• Given labeled examples: {(x1, y1), (x2, y2), ..., (xm, ym)}
1. Initialize w = 0
2. Cycle through all examples
   a. Predict the label of instance x to be y' = sgn(w·x)
   b. If y' ≠ y, update the weight vector:
      w = w + r·y·x   (r – a constant, the learning rate)
      Otherwise, if y' = y, leave the weights unchanged.
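A minimal sketch of this algorithm, assuming numpy arrays X (one example per row) and y (labels in {-1, +1}); the fixed epoch count is an added assumption:

import numpy as np

def perceptron(X, y, r=1.0, epochs=10):
    w = np.zeros(X.shape[1])                    # 1. initialize w = 0
    for _ in range(epochs):                     # 2. cycle through all examples
        for xi, yi in zip(X, y):
            y_pred = 1 if w @ xi >= 0 else -1   # a. predict y' = sgn(w·x)
            if y_pred != yi:                    # b. on a mistake,
                w = w + r * yi * xi             #    update w = w + r*y*x
    return w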
Margin
• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
• The margin of a dataset (γ) is the maximum margin possible for that dataset using any weight vector.
Learning using the Perceptron Algorithm
• Perceptron guarantee: finds a linear separator (if the data is separable)
• There could be many models consistent with the training data
  – How did the perceptron algorithm deal with this problem?
• Trading some training error for better generalization
  – This problem is aggravated when we explode the feature space
How would you tweak the Perceptron to make it (1) realistic? (2) "better"?
Learning as Optimization
• Discriminative linear classifiers
  – So far we looked at the perceptron
  – It combines a model (linear representation) with an algorithm (update rule)
  – Let's try to abstract – we want to find a linear function that performs "the best" on the data
    • What are good properties of this classifier?
    • We want to explicitly control for error + "simplicity"
  – How can we discuss these terms separately from a specific algorithm?
    • Search the space of all possible linear functions
    • Find a specific function that has certain properties...
Learning as Optimization
• Instead, we can think about learning as optimizing an objective function (e.g., minimize mistakes)
• We can incorporate other considerations by modifying the objective function (regularization)
• Sometimes referred to as structural risk minimization
Loss Function
• To formalize performance, let's define a loss function loss(ŷ, y) – where y is the gold label
• The loss function measures the error on a single instance
  – The specific definition depends on the learning task
Regression: loss(ŷ, y) = (y − ŷ)²
Binary classification: loss(ŷ, y) = 0 if ŷ = y, 1 otherwise
0-1 Loss
[Figure: the 0-1 loss plotted against y(wᵀx) – the loss is 1 when y(wᵀx) < 0 and 0 when y(wᵀx) > 0]
loss(ŷ, y) = 0 iff ŷ = y; loss(ŷ, y) = 1 iff ŷ ≠ y
Equivalently, as a function of the margin: loss(y·f(x)) = 0 iff y·f(x) > 0 (correct); loss(y·f(x)) = 1 iff y·f(x) < 0 (misclassified)
The Empirical Risk of f(x)
• The empirical risk of a classifier on a dataset D is its average loss on the items in D
• Realistic learning objective: find an f that minimizes the empirical risk
R_D(f) = (1/|D|) Σ_{(xi, yi) ∈ D} loss(yi, f(xi))
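A minimal sketch of this definition in code, assuming a classifier f, a per-instance loss, and a dataset of (x, y) pairs:

def empirical_risk(f, loss, data):
    # average loss of f over the dataset D
    return sum(loss(y, f(x)) for x, y in data) / len(data)

zero_one_loss = lambda y, y_hat: 0 if y == y_hat else 1
squared_loss = lambda y, y_hat: (y - y_hat) ** 2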
Empirical Risk Minimization
• Learning: given a training dataset D, return the classifier f(x) that minimizes the empirical risk
• Given this definition, we can view learning as a minimization problem
  – The objective function to minimize (the empirical risk) is defined with respect to a specific loss function
  – Our minimization procedure (aka learning) will be influenced by the choice of loss function
    • Some are easier to minimize than others!
Error Surface
• Linear classifiers: hypothesis space parameterized by w
• Error/Loss/Risk are all functions of w
[Figure: (empirical) error plotted over the hypothesis space, showing a local minimum and the global minimum]
Learning: find the global minimum of the error surface
Convex Error Surfaces
• Convex functions have a single minimum point
  – Local minimum = global minimum
  – Easier to optimize
[Figure: a convex error surface – (empirical) error over the hypothesis space with a single global minimum]
Convex error functions have no local minima.
Error Surface for Squared Loss
Err(w) = ½ Σ_{d ∈ D} (yd − ŷd)², where ŷ = wᵀx
(We add the ½ for convenience.)
Since yd is a constant (for a given dataset), the error function is a quadratic function of w (a paraboloid) → the squared-loss function is convex!
How can we find the global minimum?
Gradient Descent Intuition
[Figure: Err(w) plotted against w, with iterates w1, w2, w3, w4 – what is the gradient of Err(w) at w1?]
(1) The derivative of the function at w1 is the slope of the tangent line → positive slope (increasing). Which direction should we move to decrease the value of Err(w)?
(2) The gradient determines the direction of steepest increase of Err(w), so go in the opposite direction.
(3) We also need to determine the step size (aka learning rate). What happens if we overshoot?
Note about the GD step size
• Setting the step size to a very small value
  – Slow convergence rate
• Setting the step size to a large value
  – May oscillate (consistently overshoot the global minimum)
• Tune experimentally
  – More sophisticated algorithms set the value automatically
The Gradient of Error(w)
∇Err(w) = [ ∂Err(w)/∂w0, ∂Err(w)/∂w1, ..., ∂Err(w)/∂wn ]
The gradient is a vector of partial derivatives.
It indicates the direction of steepest increase in Err(w), for each one of w's coordinates.
The gradient is a generalization of the derivative.
Gradient Descent Updates
• Compute the gradient of the training error at each iteration
  – Batch mode: compute the gradient over all training examples
• Update w:
∇Err(w) = [ ∂Err(w)/∂w0, ∂Err(w)/∂w1, ..., ∂Err(w)/∂wn ]
w_{i+1} = w_i − α ∇Err(w_i), where α is the learning rate (> 0)
Computing ∇Err(wi) for Squared Loss
Err(w) = ½ Σ_{d ∈ D} (yd − f(xd))²
For a linear model f(x) = wᵀx, the partial derivative with respect to each weight is ∂Err(w)/∂wi = − Σ_{d ∈ D} (yd − f(xd)) xd,i
Batch Update Rule for Each wi
• Implementing gradient descent: as you go through the training data, accumulate the change in each wi of w
Gradient Descent for Squared Loss
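The algorithm figure on this slide did not survive extraction; as an illustration only, a minimal sketch of batch gradient descent for the squared loss with a linear model, assuming numpy arrays X (one example per row) and y (targets):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    # minimize Err(w) = 1/2 * sum_d (y_d - w·x_d)^2
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)   # gradient accumulated over all training examples
        w = w - alpha * grad       # w_{i+1} = w_i - alpha * grad
    return w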
Batch vs. Online Learning
• The gradient descent algorithm makes updates after going over the entire dataset
  – The dataset can be huge
  – Streaming mode (we cannot assume we saw all the data)
  – Online learning allows "adapting" to changes in the target function
• Stochastic Gradient Descent
  – Similar to GD, but updates after each example
  – Can we make the same convergence assumptions as in GD?
• Variations: update after a subset of examples
Stochastic Gradient Descent
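The SGD pseudocode itself is not in the extracted text; a minimal sketch of the per-example variant of the batch sketch above (squared loss, numpy assumed):

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w = w - alpha * (w @ xi - yi) * xi   # update after each example
    return w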
Polynomial Regression
• GD: a general optimization algorithm
  – Works for classification (with a different loss function)
  – Incorporate polynomial features to fit a complex model
  – Danger – overfitting!
Regularization
• We want a way to express preferences when searching for a classification function (i.e., learning)
• Many functions can agree with the data; which one should we choose?
  – Intuition: prefer simpler models
  – Sometimes we are even willing to trade a higher error rate for a simpler model (why?)
• Add a regularization term:
  – This is a form of inductive bias
min_w Σ_n loss(yn, f(xn; w)) + λ R(w)
How do different values of λ impact learning?
min_w Σ_n loss(yn, f(xn; w)) + λ R(w)
Regularization
• A very popular choice of regularization term is to minimize the norm of the weight vector
  – For example:
  – For convenience: ½ squared norm
• Q: What is the gradient of the regularized loss function?
• At each iteration we also subtract λ·w from the weights
• In general, we can pick other norms (p-norms)
  – Referenced as the L-p norm
  – Different p will have a different effect!
||w|| = √( Σ_d wd² )
min_w Σ_n loss(yn, f(xn; w)) + (λ/2) ||w||²
||w||_p = ( Σ_d |wd|^p )^(1/p)
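With the (λ/2)||w||² term, the gradient picks up an extra λ·w component, so each update also shrinks the weights; a minimal sketch of one regularized step (the function and argument names are illustrative):

def regularized_step(w, grad_loss, alpha, lam):
    # gradient of loss + (lambda/2)*||w||^2 is grad_loss + lambda * w
    return w - alpha * (grad_loss + lam * w)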
Classification
• So far:
  – A general optimization framework for learning
  – Minimize a regularized loss function
  – Gradient descent is an all-purpose tool
    • We computed the gradient of the squared loss function
• Moving to classification should be very easy
  – Simply replace the loss function
• ...
min_w Σ_n loss(yn, f(xn; w)) + (λ/2) ||w||²
Classification
• Can we minimize the empirical error directly?
  – Use the 0-1 loss function
• Problem: it cannot be optimized directly
  – Non-convex and non-differentiable
• Solution: define a smooth loss function that is an upper bound to the 0-1 loss function
Square Loss is an Upper Bound to 0/1 Loss
Is the square loss a good candidate to be a surrogate loss function for the 0-1 loss?
Surrogate Loss Functions
• Surrogate loss function: a smooth approximation to the 0-1 loss
  – Upper bound to the 0-1 loss
Logistic Function
• A smooth version of the threshold function
• Known as the sigmoid/logistic function
  – Smooth transition between 0 and 1
• Can be interpreted as the conditional probability
• Decision boundary
  – y = 1: h(x) > 0.5 ⇔ wᵀx ≥ 0
  – y = 0: h(x) < 0.5 ⇔ wᵀx < 0
h_w(x) = g(wᵀx), where z = wᵀx and g(z) = 1 / (1 + e^(−z)) is the sigmoid (logistic) function
Logistic Regression
• Learning: optimize the likelihood of the data
  – Likelihood: the probability of our data under the current parameters
  – For easier optimization, we look at the (negative) log likelihood
• Cost function
  – If y = 1: −log P(y=1 | x, w) = −log g(wᵀx)
  – If the model gives very high probability to y = 1, what is the cost?
  – If y = 0: −log P(y=0 | x, w) = −log (1 − P(y=1 | x, w)) = −log (1 − g(wᵀx))
• Or more succinctly (with L2 regularization):
• This function is convex and differentiable
  – Compute the gradient, minimize using gradient descent
Err(w) = − Σ_i [ yi log h(xi) + (1 − yi) log(1 − h(xi)) ]
With L2 regularization:
Err(w) = − Σ_i [ yi log g(wᵀxi) + (1 − yi) log(1 − g(wᵀxi)) ] + (λ/2) ||w||²
Logistic Regression
• Let’sconsiderthestochasticgradientdescentruleforLR:
• Whatisthedifferencecomparedto:– LinearRegression?– Perceptron?
wj = wj + α (yi − h(xi)) xi,j
(a gradient descent step on the negative log likelihood)
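A minimal sketch of this stochastic update, assuming labels in {0, 1} and numpy arrays; the optional L2 term follows the regularized objective above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sgd(X, y, alpha=0.1, lam=0.0, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            h = sigmoid(w @ xi)                        # h(x) = g(w·x)
            w = w - alpha * ((h - yi) * xi + lam * w)  # step on the regularized negative log likelihood
    return w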
Next up: Hinge Loss
• Another popular choice for the loss function is the hinge loss
  – We will discuss it in the context of support vector machines (SVM)
L(y, f(x)) = max(0, 1 − y·f(x))
It's easy to observe that:
(1) The hinge loss is an upper bound to the 0-1 loss
(2) The hinge loss is a good approximation to the 0-1 loss
(3) BUT it is not differentiable at y(wᵀx) = 1
Solution: sub-gradient descent
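For reference, the hinge loss written as code (y is in {-1, +1} and score = wᵀx):

def hinge_loss(y, score):
    # zero once the example is classified correctly with margin at least 1
    return max(0.0, 1.0 - y * score)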
Maximal Margin Classification
Motivation for the notion of maximal margin
Defined w.r.t. a dataset S:
Some Definitions
• Margin: the distance of the closest point from a hyperplane
• Our new objective function
  – Learning is essentially solving:
γ = min_{(xi, yi)} yi (wᵀxi + b) / ||w||
(This is known as the geometric margin; the numerator is known as the functional margin.)
max_w γ
Maximal Margin
• We want to find: max_w γ
• Observation: we don't care about the magnitude of w!
  – x1 + x2 = 0, 10x1 + 10x2 = 0, and 100x1 + 100x2 = 0 all define the same hyperplane
• Set the functional margin to 1, and minimize w
  – max_w γ is equivalent to min_w ||w|| in this setting
• To make life easy: let's minimize ||w||²
Hard SVM Optimization
• This leads to a well-defined constrained optimization problem, known as Hard SVM:
• This is an optimization problem in (n+1) variables, with |S| = m inequality constraints.
Minimize: ½ ||w||²
Subject to: ∀ (x, y) ∈ S: y wᵀx ≥ 1
Hard vs. Soft SVM
Hard SVM – Minimize: ½ ||w||², subject to: ∀ (x, y) ∈ S: y wᵀx ≥ 1
Soft SVM – Minimize: ½ ||w||² + C Σ_i ξi, subject to: ∀ (xi, yi) ∈ S: yi wᵀxi ≥ 1 − ξi; ξi ≥ 0, i = 1…m
Objective Function for Soft SVM
• The relaxation of the constraint yi wᵀxi ≥ 1 can be done by introducing a slack variable ξi (per example) and requiring: yi wᵀxi ≥ 1 − ξi; ξi ≥ 0
• Now, we want to solve: Min ½ ||w||² + C Σ_i ξi (subject to ξi ≥ 0)
• Which can be written as: Min ½ ||w||² + C Σ_i max(0, 1 − yi wᵀxi)
• What is the interpretation of this? This is the hinge loss function.
Sub-Gradient
[Figure: the 0/1 loss and the hinge loss plotted against y(wᵀx)]
Standard 0/1 loss: penalizes all incorrectly classified examples by the same amount. Non-convex.
Hinge loss: penalizes incorrectly classified examples, and correctly classified examples that lie within the margin. Convex, but not differentiable at y(wᵀx) = 1. Solution: sub-gradient.
The sub-gradient of a function c at x0 is any vector v such that c(x) ≥ c(x0) + v·(x − x0) for all x. At differentiable points this set contains only the gradient at x0. Intuition: the set of all tangent lines (lines under c, touching c at x0).
Image from CIML
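A minimal sketch of sub-gradient descent on the soft-SVM objective ½||w||² + C Σ max(0, 1 − y·wᵀx), assuming numpy arrays and labels in {-1, +1}; this is an illustration, not the lecture's exact algorithm:

import numpy as np

def svm_subgradient_descent(X, y, C=1.0, alpha=0.01, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        violating = margins < 1                        # inside the margin or misclassified
        # sub-gradient: w from the regularizer, -y*x summed over violating examples
        grad = w - C * (X[violating].T @ y[violating])
        w = w - alpha * grad
    return w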
Summary
• Support Vector Machine
  – Find the max-margin separator
  – Hard SVM and Soft SVM
  – Can also be solved in the dual
    • Allows adding kernels
• Many ways to optimize!
  – Current: stochastic methods in the primal, dual coordinate descent
• Key ideas to remember:
  – Learning: regularization + empirical loss minimization
  – Surprise: similarities to the Perceptron (with small changes)
Summary
• Introduced an optimization framework for learning:
  – A minimization problem
  – Objective: a data loss term and a regularization cost term
  – Separate the learning objective from the learning algorithm
• Many algorithms for minimizing a function
• Can be used for regression and classification
  – Different loss functions
  – GD and SGD algorithms
• Classification: use a surrogate loss function
  – A smooth approximation to the 0-1 loss
Questions?