TRANSCRIPT
Natural Language Processing with Deep Learning
CS224N/Ling284
Lecture 4: Word Window Classification and Neural Networks
Christopher Manning and Richard Socher
Overview

Today:
• Classification background
• Updating word vectors for classification
• Window classification & cross entropy error derivation tips
• A single layer neural network!
• Max-Margin loss and backprop

This lecture will help a lot with PSet 1 :)
Classification setup and notation

• Generally we have a training dataset consisting of samples {x_i, y_i}, i = 1, …, N
• x_i — inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.
• y_i — labels we try to predict, for example:
  • classes: sentiment, named entities, buy/sell decision
  • other words
  • later: multi-word sequences
Classification intuition

• Training data: {x_i, y_i}, i = 1, …, N
• Simple illustration case:
  • Fixed 2d word vectors to classify
  • Using logistic regression
  • → linear decision boundary
• General ML: assume x is fixed, train logistic regression weights W → only modify the decision boundary
• Goal: predict for each x the probability p(y|x) = exp(W_y · x) / Σ_{c=1}^{C} exp(W_c · x)

Visualizations with ConvNetJS by Karpathy! http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Details of the softmax

• We can tease apart p(y|x) into two steps:
1. Take the y'th row of W and multiply that row with x:
   f_y = W_y · x = Σ_i W_{yi} x_i
   Compute all f_c for c = 1, …, C
2. Normalize to obtain the probability with the softmax function:
   p(y|x) = softmax(f)_y = exp(f_y) / Σ_{c=1}^{C} exp(f_c)
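A minimal NumPy sketch of these two steps (the toy W, x, and class count here are made up for illustration):

```python
import numpy as np

def softmax(f):
    # Subtract the max for numerical stability; this does not change the
    # result, because softmax is invariant to adding a constant to all scores.
    f = f - np.max(f)
    exp_f = np.exp(f)
    return exp_f / exp_f.sum()

# Step 1: scores f_c = W_c . x for all classes c = 1, ..., C
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])       # C = 3 classes, d = 2
x = np.array([2.0, 0.0])
f = W @ x                        # shape (C,)

# Step 2: normalize into a probability distribution
p = softmax(f)
```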
The softmax and cross-entropy error

• For each training example {x, y}, our objective is to maximize the probability of the correct class y
• Hence, we minimize the negative log probability of that class:
  J = −log p(y|x) = −log( exp(f_y) / Σ_{c=1}^{C} exp(f_c) )
Background: Why "cross entropy" error

• Assume a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0]. If our computed probability is q, then the cross entropy is:
  H(p, q) = −Σ_{c=1}^{C} p(c) log q(c)
• Because of the one-hot p, the only term left is the negative log probability of the true class
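A small sketch of this collapse (toy q and class index chosen for illustration): with one-hot p, the sum over classes keeps only the −log q(y) term.

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_c p(c) * log q(c)
    return -np.sum(p * np.log(q))

q = np.array([0.1, 0.7, 0.2])     # model's predicted distribution
y = 1                             # index of the true class
p = np.zeros(3)
p[y] = 1.0                        # one-hot ground-truth distribution

# With one-hot p, all terms but one vanish:
assert np.isclose(cross_entropy(p, q), -np.log(q[y]))
```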
Side note: The KL divergence

• Cross-entropy can be re-written in terms of the entropy and the Kullback-Leibler divergence between the two distributions:
  H(p, q) = H(p) + D_KL(p‖q)
• Because H(p) is zero in our case (and even if it weren't, it would be fixed and have no contribution to the gradient), minimizing this is equal to minimizing the KL divergence between p and q
• The KL divergence is not a distance but a non-symmetric measure of the difference between two probability distributions p and q:
  D_KL(p‖q) = Σ_{c=1}^{C} p(c) log( p(c) / q(c) )
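The identity H(p, q) = H(p) + D_KL(p‖q) can be checked numerically; with a one-hot p the entropy term is zero, so cross entropy and KL coincide (toy q for illustration):

```python
import numpy as np

def entropy(p):
    # H(p) = -sum p log p, treating 0 * log(0) as 0
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl(p, q):
    # D_KL(p || q) = sum p log(p / q), again skipping p = 0 terms
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = np.array([0.0, 1.0, 0.0])     # one-hot ground truth: H(p) = 0
q = np.array([0.2, 0.5, 0.3])     # model's computed probabilities

# H(p, q) = H(p) + D_KL(p || q); with one-hot p the two sides reduce to KL
assert np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q))
```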
Classification over a full dataset

• Cross entropy loss function over the full dataset {x_i, y_i}, i = 1, …, N:
  J(θ) = (1/N) Σ_{i=1}^{N} −log( exp(f_{y_i}) / Σ_{c=1}^{C} exp(f_c) )
• Instead of writing f_y = W_y · x, we will write f in matrix notation: f = Wx
• We can still index elements of it based on class
Classification: Regularization!

• The really full loss function over any dataset includes regularization over all parameters θ:
  J(θ) = (1/N) Σ_{i=1}^{N} −log( exp(f_{y_i}) / Σ_{c=1}^{C} exp(f_c) ) + λ Σ_k θ_k²
• Regularization will prevent overfitting when we have a lot of features (or, later, a very powerful/deep model)
• In the usual overfitting picture: the x-axis is a more powerful model or more training iterations; blue is training error, red is test error
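A sketch of the regularized loss (the λ value and the toy parameter vector are illustrative, not prescribed by the slide):

```python
import numpy as np

def regularized_loss(data_loss, theta, lam=1e-3):
    # J(theta) = average cross-entropy loss + lambda * sum_k theta_k^2
    return data_loss + lam * np.sum(theta ** 2)

theta = np.array([3.0, -4.0])                 # toy parameter vector
J = regularized_loss(0.5, theta, lam=0.01)    # 0.5 + 0.01 * 25
```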
Details: General ML optimization

• For general machine learning, θ usually consists only of the columns of W, stacked into one long vector ∈ R^{Cd}
• So we only update the decision boundary

Visualizations with ConvNetJS by Karpathy
Classification difference with word vectors

• Common in deep learning:
  • Learn both W and the word vectors x
• The parameter vector then also contains all word vectors — very large!
• Overfitting danger!
Losing generalization by re-training word vectors

• Setting: training logistic regression for movie review sentiment on single words, and in the training data we have "TV" and "telly"
• In the testing data we have "television"
• Originally they were all similar (from pre-training the word vectors)
• What happens when we train the word vectors?
(Figure: "TV", "telly", and "television" start out close together in vector space)
Losing generalization by re-training word vectors

• What happens when we train the word vectors?
  • Those that are in the training data move around
  • Words from pre-training that do NOT appear in training stay put
• Example:
  • In training data: "TV" and "telly"
  • Only in testing data: "television"
(Figure: "TV" and "telly" have moved away, leaving "television" behind :()
Losing generalization by re-training word vectors

• Take home message: If you only have a small training dataset, don't train the word vectors. If you have a very large dataset, it may work better to train the word vectors to the task.
Side note on word vectors notation

• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Mostly from methods like word2vec or GloVe
• L is a d × V matrix with one column per vocabulary word: aardvark, …, meta, …, zebra
• These are the word features x_word from now on
• New development (later in the class): character models
Window classification

• Classifying single words is rarely done.
• Interesting problems like ambiguity arise in context!
• Example, auto-antonyms:
  • "To sanction" can mean "to permit" or "to punish."
  • "To seed" can mean "to place seeds" or "to remove seeds."
• Example, ambiguous named entities:
  • Paris → Paris, France vs. Paris Hilton
  • Hathaway → Berkshire Hathaway vs. Anne Hathaway
Window classification

• Idea: classify a word in its context window of neighboring words.
• For example, named entity recognition into 4 classes:
  • Person, location, organization, none
• Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information
Window classification

• Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it
• Example: classify "Paris" in the context of this sentence with window length 2:
  … museums in Paris are amazing …
  x_window = [x_museums x_in x_Paris x_are x_amazing]ᵀ
• Resulting vector x_window = x ∈ R^{5d}, a column vector!
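The concatenation step can be sketched like this (the lookup table `vectors` here is random, purely for illustration):

```python
import numpy as np

d = 4                                   # toy word-vector dimensionality
rng = np.random.default_rng(0)
# Hypothetical lookup table: one d-dimensional vector per word
vectors = {w: rng.standard_normal(d)
           for w in ["museums", "in", "Paris", "are", "amazing"]}

window = ["museums", "in", "Paris", "are", "amazing"]   # center word: Paris
x_window = np.concatenate([vectors[w] for w in window])

assert x_window.shape == (5 * d,)       # a single 5d-dimensional column vector
```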
Simplest window classifier: Softmax

• With x = x_window we can use the same softmax classifier as before to get the predicted model output probability
• With cross entropy error as before
• But how do you update the word vectors?
Updating concatenated word vectors

• Short answer: Just take derivatives as before
• Long answer: Let's go over the steps together (helpful for PSet 1)
• Define:
  • ŷ: softmax probability output vector (see previous slide)
  • t: target probability distribution (all 0's except at the ground truth index of class y, where it's 1)
  • f = f(x) = Wx ∈ R^C, and f_c = c'th element of the f vector
• Hard the first time, hence some tips now :)
Updating concatenated word vectors

• Tip 1: Carefully define your variables and keep track of their dimensionality!
• Tip 2: Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then:
  dy/dx = (dy/du)(du/dx)
• Simple example:
Updating concatenated word vectors

• Tip 2 continued: Know thy chain rule
  • Don't forget which variables depend on what, and that x appears inside all elements of f
• Tip 3: For the softmax part of the derivative: first take the derivative wrt f_c when c = y (the correct class), then take the derivative wrt f_c when c ≠ y (all the incorrect classes)
Updating concatenated word vectors

• Tip 4: When you take the derivative wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives
• Tip 5: To later not go insane (& for the implementation!) → express results in terms of vector operations and define single index-able vectors
Updating concatenated word vectors

• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. x_i or W_ij
• Tip 7: To clean it up for even more complex functions later: know the dimensionality of your variables & simplify into matrix notation
• Tip 8: Write this out in full sums if it's not clear!
Updating concatenated word vectors

• What is the dimensionality of the window vector gradient?
• x is the entire window, 5 d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality:
  ∂J/∂x ∈ R^{5d}
Updating concatenated word vectors

• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:
• Let δ_window = ∂J/∂x = Wᵀ(ŷ − t) ∈ R^{5d}
• With x_window = [x_museums x_in x_Paris x_are x_amazing]
• We have δ_window = [∇x_museums ∇x_in ∇x_Paris ∇x_are ∇x_amazing]ᵀ
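The split can be sketched as follows (random toy W and error signal, purely for illustration):

```python
import numpy as np

d, C = 4, 3
rng = np.random.default_rng(1)
W = rng.standard_normal((C, 5 * d))     # softmax weights over the whole window
delta = rng.standard_normal(C)          # error signal yhat - t at the scores

grad_x = W.T @ delta                    # gradient wrt the window, shape (5d,)
# Split it back into one gradient per word in the window:
g_museums, g_in, g_Paris, g_are, g_amazing = np.split(grad_x, 5)

assert g_Paris.shape == (d,)
```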
Updating concatenated word vectors

• This will push word vectors into areas such that they will be helpful in determining named entities.
• For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location
What's missing for training the window model?

• The gradient of J wrt the softmax weights W!
• Similar steps: write down the partial wrt W_ij first! Then we have the full ∂J/∂W
A note on matrix implementations

• There are two expensive operations in the softmax:
  • The matrix multiplication and the exp
• A for loop is never as efficient as a single large matrix multiplication!
• Example code →
A note on matrix implementations

• Looping over word vectors, versus concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:
  • Loop: 1000 loops, best of 3: 639 µs per loop
  • Matrix: 10000 loops, best of 3: 53.8 µs per loop
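A sketch of the comparison (toy sizes; absolute timings depend on your machine, so only the equivalence of the two results is checked here):

```python
import numpy as np
import time

d, C, N = 50, 5, 1000
rng = np.random.default_rng(2)
W = rng.standard_normal((C, d))
wordvectors = [rng.standard_normal(d) for _ in range(N)]

t0 = time.perf_counter()
scores_loop = [W @ x for x in wordvectors]   # one small matvec per word
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
X = np.column_stack(wordvectors)             # d x N matrix of all word vectors
scores_mat = W @ X                           # one big C x N matrix multiply
t_mat = time.perf_counter() - t0

# Same numbers either way; the matrix version is typically much faster
assert np.allclose(np.column_stack(scores_loop), scores_mat)
```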
A note on matrix implementations

• The result of the faster method is a C × N matrix:
• Each column is an f(x) in our notation (unnormalized class scores)
• Matrices are awesome!
• You should speed test your code a lot too
Softmax (= logistic regression) alone is not very powerful

• Softmax only gives linear decision boundaries in the original space.
• With little data that can be a good regularizer
• With more data it is very limiting!
Softmax (= logistic regression) is not very powerful

• Softmax gives only linear decision boundaries
• → Lame when the problem is complex
• Wouldn't it be cool to get these correct?
Neural Nets for the Win!

• Neural networks can learn much more complex functions and nonlinear decision boundaries!
From logistic regression to neural nets
Demystifying neural networks

Neural networks come with their own terminological baggage. But if you understand how softmax models work, then you already understand the operation of a basic neuron!

A single neuron: a computational unit with n (= 3) inputs and 1 output, and parameters W, b
(Diagram: inputs → activation function → output; the bias unit corresponds to the intercept term)
A neuron is essentially a binary logistic regression unit

h_{w,b}(x) = f(wᵀx + b)
f(z) = 1 / (1 + e^{−z})

w, b are the parameters of this neuron, i.e., this logistic regression model
b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term
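A single neuron in a few lines (toy x, w, b chosen for illustration):

```python
import numpy as np

def neuron(x, w, b):
    # h_{w,b}(x) = f(w^T x + b), with logistic f(z) = 1 / (1 + e^{-z})
    z = w @ x + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -2.0, 0.5])   # n = 3 inputs
w = np.array([0.4, 0.3, -0.1])   # weights
b = 0.0                          # bias (the intercept term)

h = neuron(x, w, b)              # a number in (0, 1), like a class probability
```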
A neural network = running several logistic regressions at the same time

If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs…
But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!
A neural network = running several logistic regressions at the same time

…which we can feed into another logistic regression function.
It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
A neural network = running several logistic regressions at the same time

Before we know it, we have a multilayer neural network…
Matrix notation for a layer

We have:
a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
etc.

In matrix notation:
z = Wx + b
a = f(z)

where f is applied element-wise:
f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]
Non-linearities (f): Why they're needed

• Example: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = Wx
• With more layers, they can approximate more complex functions!
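The "compiled down" claim is easy to verify numerically: two stacked linear layers with no non-linearity are exactly one linear layer with W = W_1 W_2 (random toy matrices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.standard_normal((4, 5))
W2 = rng.standard_normal((5, 6))
x = rng.standard_normal(6)

two_layers = W1 @ (W2 @ x)        # apply W2, then W1, no non-linearity between
one_layer = (W1 @ W2) @ x         # the single equivalent linear transform

assert np.allclose(two_layers, one_layer)
```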
A more powerful, neural net window classifier

• Revisiting x_window = [x_museums x_in x_Paris x_are x_amazing]
• Assume we want to classify whether the center word is a location or not
A Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity:
  z = Wx + b, a = f(z)
• The neural activations a can then be used to compute some output
• For instance, a probability via softmax: p(y|x) = softmax(Wa)
• Or an unnormalized score (even simpler): s = Uᵀa
Summary: Feed-forward Computation

Computing a window's score with a 3-layer neural net: s = score(museums in Paris are amazing)
s = Uᵀ f(Wx + b)
x_window = [x_museums x_in x_Paris x_are x_amazing]
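The whole feed-forward pass fits in a few lines (all sizes and parameter values here are toys; logistic f assumed, as in the earlier slides):

```python
import numpy as np

def score(x, W, b, U):
    # Hidden layer: a = f(Wx + b), with logistic f applied element-wise
    a = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    # Unnormalized window score: s = U^T a
    return U @ a

dim, h = 20, 8                      # 5d window with d = 4; h hidden units
rng = np.random.default_rng(4)
x_window = rng.standard_normal(dim)
W = rng.standard_normal((h, dim))
b = rng.standard_normal(h)
U = rng.standard_normal(h)

s = score(x_window, W, b, U)        # a single scalar score for the window
```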
Main intuition for extra layer

The layer learns non-linear interactions between the input word vectors.
Example: only if "museums" is the first vector should it matter that "in" is in the second position
x_window = [x_museums x_in x_Paris x_are x_amazing]
The max-margin loss

• s = score(museums in Paris are amazing)
• s_c = score(Not all museums in Paris)
• Idea for training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough): minimize
  J = max(0, 1 − s + s_c)
• This is continuous → we can use SGD
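The hinge shape of this loss is worth seeing concretely — once the true window beats the corrupt one by a margin of 1, the loss (and hence the gradient) is exactly zero:

```python
def max_margin(s, s_c):
    # J = max(0, 1 - s + s_c): zero once the true window's score s
    # exceeds the corrupt window's score s_c by at least 1
    return max(0.0, 1.0 - s + s_c)

assert max_margin(s=3.0, s_c=0.5) == 0.0   # margin already satisfied
assert max_margin(s=1.0, s_c=0.5) == 0.5   # still some loss to push down
```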
Max-margin objective function

• Objective for a single window:
  J = max(0, 1 − s + s_c)
• Each window with a location at its center should have a score +1 higher than any window without a location at its center
• For the full objective function: sample several corrupt windows per true one. Sum over all training windows.
Training with Backpropagation

Assuming the cost J is > 0, compute the derivatives of s and s_c wrt all the involved variables: U, W, b, x
Training with Backpropagation

• Let's consider the derivative of a single weight W_ij
• W_ij only appears inside a_i
• For example: W_23 is only used to compute a_2
(Diagram: inputs x_1, x_2, x_3 and bias +1 feed hidden units a_1, a_2; the hidden units feed the score s through U; W_23 connects x_3 to a_2)
Training with Backpropagation

Derivative of weight W_ij: since s = Uᵀa and W_ij only appears inside a_i,
∂s/∂W_ij = U_i · ∂a_i/∂W_ij
Training with Backpropagation

Derivative of a single weight W_ij:
∂s/∂W_ij = U_i f′(z_i) x_j, where for logistic f: f′(z) = f(z)(1 − f(z))
U_i f′(z_i) is the local error signal; x_j is the local input signal.
Training with Backpropagation

• From a single weight W_ij to the full W:
• We want all combinations of i = 1, 2 and j = 1, 2, 3 → ?
• Solution: the outer product ∂s/∂W = δ xᵀ, where δ is the "responsibility" or error signal coming from each activation a
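A sketch of this outer-product gradient for the 2-hidden-unit, 3-input network in the figure, checked against a finite difference on the single entry W_23 (random toy parameters; logistic f assumed):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
W = rng.standard_normal((2, 3))    # 2 hidden units, 3 inputs, as in the figure
b = rng.standard_normal(2)
U = rng.standard_normal(2)
x = rng.standard_normal(3)

z = W @ x + b
a = sigmoid(z)
s = U @ a                          # the window score

delta = U * a * (1 - a)            # delta_i = U_i f'(z_i); f'(z) = f(z)(1 - f(z))
dW = np.outer(delta, x)            # ds/dW_ij = delta_i x_j, all i, j at once

# Finite-difference check of one entry, ds/dW_23 (0-based indices [1, 2])
eps = 1e-6
Wp = W.copy()
Wp[1, 2] += eps
sp = U @ sigmoid(Wp @ x + b)
assert np.isclose((sp - s) / eps, dW[1, 2], atol=1e-4)
```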
Training with Backpropagation

• For the biases b, we get: ∂s/∂b = δ
Training with Backpropagation

That's almost backpropagation: it's taking derivatives and using the chain rule.
Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers!
Example: the last derivatives of the model, the word vectors in x
TrainingwithBackpropaga6on
• Takederiva(veofscorewithrespecttosingleelementofwordvector
• Now,wecannotjusttakeintoconsidera(ononeaibecauseeachxjisconnectedtoalltheneuronsaboveandhencexjinfluencestheoverallscorethroughallofthese,hence:
Re-usedpartofpreviousderiva(ve57
Training with Backpropagation

• With ∂s/∂x_j = Σ_i δ_i W_ij, what is the full gradient? → ∂s/∂x = Wᵀδ
• Observation: the error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer
Putting all gradients together:

• Remember: the full objective function for each window was:
  J = max(0, 1 − s + s_c)
• For example, the gradient for U (when J > 0; zero otherwise):
  ∂J/∂U = −a + a_c
  where a and a_c are the hidden activations for the true and the corrupt window
Summary

Congrats! Super useful basic components and a real model:
• Word vector training
• Windows
• Softmax and cross entropy error → PSet 1
• Scores and max-margin loss
• Neural network → PSet 1

One more half of a math-heavy lecture. Then the rest will be easier and more applied :)
Next lecture:

Project advice
Taking more and deeper derivatives → full backprop
Then we have all the basic tools in place to learn about more complex models and have some fun :)