TTIC 31190: Natural Language Processing
Kevin Gimpel, Winter 2016
Lecture 9: Sequence Models
Announcements
• on Thursday, class will be in Room 530 (the room directly behind you)
• we will go over part of Assignment 1 today (grades coming soon)
• Assignment 2 was due Wed., Feb. 3; now due Fri., Feb. 5
• project proposal due Tuesday, Feb. 16
• midterm on Thursday, Feb. 18
Other Naturally-Occurring Data
• quality of scientific journalism
• memorability of quotations
• sarcasm (remove the #sarcasm hashtag from tweets)
• opening weekend movie revenue prediction from critic reviews
• predicting novel success from the text of novels
Project Proposal
• due Feb. 16 (in two weeks)
• 1-2 pages
• one per group
• include the following:
– members of your group
– describe the task you are going to work on (could be a new task you create or an existing task)
– describe the methods you will use/develop for the task
– give a brief review of related work; i.e., situate your project with respect to the literature (www.aclweb.org and Google Scholar are useful for this!)
– a proposed timeline
Project Proposal (cont'd)
• your results do not have to beat the state of the art!
• but your project does have to be carefully done, so that you can draw conclusions
• you are welcome to start by replicating an NLP paper (I can give suggestions if you need some)
• during the week of Feb. 22, please schedule a meeting with me to discuss your project
– details to follow
Class Presentations
• final two class meetings (March 3rd and March 8th) will be mostly used for in-class presentations
• one presentation per group
• 10-15 minutes per presentation (will be determined once I know how many groups there are)
• you will each take notes and email me questions/feedback for the presenter, which I will anonymize and send
Project
• final report due Thursday, March 17 (original date of the final exam)
• so the presentation will be more like an "interim progress report"
Roadmap
• classification
• words
• lexical semantics
• language modeling
• sequence labeling
• neural network methods in NLP
• syntax and syntactic parsing
• semantic compositionality
• semantic parsing
• unsupervised learning
• machine translation and other applications
Part-of-Speech Tagging

Some        questioned   if     Tim     Cook    's     first  product
determiner  verb (past)  prep.  proper  proper  poss.  adj.   noun

would  be    a     breakaway  hit   for    Apple   .
modal  verb  det.  adjective  noun  prep.  proper  punc.

Simplest kind of structured prediction: sequence labeling
Named Entity Recognition

Formulating segmentation tasks as sequence labeling via B-I-O labeling
(B = "begin", I = "inside", O = "outside"):

Some  questioned  if  Tim       Cook      's  first  product
O     O           O   B-PERSON  I-PERSON  O   O      O

would  be  a  breakaway  hit  for  Apple           .
O      O   O  O          O    O    B-ORGANIZATION  O
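To make the B-I-O encoding concrete, here is a minimal sketch (the span format and function name are my own, not from the lecture) that converts entity spans into one label per token:

```python
# Hypothetical helper illustrating B-I-O encoding: entity spans in, one
# label per token out, so segmentation becomes sequence labeling.
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, type) with end exclusive (an assumption)."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # tokens inside the entity
    return labels

tokens = ["Some", "questioned", "if", "Tim", "Cook", "'s", "first", "product"]
print(spans_to_bio(tokens, [(3, 5, "PERSON")]))
# ['O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'O', 'O']
```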
• there are many downloadable part-of-speech taggers and named entity recognizers:
– Stanford POS tagger, NER labeler
– TurboTagger, TurboEntityRecognizer
– Illinois Entity Tagger
– CMU Twitter POS tagger
– Alan Ritter's Twitter POS/NER labeler
Hidden Markov Models

y1 → y2 → y3 → y4
|     |     |     |
x1    x2    x3    x4

transition parameters: p(y_t | y_{t-1})
emission parameters: p(x_t | y_t)
joint probability: p(x, y) = ∏_t p(y_t | y_{t-1}) p(x_t | y_t)
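As a sanity check on the notation, here is a minimal sketch (dictionary-based parameter tables are my assumption) of how an HMM scores a tagged sentence as a product of transitions and emissions, computed in log space:

```python
import math

# Minimal sketch: log p(x, y) for an HMM, summing log transition and
# log emission probabilities; trans[(prev_tag, tag)] and emit[(tag, word)]
# are probability tables (an assumed representation).
def hmm_log_prob(xs, ys, trans, emit, start="<s>"):
    logp = 0.0
    prev = start
    for x, y in zip(xs, ys):
        logp += math.log(trans[(prev, y)]) + math.log(emit[(y, x)])
        prev = y
    return logp
```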
HMMs for Word Clustering (Brown et al., 1992)

justin  bieber  for  president
y1      y2      y3   y4

each y_t is a cluster ID, so the label space is the set of cluster IDs {1, ..., K}
HMMs for Part-of-Speech Tagging

justin       bieber       for          president
proper noun  proper noun  preposition  noun

each y_t is a part-of-speech tag, so the label space is the set of POS tags

what parameters need to be learned?
transition parameters: p(y_t | y_{t-1})
emission parameters: p(x_t | y_t)
How should we learn the HMM parameters?
transition parameters: p(y_t | y_{t-1})
emission parameters: p(x_t | y_t)
Supervised HMMs
• given a dataset of input sequences and annotated outputs, e.g.:

justin       bieber       for          president
proper noun  proper noun  preposition  noun

• to estimate transition/emission distributions, use maximum likelihood estimation (count and normalize):

transition: p(y' | y) = count(y, y') / count(y)
emission: p(x | y) = count(y, x) / count(y)
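A minimal sketch of count-and-normalize estimation, assuming the corpus is a list of (word, tag) sequences (the data format and names are mine):

```python
from collections import Counter

# Count tag bigrams, tag-word pairs, and tag occurrences, then normalize.
def estimate_hmm(tagged_sents, start="<s>"):
    trans_c, emit_c, tag_c = Counter(), Counter(), Counter()
    for sent in tagged_sents:              # sent: list of (word, tag) pairs
        prev = start
        tag_c[start] += 1                  # start symbol conditions the first tag
        for word, tag in sent:
            trans_c[(prev, tag)] += 1
            emit_c[(tag, word)] += 1
            tag_c[tag] += 1
            prev = tag
    trans = {(y1, y2): c / tag_c[y1] for (y1, y2), c in trans_c.items()}
    emit = {(y, x): c / tag_c[y] for (y, x), c in emit_c.items()}
    return trans, emit
```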
Estimates of Tag Transition Probabilities
[table of estimated transition probabilities among tags such as proper noun, modal verb, infinitive verb, adjective, adverb, determiner, noun]

Estimates of Emission Probabilities
[table of estimated emission probabilities for the same tags]
Inference in HMMs

ŷ = argmax_y p(y | x)

• since the output is a sequence, this argmax requires iterating over an exponentially large set
• last week we talked about using dynamic programming (DP) to solve these problems
• for HMMs (and other sequence models), the DP algorithm for solving this is called the Viterbi algorithm
Viterbi Algorithm
• recursive equations + memoization:

base case: V(1, y) = p(y | <s>) p(x_1 | y)
returns probability of sequence starting with label y for first word

recursive case: V(m, y) = max_{y'} V(m-1, y') p(y | y') p(x_m | y)
computes probability of max-probability label sequence that ends with label y at position m

final value is in: max_y V(|x|, y)
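A minimal sketch of the Viterbi recursion above (names are mine; unseen transitions/emissions are treated as probability zero):

```python
# V[m][y]: probability of the best label sequence ending with y at position m.
def viterbi(xs, labels, trans, emit, start="<s>"):
    V = [{y: trans.get((start, y), 0.0) * emit.get((y, xs[0]), 0.0)
          for y in labels}]                              # base case
    back = [{y: None for y in labels}]
    for m in range(1, len(xs)):                          # recursive case
        V.append({}); back.append({})
        for y in labels:
            best = max(labels, key=lambda yp: V[m-1][yp] * trans.get((yp, y), 0.0))
            V[m][y] = V[m-1][best] * trans.get((best, y), 0.0) * emit.get((y, xs[m]), 0.0)
            back[m][y] = best
    y = max(labels, key=lambda yy: V[-1][yy])            # final value
    path = [y]
    for m in range(len(xs) - 1, 0, -1):                  # follow backpointers
        y = back[m][y]
        path.append(y)
    return list(reversed(path))
```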
Example:

Janet        will        back             the         bill
proper noun  modal verb  infinitive verb  determiner  noun

Viterbi Algorithm (on board)
Viterbi Algorithm
• space and time complexity?
• can be read off from the recursive equations:

space complexity: size of the memoization table, which is the number of unique indices of the recursive equations
so, space complexity is O(|x| |L|), where |x| is the length of the sentence and |L| is the number of labels

time complexity: size of the memoization table × cost of computing each entry; each entry requires iterating through the labels
so, time complexity is O(|x| |L| · |L|) = O(|x| |L|^2)
Linear Sequence Models
• let's generalize HMMs and talk about linear models for scoring label sequences in our classifier framework:

score(x, y) = w · f(x, y)

• but first, how do we know that this scoring function generalizes HMMs?
HMM as a Linear Model
• what are the feature templates and weights?

HMM: log p(x, y) = Σ_t [ log p(y_t | y_{t-1}) + log p(x_t | y_t) ]
linear model: score(x, y) = w · f(x, y)
HMM as a Linear Model
feature templates and weights:
• transition features: count of positions t with y_{t-1} = y and y_t = y'; weight = log p(y' | y)
• emission features: count of positions t with y_t = y and x_t = x; weight = log p(x | y)
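A minimal sketch of this correspondence (feature naming is mine): with indicator features counting tag bigrams and tag-word pairs, and weights set to log probabilities, w · f(x, y) equals log p(x, y):

```python
from collections import Counter

# Count how often each transition and emission feature fires in (x, y).
def hmm_features(xs, ys, start="<s>"):
    f = Counter()
    prev = start
    for x, y in zip(xs, ys):
        f[("trans", prev, y)] += 1
        f[("emit", y, x)] += 1
        prev = y
    return f

# With w[("trans", y, y2)] = log p(y2 | y) and w[("emit", y, x)] = log p(x | y),
# the linear score w . f(x, y) recovers the HMM's log joint probability.
def score(w, f):
    return sum(w[k] * v for k, v in f.items())
```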
Linear Sequence Models
• so, an HMM is:
– a linear sequence model
– with particular features on label transitions and label-observation emissions
– and uses maximum likelihood estimation (count & normalize) for learning
• but we could use any feature functions we like, and use any of our loss functions for learning!
(Chain) Conditional Random Fields
• linear sequence model
• arbitrary features of the input are permitted
• test-time inference uses the Viterbi algorithm
• learning done by minimizing log loss (DP algorithms used to compute gradients)
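Computing the log-loss gradient requires the normalizer Z(x), a sum over all label sequences. A minimal sketch (the local score interface is my assumption) of the forward DP that computes log Z(x); it uses the same table as Viterbi, with max replaced by logsumexp:

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# score(m, y_prev, y): local log-score of label y at position m (assumed
# interface); y_prev is None at the first position.
def log_partition(xs, labels, score):
    alpha = {y: score(0, None, y) for y in labels}
    for m in range(1, len(xs)):
        alpha = {y: logsumexp([alpha[yp] + score(m, yp, y) for yp in labels])
                 for y in labels}
    return logsumexp(list(alpha.values()))
```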
Maximum-Margin Markov Networks
• linear sequence model
• arbitrary features of the input are permitted
• test-time inference uses the Viterbi algorithm
• learning done by minimizing hinge loss (DP algorithm used to compute subgradients)
Feature Locality
• feature locality: roughly, how "big" are your features?
• when designing efficient inference algorithms (whether with DP or other methods), we need to be mindful of this
• features can be arbitrarily big in terms of the input sequence
• but features cannot be arbitrarily big in terms of the output sequence!
• the features in HMMs are small in both the input and output sequences (only two pieces at a time)
Are these features big or small?
• feature that counts instances of "the" in the input sentence: small
• feature that returns the square root of the sum of counts of am/is/was/were: small
• feature that counts "verb verb" sequences: small
• feature that counts "determiner noun verb verb" sequences: pretty big!
• feature that counts the number of nouns in a sentence: big, but we can design specialized algorithms to handle such features if they're the only big ones
• feature that returns the ratio of nouns to verbs: big, but likewise manageable with specialized algorithms
Learning with Linear Sequence Models
• given a linear sequence model with "small" features, how should we do learning?
Loss Functions for Learning Linear Sequence Models

loss       | entry j of (sub)gradient of loss for linear model
perceptron | f_j(x, ŷ) - f_j(x, y), where ŷ = argmax_{y'} w · f(x, y')
hinge      | f_j(x, ŷ) - f_j(x, y), where ŷ = argmax_{y'} (w · f(x, y') + cost(y, y'))
log        | E_{p_w(y'|x)}[f_j(x, y')] - f_j(x, y)

same gradients/subgradients as before, though computing these terms (inference) requires DP algorithms
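For instance, the perceptron row amounts to the following update, sketched with assumed decode/features interfaces (Viterbi supplies the decode step):

```python
# One structured perceptron step: decode, then move weights toward gold
# features and away from predicted features when the prediction is wrong.
def perceptron_update(w, xs, gold, decode, features, lr=1.0):
    pred = decode(w, xs)                 # e.g., Viterbi under current weights
    if pred != gold:
        gold_f, pred_f = features(xs, gold), features(xs, pred)
        for k in set(gold_f) | set(pred_f):
            w[k] = w.get(k, 0.0) + lr * (gold_f[k] - pred_f[k])
    return w
```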
Implementing DP Algorithms
• start with counting mode, but keep in mind how the model's score function decomposes across parts of the outputs (see the sketch after this list)
– i.e., how "large" are the features? how many items in the output sequence are needed to compute each feature?
– define a function called partScore that computes all the features (for counting mode, this function will return 1)
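A minimal sketch of the partScore idea (the signature is my assumption; the slide only names the function): for HMM-style feature locality, a "part" is one position together with its label bigram:

```python
# Sum of weights of all features firing on one part of the output.
def part_score(w, xs, m, y_prev, y):
    return (w.get(("trans", y_prev, y), 0.0)
            + w.get(("emit", y, xs[m]), 0.0))

# Counting mode (a debugging aid): make partScore return 1 and replace max
# with sum in the DP; the result should equal the number of label sequences,
# |L| ** len(xs), confirming the DP visits every sequence exactly once.
```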
Neural Networks in NLP
• neural networks
• deep neural networks
• neural language models
• recurrent neural networks and LSTMs
• convolutional neural networks

What is a neural network?
• just think of a neural network as a function
• it has inputs and outputs
• the term "neural" typically means a particular type of functional building block ("neural layers"), but the term has expanded to mean many things