TTIC 31190: Natural Language Processing
Kevin Gimpel, Winter 2016
Lecture 9: Sequence Models
Announcements
• on Thursday, class will be in Room 530 (the room directly behind you)
• we will go over part of Assignment 1 today (grades coming soon)
• Assignment 2 was due Wed., Feb. 3; now due Fri., Feb. 5
• project proposal due Tuesday, Feb. 16
• midterm on Thursday, Feb. 18
Other Naturally-Occurring Data
• quality of scientific journalism
• memorability of quotations
• sarcasm (remove the #sarcasm hashtag from tweets)
• opening weekend movie revenue prediction from critic reviews
• predicting novel success from the text of novels
Project Proposal
• due Feb. 16 (in two weeks)
• 1-2 pages
• one per group
• include the following:
– members of your group
– describe the task you are going to work on (could be a new task you create or an existing task)
– describe the methods you will use/develop for the task
– give a brief review of related work; i.e., situate your project with respect to the literature (www.aclweb.org and Google Scholar are useful for this!)
– a proposed timeline
Project Proposal (cont'd)
• your results do not have to beat the state of the art!
• but your project does have to be carefully done, so that you can draw conclusions
• you are welcome to start by replicating an NLP paper (I can give suggestions if you need some)
• during the week of Feb. 22, please schedule a meeting with me to discuss your project
– details to follow
Class Presentations
• final two class meetings (March 3rd and March 8th) will be mostly used for in-class presentations
• one presentation per group
• 10-15 minutes per presentation (will be determined once I know how many groups there are)
• you will each take notes and email me questions/feedback for the presenter, which I will anonymize and send
Project
• final report due Thursday, March 17 (original date of the final exam)
• so the presentation will be more like an "interim progress report"
Roadmap
• classification
• words
• lexical semantics
• language modeling
• sequence labeling
• neural network methods in NLP
• syntax and syntactic parsing
• semantic compositionality
• semantic parsing
• unsupervised learning
• machine translation and other applications
Part-of-Speech Tagging

Some        questioned   if     Tim     Cook    's     first  product
determiner  verb (past)  prep.  proper  proper  poss.  adj.   noun

would  be    a     breakaway  hit   for    Apple   .
modal  verb  det.  adjective  noun  prep.  proper  punc.

Simplest kind of structured prediction: sequence labeling
Named Entity Recognition

Formulating segmentation tasks as sequence labeling via B-I-O labeling
(B = "begin", I = "inside", O = "outside"):

Some  questioned  if  Tim       Cook      's  first  product
O     O           O   B-PERSON  I-PERSON  O   O      O

would  be  a  breakaway  hit  for  Apple           .
O      O   O  O          O    O    B-ORGANIZATION  O
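To make the B-I-O encoding concrete, here is a minimal sketch (the span format and function name are my own, not from the lecture) that converts entity spans into one label per token:

```python
# Hypothetical helper illustrating B-I-O encoding: entity spans in, one
# label per token out, so segmentation becomes sequence labeling.
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, type) with end exclusive (an assumption)."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # tokens inside the entity
    return labels

tokens = ["Some", "questioned", "if", "Tim", "Cook", "'s", "first", "product"]
print(spans_to_bio(tokens, [(3, 5, "PERSON")]))
# ['O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'O', 'O']
```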
• there are many downloadable part-of-speech taggers and named entity recognizers:
– Stanford POS tagger, NER labeler
– TurboTagger, TurboEntityRecognizer
– Illinois Entity Tagger
– CMU Twitter POS tagger
– Alan Ritter's Twitter POS/NER labeler
Hidden Markov Models

y1 → y2 → y3 → y4
|     |     |     |
x1    x2    x3    x4

transition parameters: p(y_t | y_{t-1})
emission parameters: p(x_t | y_t)
joint probability: p(x, y) = ∏_t p(y_t | y_{t-1}) p(x_t | y_t)
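As a sanity check on the notation, here is a minimal sketch (dictionary-based parameter tables are my assumption) of how an HMM scores a tagged sentence as a product of transitions and emissions, computed in log space:

```python
import math

# Minimal sketch: log p(x, y) for an HMM, summing log transition and
# log emission probabilities; trans[(prev_tag, tag)] and emit[(tag, word)]
# are probability tables (an assumed representation).
def hmm_log_prob(xs, ys, trans, emit, start="<s>"):
    logp = 0.0
    prev = start
    for x, y in zip(xs, ys):
        logp += math.log(trans[(prev, y)]) + math.log(emit[(y, x)])
        prev = y
    return logp
```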
HMMs for Word Clustering (Brown et al., 1992)

justin  bieber  for  president
y1      y2      y3   y4

each y_t is a cluster ID, so the label space is the set of cluster IDs {1, ..., K}
HMMs for Part-of-Speech Tagging

justin       bieber       for          president
proper noun  proper noun  preposition  noun

each y_t is a part-of-speech tag, so the label space is the set of POS tags

what parameters need to be learned?
transition parameters: p(y_t | y_{t-1})
emission parameters: p(x_t | y_t)
How should we learn the HMM parameters?
transition parameters: p(y_t | y_{t-1})
emission parameters: p(x_t | y_t)
Supervised HMMs
• given a dataset of input sequences and annotated outputs, e.g.:

justin       bieber       for          president
proper noun  proper noun  preposition  noun

• to estimate transition/emission distributions, use maximum likelihood estimation (count and normalize):

transition: p(y' | y) = count(y, y') / count(y)
emission: p(x | y) = count(y, x) / count(y)
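A minimal sketch of count-and-normalize estimation, assuming the corpus is a list of (word, tag) sequences (the data format and names are mine):

```python
from collections import Counter

# Count tag bigrams, tag-word pairs, and tag occurrences, then normalize.
def estimate_hmm(tagged_sents, start="<s>"):
    trans_c, emit_c, tag_c = Counter(), Counter(), Counter()
    for sent in tagged_sents:              # sent: list of (word, tag) pairs
        prev = start
        tag_c[start] += 1                  # start symbol conditions the first tag
        for word, tag in sent:
            trans_c[(prev, tag)] += 1
            emit_c[(tag, word)] += 1
            tag_c[tag] += 1
            prev = tag
    trans = {(y1, y2): c / tag_c[y1] for (y1, y2), c in trans_c.items()}
    emit = {(y, x): c / tag_c[y] for (y, x), c in emit_c.items()}
    return trans, emit
```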
Estimates of Tag Transition Probabilities
[table of estimated transition probabilities among tags such as proper noun, modal verb, infinitive verb, adjective, adverb, determiner, noun]

Estimates of Emission Probabilities
[table of estimated emission probabilities for the same tags]
Inference in HMMs

ŷ = argmax_y p(y | x)

• since the output is a sequence, this argmax requires iterating over an exponentially large set
• last week we talked about using dynamic programming (DP) to solve these problems
• for HMMs (and other sequence models), the DP algorithm for solving this is called the Viterbi algorithm
Viterbi Algorithm
• recursive equations + memoization:

base case: V(1, y) = p(y | <s>) p(x_1 | y)
returns probability of sequence starting with label y for first word

recursive case: V(m, y) = max_{y'} V(m-1, y') p(y | y') p(x_m | y)
computes probability of max-probability label sequence that ends with label y at position m

final value is in: max_y V(|x|, y)
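A minimal sketch of the Viterbi recursion above (names are mine; unseen transitions/emissions are treated as probability zero):

```python
# V[m][y]: probability of the best label sequence ending with y at position m.
def viterbi(xs, labels, trans, emit, start="<s>"):
    V = [{y: trans.get((start, y), 0.0) * emit.get((y, xs[0]), 0.0)
          for y in labels}]                              # base case
    back = [{y: None for y in labels}]
    for m in range(1, len(xs)):                          # recursive case
        V.append({}); back.append({})
        for y in labels:
            best = max(labels, key=lambda yp: V[m-1][yp] * trans.get((yp, y), 0.0))
            V[m][y] = V[m-1][best] * trans.get((best, y), 0.0) * emit.get((y, xs[m]), 0.0)
            back[m][y] = best
    y = max(labels, key=lambda yy: V[-1][yy])            # final value
    path = [y]
    for m in range(len(xs) - 1, 0, -1):                  # follow backpointers
        y = back[m][y]
        path.append(y)
    return list(reversed(path))
```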
Example:

Janet        will        back             the         bill
proper noun  modal verb  infinitive verb  determiner  noun

Viterbi Algorithm (on board)
Viterbi Algorithm
• space and time complexity?
• can be read off from the recursive equations:

space complexity: size of the memoization table, which is the number of unique indices of the recursive equations
so, space complexity is O(|x| |L|), where |x| is the length of the sentence and |L| is the number of labels

time complexity: size of the memoization table × cost of computing each entry; each entry requires iterating through the labels
so, time complexity is O(|x| |L| · |L|) = O(|x| |L|^2)
Linear Sequence Models
• let's generalize HMMs and talk about linear models for scoring label sequences in our classifier framework:

score(x, y) = w · f(x, y)

• but first, how do we know that this scoring function generalizes HMMs?
HMM as a Linear Model
• what are the feature templates and weights?

HMM: log p(x, y) = Σ_t [ log p(y_t | y_{t-1}) + log p(x_t | y_t) ]
linear model: score(x, y) = w · f(x, y)
HMM as a Linear Model
feature templates and weights:
• transition features: count of positions t with y_{t-1} = y and y_t = y'; weight = log p(y' | y)
• emission features: count of positions t with y_t = y and x_t = x; weight = log p(x | y)
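A minimal sketch of this correspondence (feature naming is mine): with indicator features counting tag bigrams and tag-word pairs, and weights set to log probabilities, w · f(x, y) equals log p(x, y):

```python
from collections import Counter

# Count how often each transition and emission feature fires in (x, y).
def hmm_features(xs, ys, start="<s>"):
    f = Counter()
    prev = start
    for x, y in zip(xs, ys):
        f[("trans", prev, y)] += 1
        f[("emit", y, x)] += 1
        prev = y
    return f

# With w[("trans", y, y2)] = log p(y2 | y) and w[("emit", y, x)] = log p(x | y),
# the linear score w . f(x, y) recovers the HMM's log joint probability.
def score(w, f):
    return sum(w[k] * v for k, v in f.items())
```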
Linear Sequence Models
• so, an HMM is:
– a linear sequence model
– with particular features on label transitions and label-observation emissions
– and uses maximum likelihood estimation (count & normalize) for learning
• but we could use any feature functions we like, and use any of our loss functions for learning!
(Chain) Conditional Random Fields
• linear sequence model
• arbitrary features of the input are permitted
• test-time inference uses the Viterbi algorithm
• learning done by minimizing log loss (DP algorithms used to compute gradients)
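Computing the log-loss gradient requires the normalizer Z(x), a sum over all label sequences. A minimal sketch (the local score interface is my assumption) of the forward DP that computes log Z(x); it uses the same table as Viterbi, with max replaced by logsumexp:

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# score(m, y_prev, y): local log-score of label y at position m (assumed
# interface); y_prev is None at the first position.
def log_partition(xs, labels, score):
    alpha = {y: score(0, None, y) for y in labels}
    for m in range(1, len(xs)):
        alpha = {y: logsumexp([alpha[yp] + score(m, yp, y) for yp in labels])
                 for y in labels}
    return logsumexp(list(alpha.values()))
```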
Maximum-Margin Markov Networks
• linear sequence model
• arbitrary features of the input are permitted
• test-time inference uses the Viterbi algorithm
• learning done by minimizing hinge loss (DP algorithm used to compute subgradients)
Feature Locality
• feature locality: roughly, how "big" are your features?
• when designing efficient inference algorithms (whether with DP or other methods), we need to be mindful of this
• features can be arbitrarily big in terms of the input sequence
• but features cannot be arbitrarily big in terms of the output sequence!
• the features in HMMs are small in both the input and output sequences (only two pieces at a time)
Are these features big or small?
• feature that counts instances of "the" in the input sentence: small
• feature that returns the square root of the sum of counts of am/is/was/were: small
• feature that counts "verb verb" sequences: small
• feature that counts "determiner noun verb verb" sequences: pretty big!
• feature that counts the number of nouns in a sentence: big, but we can design specialized algorithms to handle such features if they're the only big ones
• feature that returns the ratio of nouns to verbs: big, but likewise manageable with specialized algorithms
Learning with Linear Sequence Models
• given a linear sequence model with "small" features, how should we do learning?
Loss Functions for Learning Linear Sequence Models

loss       | entry j of (sub)gradient of loss for linear model
perceptron | f_j(x, ŷ) - f_j(x, y), where ŷ = argmax_{y'} w · f(x, y')
hinge      | f_j(x, ŷ) - f_j(x, y), where ŷ = argmax_{y'} (w · f(x, y') + cost(y, y'))
log        | E_{p_w(y'|x)}[f_j(x, y')] - f_j(x, y)

same gradients/subgradients as before, though computing these terms (inference) requires DP algorithms
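For instance, the perceptron row amounts to the following update, sketched with assumed decode/features interfaces (Viterbi supplies the decode step):

```python
# One structured perceptron step: decode, then move weights toward gold
# features and away from predicted features when the prediction is wrong.
def perceptron_update(w, xs, gold, decode, features, lr=1.0):
    pred = decode(w, xs)                 # e.g., Viterbi under current weights
    if pred != gold:
        gold_f, pred_f = features(xs, gold), features(xs, pred)
        for k in set(gold_f) | set(pred_f):
            w[k] = w.get(k, 0.0) + lr * (gold_f[k] - pred_f[k])
    return w
```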
Implementing DP Algorithms
• start with counting mode, but keep in mind how the model's score function decomposes across parts of the outputs (see the sketch after this list)
– i.e., how "large" are the features? how many items in the output sequence are needed to compute each feature?
– define a function called partScore that computes all the features (for counting mode, this function will return 1)
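A minimal sketch of the partScore idea (the signature is my assumption; the slide only names the function): for HMM-style feature locality, a "part" is one position together with its label bigram:

```python
# Sum of weights of all features firing on one part of the output.
def part_score(w, xs, m, y_prev, y):
    return (w.get(("trans", y_prev, y), 0.0)
            + w.get(("emit", y, xs[m]), 0.0))

# Counting mode (a debugging aid): make partScore return 1 and replace max
# with sum in the DP; the result should equal the number of label sequences,
# |L| ** len(xs), confirming the DP visits every sequence exactly once.
```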
Neural Networks in NLP
• neural networks
• deep neural networks
• neural language models
• recurrent neural networks and LSTMs
• convolutional neural networks

What is a neural network?
• just think of a neural network as a function
• it has inputs and outputs
• the term "neural" typically means a particular type of functional building block ("neural layers"), but the term has expanded to mean many things