Algorithms for NLP
demo.clab.cs.cmu.edu/11711fa18/slides/lecture_2_language_models_1.pdf


Language Modeling I

Algorithms for NLP

Taylor Berg-Kirkpatrick – CMU

Slides: Dan Klein – UC Berkeley

The Noisy-Channel Model

§ We want to predict a sentence given acoustics:

§ The noisy-channel approach:

Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions

Language model: Distributions over sequences of words (sentences)

w_best = argmax_w P(w | a) = argmax_w P(a | w) P(w)

[Diagram: the source P(w) generates a word sequence w, the channel P(a | w) produces the observed acoustics a, and the decoder recovers the best w. P(w) is the Language Model; P(a | w) is the Acoustic Model.]
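In practice the decoding rule above is often applied as a reranker over a candidate list. A minimal sketch, assuming the candidates and the two scoring functions (acoustic_logprob, lm_logprob, both hypothetical names) are supplied elsewhere:

```python
def noisy_channel_decode(candidates, acoustics, acoustic_logprob, lm_logprob):
    """Pick argmax_w P(a|w) P(w) by maximizing log P(a|w) + log P(w).

    candidates: iterable of hypothesis word sequences
    acoustic_logprob(a, w): log P(a|w) from the acoustic model (assumed given)
    lm_logprob(w): log P(w) from the language model (assumed given)
    """
    return max(candidates,
               key=lambda w: acoustic_logprob(acoustics, w) + lm_logprob(w))
```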

ASR Components

Acoustic Confusions

the station signs are in deep in english       -14732
the stations signs are in deep in english      -14735
the station signs are in deep into english     -14739
the station's signs are in deep in english     -14740
the station signs are in deep in the english   -14741
the station signs are indeed in english        -14757
the station's signs are indeed in english      -14760
the station signs are indians in english       -14790
the station signs are indian in english        -14799
the stations signs are indians in english      -14807
the stations signs are indians and english     -14815

Translation: Codebreaking?

“Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ”

Warren Weaver (1947)

e_best = argmax_e P(e | f) = argmax_e P(f | e) P(e)

[Diagram: the source P(e) generates an English sentence e, the channel P(f | e) produces the observed foreign sentence f, and the decoder recovers the best e. P(e) is the Language Model; P(f | e) is the Translation Model.]

MT System Components

Other Noisy Channel Models?

§ We’re not doing this only for ASR (and MT)
§ Grammar/spelling correction
§ Handwriting recognition, OCR
§ Document summarization
§ Dialog generation
§ Linguistic decipherment
§ …

Language Models

§ A language model is a distribution over sequences of words (sentences)

P(w) = P(w_1 … w_n)

§ What’s w? (closed vs open vocabulary)
§ What’s n? (must sum to one over all lengths)
§ Can have rich structure or be linguistically naive

§ Why language models?
§ Usually the point is to assign high weights to plausible sentences (cf. acoustic confusions)
§ This is not the same as modeling grammaticality

N-Gram Models

N-Gram Models

§ Use chain rule to generate words left-to-right
§ Can’t condition on the entire left context
§ N-gram models make a Markov assumption

P(??? | Turn to page 134 and look at the picture of the)
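Written out, the chain rule and its Markov truncation look as follows; this is a standard identity (the trigram order is chosen here just for concreteness, not taken from the slides):

```latex
\[
P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})
                 \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})
\]
```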

Empirical N-Grams

§ How do we know P(w | history)?
§ Use statistics from data (examples using Google N-Grams)
§ E.g. what is P(door | the)?
§ This is the maximum likelihood estimate

Training Counts:
198015222  the first
194623024  the same
168504105  the following
158562063  the world
…
 14112454  the door
-----------------
23135851162  the *
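The maximum likelihood estimate is just a ratio of counts, e.g. P(door | the) = c(the door) / c(the *). A minimal sketch for bigrams, assuming a corpus given as a list of tokens:

```python
from collections import Counter

def mle_bigram(tokens):
    """MLE bigram estimates: P(w | prev) = c(prev, w) / c(prev, *)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    def prob(w, prev):
        return bigram_counts[(prev, w)] / context_counts[prev] if context_counts[prev] else 0.0
    return prob

# Hypothetical usage: with Google N-Gram-scale counts this would give
# P(door | the) ≈ 14112454 / 23135851162 ≈ 0.0006
```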

Increasing N-Gram Order

§ Higher orders capture more dependencies

Bigram Model:
198015222  the first
194623024  the same
168504105  the following
158562063  the world
…
 14112454  the door
-----------------
23135851162  the *

P(door | the) = 0.0006

Trigram Model:
197302  close the window
191125  close the door
152500  close the gap
116451  close the thread
 87298  close the deal
-----------------
3785230  close the *

P(door | close the) = 0.05

Increasing N-Gram Order

Sparsity

3380  please close the door
1601  please close the window
1164  please close the new
1159  please close the gate
…
   0  please close the first
-----------------
13951  please close the *

Please close the first door on the left.

Sparsity

§ Problems with n-gram models:
§ New words (open vocabulary)
    § Synaptitute
    § 132,701.03
    § multidisciplinarization
§ Old words in new contexts

§ Aside: Zipf’s Law
§ Types (words) vs. tokens (word occurrences)
§ Broadly: most word types are rare ones
§ Specifically:
    § Rank word types by token frequency
    § Frequency inversely proportional to rank
§ Not special to language: randomly generated character strings have this property (try it! see the sketch after this list)
§ This law qualitatively (but rarely quantitatively) informs NLP
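As a quick check of the “not special to language” claim, here is an illustrative sketch (not from the slides) that generates random characters, treats spaces as word delimiters, and prints frequency times rank, which stays in the same rough ballpark across ranks:

```python
import random
from collections import Counter

random.seed(0)
# Monkey-at-a-typewriter text: random letters plus a space character as a word delimiter.
alphabet = "abcde "
text = "".join(random.choice(alphabet) for _ in range(2_000_000))
counts = Counter(text.split())

ranked = counts.most_common()
for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        word, freq = ranked[rank - 1]
        # Broadly, freq falls off roughly like 1/rank (a power law), as with real text.
        print(f"rank {rank:>4}  word {word!r:>8}  freq {freq:>7}  freq*rank {freq * rank:>9}")
```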

[Figure: fraction of n-grams seen vs. number of training words (0 to 1,000,000), plotted for unigrams and bigrams. Axes: Number of Words vs. Fraction Seen (0 to 1).]

N-Gram Estimation

Smoothing

§ We often want to make estimates from sparse statistics:
§ Smoothing flattens spiky distributions so they generalize better:
§ Very important all over NLP, but easy to do badly

P(w | denied the), observed counts:
3  allegations
2  reports
1  claims
1  request
7  total

[Bar charts: the unsmoothed distribution puts all its mass on the seen words; the smoothed one shaves mass from them and spreads it over unseen words such as charges, motion, benefits.]

P(w | denied the), smoothed:
2.5  allegations
1.5  reports
0.5  claims
0.5  request
2    other
7    total
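The smoothed numbers above look like absolute discounting with d = 0.5: each seen count is reduced by 0.5 and the shaved mass (here 2) is reserved for unseen words. A small sketch reproducing that arithmetic (the value of d is read off the example, not stated on the slide):

```python
def absolute_discount(counts, d=0.5):
    """Subtract d from each seen count; return discounted counts and the reserved mass."""
    discounted = {w: max(c - d, 0.0) for w, c in counts.items()}
    reserved = sum(counts.values()) - sum(discounted.values())
    return discounted, reserved

counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
discounted, other = absolute_discount(counts)
print(discounted, "other:", other)
# {'allegations': 2.5, 'reports': 1.5, 'claims': 0.5, 'request': 0.5} other: 2.0
```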

Likelihood and Perplexity

§ How do we measure LM “goodness”?
§ Shannon’s game: predict the next word

When I eat pizza, I wipe off the _________

§ Formally: define test set (log) likelihood
§ Perplexity: “average per-word branching factor”

Model probabilities:
grease  0.5
sauce   0.4
dust    0.05
…
mice    0.0001
…
the     1e-100

Counts:
3516  wipe off the excess
1034  wipe off the dust
 547  wipe off the sweat
 518  wipe off the mouthpiece
…
 120  wipe off the grease
   0  wipe off the sauce
   0  wipe off the mice
-----------------
28048  wipe off the *

log P(X | θ) = Σ_{w ∈ X} log P(w | θ)

perp(X, θ) = exp( −log P(X | θ) / |X| )
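A minimal sketch of how these two quantities would be computed, assuming a model exposing a conditional log-probability function logprob(word, history) (a hypothetical interface):

```python
import math

def log_likelihood(test_tokens, logprob):
    """Test-set log-likelihood: sum of log P(w | theta) over the test tokens."""
    return sum(logprob(w, test_tokens[:i]) for i, w in enumerate(test_tokens))

def perplexity(test_tokens, logprob):
    """exp of the negative average per-word log-likelihood."""
    return math.exp(-log_likelihood(test_tokens, logprob) / len(test_tokens))
```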

Measuring Model Quality (Speech)

§ We really want better ASR (or whatever), not better perplexities
§ For speech, we care about word error rate (WER)
§ Common issue: intrinsic measures like perplexity are easier to use, but extrinsic ones are more credible

Correct answer:      Andy saw a part of the movie
Recognizer output:   And he saw apart of the movie

WER = (insertions + deletions + substitutions) / (true sentence size) = 4/7 = 57%
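WER is word-level edit distance divided by the reference length. A sketch of the standard dynamic program (not from the slides), which reproduces the 4/7 above:

```python
def wer(reference, hypothesis):
    """Word error rate: (insertions + deletions + substitutions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / len(ref)

print(wer("Andy saw a part of the movie", "And he saw apart of the movie"))  # ≈ 0.57
```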

Key Ideas for N-Gram LMs

Idea 1: Interpolation

Please close the first door on the left.

4-Gram (specific but sparse):
3380  please close the door
1601  please close the window
1164  please close the new
1159  please close the gate
…
   0  please close the first
-----------------
13951  please close the *

3-Gram:
197302  close the window
191125  close the door
152500  close the gap
116451  close the thread
…
  8662  close the first
-----------------
3785230  close the *

2-Gram (dense but general):
198015222  the first
194623024  the same
168504105  the following
158562063  the world
…
-----------------
23135851162  the *

P(first | please close the) = 0.0    P(first | close the) = 0.002    P(first | the) = 0.009

(Linear) Interpolation

§ Simplest way to mix different orders: linear interpolation (see the sketch after this list)

§ How to choose lambdas?
§ Should lambda depend on the counts of the histories?

§ Choosing weights: either grid search or EM using held-out data

§ Better methods have interpolation weights connected to context counts, so you smooth more when you know less
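A minimal sketch of linear interpolation for a trigram model, assuming MLE estimates p1, p2, p3 for the unigram, bigram, and trigram conditionals are supplied by the caller; the fixed lambdas here are placeholders, not values from the slides:

```python
def interpolated_prob(w, prev2, prev1, p1, p2, p3, lambdas=(0.2, 0.3, 0.5)):
    """P_interp(w | prev2 prev1) = l1*P(w) + l2*P(w | prev1) + l3*P(w | prev2, prev1).

    p1(w), p2(w, prev1), p3(w, prev2, prev1) are MLE estimates supplied by the caller.
    The lambdas must be nonnegative and sum to one.
    """
    l1, l2, l3 = lambdas
    return l1 * p1(w) + l2 * p2(w, prev1) + l3 * p3(w, prev2, prev1)
```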

Train, Held-Out, Test

§ Want to maximize likelihood on test, not training data

§ Empirical n-grams won’t generalize well
§ Models derived from counts/sufficient statistics require generalization parameters to be tuned on held-out data to simulate test generalization

§ Set hyperparameters to maximize the likelihood of the held-out data (usually with grid search or EM)

Training Data:  counts / parameters from here
Held-Out Data:  hyperparameters from here
Test Data:      evaluate here
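A sketch of tuning the interpolation lambdas by grid search on held-out data, reusing the hypothetical interpolated_prob function from the earlier sketch:

```python
import itertools
import math

def grid_search_lambdas(heldout_trigrams, p1, p2, p3, step=0.1):
    """Pick (l1, l2, l3) summing to 1 that maximize held-out log-likelihood.

    heldout_trigrams: list of (prev2, prev1, w) tuples from the held-out data.
    """
    best, best_ll = None, -math.inf
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in itertools.product(grid, grid):
        l3 = round(1.0 - l1 - l2, 10)
        if l3 < 0:
            continue
        ll = sum(math.log(interpolated_prob(w, u, v, p1, p2, p3, (l1, l2, l3)) + 1e-12)
                 for (u, v, w) in heldout_trigrams)
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best
```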

Idea 2: Discounting

§ Observation: N-grams occur more in training data than they will later

Empirical Bigram Counts (Church and Gale, 91):

Count in 22M Words    Future c* (Next 22M)
1                     0.45
2                     1.25
3                     2.24
4                     3.23
5                     4.21

Absolute Discounting

§ Absolute discounting
§ Reduce numerator counts by a constant d (e.g. 0.75)
§ Maybe have a special discount for small counts
§ Redistribute the “shaved” mass to a model of new events

§ Example formulation:
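The slide’s formula image did not survive the transcript; a standard formulation of absolute discounting for a bigram model looks like this (the backoff distribution and the normalizer α are whatever the shaved mass is redistributed to):

```latex
\[
P_{\mathrm{abs}}(w \mid w_{-1}) =
  \frac{\max\bigl(c(w_{-1}, w) - d,\, 0\bigr)}{\sum_{v} c(w_{-1}, v)}
  + \alpha(w_{-1})\, P_{\mathrm{backoff}}(w)
\]
```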

Idea 3: Fertility

§ Shannon game: “There was an unexpected _____”
§ “delay”?
§ “Francisco”?

§ Context fertility: number of distinct context types that a word occurs in
§ What is the fertility of “delay”?
§ What is the fertility of “Francisco”?
§ Which is more likely in an arbitrary new context?

Kneser-Ney Smoothing

§ Kneser-Ney smoothing combines two ideas
§ Discount and reallocate like absolute discounting
§ In the backoff model, word probabilities are proportional to context fertility, not frequency:

P(w) ∝ |{w′ : c(w′, w) > 0}|

§ Theory and practice
§ Practice: KN smoothing has been repeatedly proven both effective and efficient
§ Theory: KN smoothing as approximate inference in a hierarchical Pitman-Yor process [Teh, 2006]

Kneser-Ney Details

§ All orders recursively discount and back off:
§ Alpha is computed to make the probability normalize (see if you can figure out an expression).
§ For the highest order, c′ is the token count of the n-gram. For all others it is the context fertility of the n-gram:
§ The unigram base case does not need to discount.
§ Variants are possible (e.g. different d for low counts)

c′(x) = |{u : c(u, x) > 0}|

P_k(w | prev_{k−1}) = max(c′(prev_{k−1}, w) − d, 0) / Σ_v c′(prev_{k−1}, v) + α(prev_{k−1}) · P_{k−1}(w | prev_{k−2})
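A compact sketch of the recursion for the bigram case (the highest order uses token counts, the unigram base case uses context fertility; d = 0.75 is a placeholder, and α is chosen so each conditional distribution normalizes):

```python
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams: returns a function P(w | prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))            # token counts c(prev, w)
    context_total = Counter(tokens[:-1])                  # sum_v c(prev, v)
    fertility = Counter(w for (_, w) in bigrams)          # |{u : c(u, w) > 0}|
    total_fertility = sum(fertility.values())
    continuation = {w: f / total_fertility for w, f in fertility.items()}  # unigram base case
    types_after = Counter(prev for (prev, _) in bigrams)  # |{w : c(prev, w) > 0}|

    def prob(w, prev):
        c = context_total[prev]
        if c == 0:                                        # unseen context: fall back entirely
            return continuation.get(w, 0.0)
        alpha = d * types_after[prev] / c                 # shaved mass, chosen so P normalizes
        return max(bigrams[(prev, w)] - d, 0.0) / c + alpha * continuation.get(w, 0.0)

    return prob
```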

What Actually Works?

§ Trigrams and beyond:
§ Unigrams, bigrams generally useless
§ Trigrams much better
§ 4-, 5-grams and more are really useful in MT, but gains are more limited for speech

§ Discounting
§ Absolute discounting, Good-Turing, held-out estimation, Witten-Bell, etc…

§ Context counting
§ Kneser-Ney construction of lower-order models

§ See [Chen+Goodman] reading for tons of graphs…

[Graph from Joshua Goodman]

Idea 4: Big Data

There’s no data like more data.

Data >> Method?

§ Having more data is better…
§ …but so is using a better estimator
§ Another issue: N > 3 has huge costs in speech recognizers

[Figure: test entropy (roughly 5.5 to 10) vs. n-gram order (1 to 20) for Katz and Kneser-Ney smoothing at training set sizes of 100,000, 1,000,000, 10,000,000, and all words.]

Tons of Data?

[Brants et al., 2007]

What about…

Unknown Words?

§ What about totally unseen words?

§ Most LM applications are closed vocabulary
§ ASR systems will only propose words that are in their pronunciation dictionary
§ MT systems will only propose words that are in their phrase tables (modulo special models for numbers, etc)

§ In principle, one can build open vocabulary LMs
§ E.g. models over character sequences rather than word sequences
§ Back-off needs to go down into a “generate new word” model (see the sketch below)
§ Typically if you need this, a high-order character model will do
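A toy sketch of the open-vocabulary idea: known words use the word model, while unseen words pay a reserved “new word” cost and are then spelled out by a character model. The reserved mass, the word/character distributions, and the end-of-word symbol "</w>" are all illustrative assumptions, not from the slides:

```python
import math

def open_vocab_logprob(word, word_logprobs, char_logprob, unk_logprob=math.log(0.01)):
    """log P(word) for an open-vocabulary LM: known words use the word model;
    unknown words pay the "new word" cost, then are spelled out character by
    character, ending with a hypothetical end-of-word symbol "</w>"."""
    if word in word_logprobs:
        return word_logprobs[word]
    spelled = sum(char_logprob(ch) for ch in word) + char_logprob("</w>")
    return unk_logprob + spelled
```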

What’s in an N-Gram?

§ Just about every local correlation!
§ Word class restrictions: “will have been ___”
§ Morphology: “she ___”, “they ___”
§ Semantic class restrictions: “danced the ___”
§ Idioms: “add insult to ___”
§ World knowledge: “ice caps have ___”
§ Pop culture: “the empire strikes ___”

§ But not the long-distance ones
§ “The computer which I had just put into the machine room on the fifth floor ___.”

Linguistic Pain?

§ The N-Gram assumption hurts one’s inner linguist!
§ Many linguistic arguments that language isn’t regular
§ Long-distance dependencies
§ Recursive structure

§ Answers
§ N-grams only model local correlations, but they get them all
§ As N increases, they catch even more correlations
§ N-gram models scale much more easily than structured LMs

§ Not convinced?
§ Can build LMs out of our grammar models (later in the course)
§ Take any generative model with words at the bottom and marginalize out the other variables

What Gets Captured?

§ Bigram model:
§ [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
§ [outside, new, car, parking, lot, of, the, agreement, reached]
§ [this, would, be, a, record, november]

§ PCFG model:
§ [This, quarter, ‘s, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices, .]
§ [It, could, be, announced, sometime, .]
§ [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks, .]
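Samples like the bigram ones above come from running the model generatively, drawing each word from its conditional distribution. A minimal sketch, assuming a dict bigram_probs[prev] mapping each next word to its probability (the structure and the "<s>"/"</s>" boundary symbols are hypothetical):

```python
import random

def sample_sentence(bigram_probs, start="<s>", stop="</s>", max_len=30):
    """Generate left-to-right by repeatedly sampling from P(w | previous word)."""
    prev, words = start, []
    for _ in range(max_len):
        nxt, weights = zip(*bigram_probs[prev].items())
        w = random.choices(nxt, weights=weights)[0]
        if w == stop:
            break
        words.append(w)
        prev = w
    return words
```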

Scaling Up?

§ There’s a lot of training data out there…

…next class we’ll talk about how to make it fit.

Other Techniques?

§ Lots of other techniques
§ Maximum entropy LMs (soon)
§ Neural network LMs (soon)
§ Syntactic/grammar-structured LMs (much later)
