TRANSCRIPT
Introduction to Information Retrieval

CS276: Information Retrieval and Web Search
Christopher Manning and Pandu Nayak

Spelling Correction
The course thus far…
§ Index construction, index compression, efficient boolean querying: Chapters 1, 2, 4, 5; Coursera lectures 1, 2, 3, 4
§ Spelling correction: Chapter 3; Coursera lecture 5 (mainly some parts); this lecture (PA #2!)
Applications for spelling correction
§ Web search
§ Phones
§ Word processing
Rates of spelling errors
§ 26%: Web queries. Wang et al. 2003
§ 13%: Retyping, no backspace: Whitelaw et al., English & German
§ 7%: Words corrected retyping on phone-sized organizer
§ 2%: Words uncorrected on organizer. Soukoreff & MacKenzie 2003
§ 1–2%: Retyping: Kane and Wobbrock 2007, Gruden et al. 1983
§ Depending on the application, ~1–20% error rates
Spelling Tasks
§ Spelling Error Detection
§ Spelling Error Correction:
  § Autocorrect: hte → the
  § Suggest a correction
  § Suggestion lists
Types of spelling errors
§ Non-word Errors
  § graffe → giraffe
§ Real-word Errors
  § Typographical errors: three → there
  § Cognitive Errors (homophones): piece → peace, too → two, your → you're
§ Non-word correction was historically mainly context insensitive
§ Real-word correction almost needs to be context sensitive
Non-word spelling errors
§ Non-word spelling error detection:
  § Any word not in a dictionary is an error
  § The larger the dictionary the better… up to a point
  § (The Web is full of mis-spellings, so the Web isn't necessarily a great dictionary…)
§ Non-word spelling error correction:
  § Generate candidates: real words that are similar to error
  § Choose the one which is best:
    § Shortest weighted edit distance
    § Highest noisy channel probability
Real word & non-word spelling errors
§ For each word w, generate candidate set:
  § Find candidate words with similar pronunciations
  § Find candidate words with similar spellings
  § Include w in candidate set
§ Choose best candidate
  § Noisy Channel view of spell errors
  § Context-sensitive – so have to consider whether the surrounding words "make sense"
    § Flying form Heathrow to LAX → Flying from Heathrow to LAX
Terminology
§ These are character bigrams: st, pr, an …
§ These are word bigrams: palo alto, flying from, road repairs
§ Similarly trigrams, k-grams etc.
§ In today's class, we will generally deal with word bigrams
§ In the accompanying Coursera lecture, we mostly deal with character bigrams (because we cover stuff complementary to what we're discussing here)
INDEPENDENT WORD SPELLING CORRECTION
The Noisy Channel Model of Spelling

Noisy Channel Intuition
[figure]
Noisy Channel = Bayes' Rule
§ We see an observation x of a misspelled word
§ Find the correct word ŵ

  ŵ = argmax_{w ∈ V} P(w|x)
    = argmax_{w ∈ V} P(x|w) P(w) / P(x)     [Bayes' Rule]
    = argmax_{w ∈ V} P(x|w) P(w)
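The decision rule above drops P(x) because it is the same for every candidate w, so it cannot change the argmax. A minimal sketch in code, assuming the candidate set, channel model P(x|w), and prior P(w) are already available as dictionaries (all names and numbers here are illustrative, not from the lecture):

```python
def correct(x, candidates, channel, prior):
    """Noisy channel decision rule: argmax over w of P(x|w) * P(w).

    channel maps (observed, intended) pairs to P(x|w);
    prior maps words to P(w). Both are toy stand-ins here.
    """
    return max(candidates,
               key=lambda w: channel.get((x, w), 0.0) * prior.get(w, 0.0))

# Illustrative numbers only:
prior = {"the": 0.02, "thaw": 0.0000007}
channel = {("thew", "the"): 0.000007, ("thew", "thaw"): 0.001}
best = correct("thew", ["the", "thaw"], channel, prior)  # → "the"
```

Even though the edit thew → thaw is more probable than thew → the, the far larger prior of "the" wins, which is exactly the trade-off the product expresses.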
History: Noisy channel for spelling proposed around 1990
§ IBM
  § Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 23(5), 517–522
§ AT&T Bell Labs
  § Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205–210
Non-word spelling error example
§ acress
Candidate generation
§ Words with similar spelling
  § Small edit distance to error
§ Words with similar pronunciation
  § Small distance of pronunciation to error
§ In this class lecture we mostly won't dwell on efficient candidate generation
§ A lot more about candidate generation in the accompanying Coursera material
Candidate Testing: Damerau-Levenshtein edit distance
§ Minimal edit distance between two strings, where edits are:
  § Insertion
  § Deletion
  § Substitution
  § Transposition of two adjacent letters
§ See IIR sec 3.3.3 for edit distance
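As a concrete reference, here is one standard dynamic-programming implementation of this (restricted) Damerau-Levenshtein distance. It is a sketch for illustration, not the IIR book's code:

```python
def dl_distance(s, t):
    """Restricted Damerau-Levenshtein distance: minimal number of
    insertions, deletions, substitutions, and transpositions of two
    adjacent letters needed to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

Note that `dl_distance("acress", "caress")` is 1 (one transposition), where plain Levenshtein distance without the transposition case would give 2.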
Words within 1 of acress

Error    Candidate Correction   Correct Letter   Error Letter   Type
acress   actress                t                -              deletion
acress   cress                  -                a              insertion
acress   caress                 ca               ac             transposition
acress   access                 c                r              substitution
acress   across                 o                e              substitution
acress   acres                  -                s              insertion
Candidate generation
§ 80% of errors are within edit distance 1
§ Almost all errors within edit distance 2
§ Also allow insertion of space or hyphen
  § thisidea → this idea
  § inlaw → in-law
§ Can also allow merging words
  § data base → database
§ For short texts like a query, can just regard whole string as one item from which to produce edits
How do you generate the candidates?
1. Run through dictionary, check edit distance with each word
2. Generate all words within edit distance ≤ k (e.g., k = 1 or 2) and then intersect them with dictionary
3. Use a character k-gram index and find dictionary words that share "most" k-grams with word (e.g., by Jaccard coefficient)
   § see IIR sec 3.3.4
4. Compute them fast with a Levenshtein finite state transducer
5. Have a precomputed map of words to possible corrections
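Strategy 2 above can be sketched in a few lines, in the style of Peter Norvig's well-known spell corrector: generate every string one edit away, then keep only dictionary words. The tiny dictionary below is just the acress candidate set from the earlier table:

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings within (restricted) edit distance 1 of word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def known_edits1(word, dictionary):
    """Strategy 2: generate all edits, intersect with the dictionary."""
    return edits1(word) & dictionary

# Tiny illustrative dictionary -- the acress candidates from earlier:
dictionary = {"actress", "cress", "caress", "access", "across", "acres"}
candidates = known_edits1("acress", dictionary)
```

For a 6-letter word and a 26-letter alphabet this generates a few hundred strings; applying `edits1` to each of those covers edit distance 2, which is why intersecting with a hashed dictionary keeps the whole thing fast.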
A paradigm…
§ We want the best spell corrections
§ Instead of finding the very best, we
  § Find a subset of pretty good corrections (say, edit distance at most 2)
  § Find the best amongst them
§ These may not be the actual best
§ This is a recurring paradigm in IR including finding the best docs for a query, best answers, best ads…
  § Find a good candidate set
  § Find the top K amongst them and return them as the best
Let's say we've generated candidates: Now back to Bayes' Rule
§ We see an observation x of a misspelled word
§ Find the correct word ŵ

  ŵ = argmax_{w ∈ V} P(w|x)
    = argmax_{w ∈ V} P(x|w) P(w) / P(x)
    = argmax_{w ∈ V} P(x|w) P(w)

What's P(w)?
Language Model
§ Take a big supply of words (your document collection with T tokens); let C(w) = # occurrences of w

  P(w) = C(w) / T

§ In other applications – you can take the supply to be typed queries (suitably filtered) – when a static dictionary is inadequate
Unigram Prior probability

word      Frequency of word   P(w)
actress        9,321          .0000230573
cress            220          .0000005442
caress           686          .0000016969
access        37,038          .0000916207
across       120,844          .0002989314
acres         12,874          .0000318463

Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)
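The P(w) column above is just C(w) / T applied to the COCA counts on the slide, which is easy to verify:

```python
# T and the counts are exactly the numbers given on the slide.
T = 404_253_213
counts = {"actress": 9_321, "cress": 220, "caress": 686,
          "access": 37_038, "across": 120_844, "acres": 12_874}

P = {w: c / T for w, c in counts.items()}   # P(w) = C(w) / T
```

For instance, P["actress"] comes out to about .0000230573, matching the table.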
Channel model probability
§ Error model probability, Edit probability
§ Kernighan, Church, Gale 1990
§ Misspelled word x = x1, x2, x3 … xm
§ Correct word w = w1, w2, w3, …, wn
§ P(x|w) = probability of the edit
  § (deletion/insertion/substitution/transposition)
Computing error probability: confusion "matrix"

  del[x,y]:   count(xy typed as x)
  ins[x,y]:   count(x typed as xy)
  sub[x,y]:   count(y typed as x)
  trans[x,y]: count(xy typed as yx)

§ Insertion and deletion conditioned on previous character
Confusion matrix for substitution
[figure]

Nearby keys
[figure]
Generating the confusion matrix
§ Peter Norvig's list of errors
§ Peter Norvig's list of counts of single-edit errors
§ All Peter Norvig's ngrams data links: http://norvig.com/ngrams/
Channel model (Kernighan, Church, Gale 1990)

  P(x|w) = del[w_{i−1}, w_i] / count[w_{i−1} w_i]       if deletion
           ins[w_{i−1}, x_i] / count[w_{i−1}]           if insertion
           sub[x_i, w_i] / count[w_i]                   if substitution
           trans[w_i, w_{i+1}] / count[w_i w_{i+1}]     if transposition
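The four cases can be transcribed directly, assuming the confusion-matrix counts (del/ins/sub/trans) and the character and character-bigram counts have already been collected. The toy counts below are invented to reproduce the c|ct entry (.000117) from the acress tables on the following slides:

```python
def channel_prob(edit, dele, ins, sub, trans, count):
    """P(x|w) for one edit, following the four cases in the formula above.

    `edit` is a tuple (kind, a, b); all count tables are toy stand-ins.
    """
    kind, a, b = edit
    if kind == "del":        # correct 'ab' typed as 'a': del[a,b] / count[ab]
        return dele[(a, b)] / count[a + b]
    if kind == "ins":        # correct 'a' typed as 'ab': ins[a,b] / count[a]
        return ins[(a, b)] / count[a]
    if kind == "sub":        # correct 'b' typed as 'a': sub[a,b] / count[b]
        return sub[(a, b)] / count[b]
    if kind == "trans":      # correct 'ab' typed as 'ba': trans[a,b] / count[ab]
        return trans[(a, b)] / count[a + b]
    raise ValueError(kind)

# Invented counts chosen so the acress → actress edit gives c|ct = .000117:
dele = {("c", "t"): 117}
count = {"ct": 1_000_000}
p = channel_prob(("del", "c", "t"), dele, {}, {}, {}, count)
```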
Smoothing probabilities: Add-1 smoothing
§ But if we use the confusion matrix example, unseen errors are impossible!
§ They'll make the overall probability 0. That seems too harsh
  § e.g., in Kernighan's chart q→a and a→q are both 0, even though they're adjacent on the keyboard!
§ A simple solution is to add 1 to all counts and then, if there is a |A|-character alphabet, to normalize appropriately:

  If substitution:  P(x|w) = (sub[x,w] + 1) / (count[w] + |A|)
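In code, the add-1 smoothed substitution probability is one line; the toy counts below (an unseen q-for-a error) show that it is now nonzero:

```python
ALPHABET_SIZE = 26          # |A| for lowercase English
sub = {}                    # toy confusion counts: q typed for a is unseen
count = {"a": 1000}         # toy occurrences of 'a' in the training corpus

def p_sub(x, w):
    """Add-1 smoothed substitution probability P(x|w)."""
    return (sub.get((x, w), 0) + 1) / (count[w] + ALPHABET_SIZE)

p = p_sub("q", "a")         # nonzero even though the error was never seen
```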
Channel model for acress

Candidate Correction   Correct Letter   Error Letter   x|w     P(x|w)
actress                t                -              c|ct    .000117
cress                  -                a              a|#     .00000144
caress                 ca               ac             ac|ca   .00000164
access                 c                r              r|c     .000000209
across                 o                e              e|o     .0000093
acres                  -                s              es|e    .0000321
acres                  -                s              ss|s    .0000342
Noisy channel probability for acress

Candidate Correction   Correct Letter   Error Letter   x|w     P(x|w)       P(w)         10^9 · P(x|w) · P(w)
actress                t                -              c|ct    .000117      .0000231     2.7
cress                  -                a              a|#     .00000144    .000000544   .00078
caress                 ca               ac             ac|ca   .00000164    .00000170    .0028
access                 c                r              r|c     .000000209   .0000916     .019
across                 o                e              e|o     .0000093     .000299      2.8
acres                  -                s              es|e    .0000321     .0000318     1.0
acres                  -                s              ss|s    .0000342     .0000318     1.0
Evaluation
§ Some spelling error test sets
  § Wikipedia's list of common English misspellings
  § Aspell filtered version of that list
  § Birkbeck spelling error corpus
  § Peter Norvig's list of errors (includes Wikipedia and Birkbeck, for training or testing)
SPELLING CORRECTION WITH THE NOISY CHANNEL
Context-Sensitive Spelling Correction
Real-word spelling errors
§ …leaving in about fifteen minuets to go to her house.
§ The design an construction of the system…
§ Can they lave him my messages?
§ The study was conducted mainly be John Black.
§ 25–40% of spelling errors are real words. Kukich 1992
Context-sensitive spelling error fixing
§ For each word in sentence (phrase, query …)
  § Generate candidate set
    § the word itself
    § all single-letter edits that are English words
    § words that are homophones
  § (all of this can be pre-computed!)
§ Choose best candidates
  § Noisy channel model
Noisy channel for real-word spell correction
§ Given a sentence w1, w2, w3, …, wn
§ Generate a set of candidates for each word wi
  § Candidate(w1) = {w1, w'1, w''1, w'''1, …}
  § Candidate(w2) = {w2, w'2, w''2, w'''2, …}
  § Candidate(wn) = {wn, w'n, w''n, w'''n, …}
§ Choose the sequence W that maximizes P(W)
Incorporating context words: Context-sensitive spelling correction
§ Determining whether actress or across is appropriate will require looking at the context of use
§ We can do this with a better language model
  § You learned/can learn a lot about language models in CS124 or CS224N
  § Here we present just enough to be dangerous/do the assignment
§ A bigram language model conditions the probability of a word on (just) the previous word

  P(w1…wn) = P(w1) P(w2|w1) … P(wn|wn−1)
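A minimal bigram language model implementing the chain-rule decomposition above can be built from raw counts. The one-sentence corpus below is a toy stand-in (not the COCA counts used on the slides):

```python
from collections import Counter

# One-sentence toy corpus; "actress" stands in for the intended word.
tokens = ("a stellar and versatile actress whose "
          "combination of sass and glamour").split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
T = len(tokens)

def p_uni(w):
    """Unigram estimate P(w) = C(w) / T."""
    return unigrams[w] / T

def p_bi(w, prev):
    """Bigram estimate P(w|prev) = C(prev, w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(words):
    """P(w1...wn) = P(w1) * P(w2|w1) * ... * P(wn|wn-1)."""
    p = p_uni(words[0])
    for prev, w in zip(words, words[1:]):
        p *= p_bi(w, prev)
    return p
```

Note how a single unseen bigram ("versatile across") drives the whole sentence probability to zero, which motivates the smoothing on the next slide.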
Incorporating context words
§ For unigram counts, P(w) is always non-zero
  § if our dictionary is derived from the document collection
§ This won't be true of P(wk|wk−1). We need to smooth
§ We could use add-1 smoothing on this conditional distribution
§ But here's a better way – interpolate a unigram and a bigram:

  Pli(wk|wk−1) = λ Puni(wk) + (1 − λ) Pbi(wk|wk−1)
  Pbi(wk|wk−1) = C(wk−1, wk) / C(wk−1)
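The interpolation itself is one line. With λ = 0.5 and toy stand-ins for Puni and Pbi, an unseen bigram no longer zeroes out the estimate:

```python
def p_li(w, prev, p_uni, p_bi, lam=0.5):
    """Interpolated estimate: lam * Puni(w) + (1 - lam) * Pbi(w|prev)."""
    return lam * p_uni(w) + (1 - lam) * p_bi(w, prev)

# Toy stand-ins: unigram gives 0.0001, the bigram is unseen (0.0).
p = p_li("whose", "across", lambda w: 0.0001, lambda w, prev: 0.0)
```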
All the important fine points
§ Note that we have several probability distributions for words
  § Keep them straight!
§ You might want/need to work with log probabilities:
  § log P(w1…wn) = log P(w1) + log P(w2|w1) + … + log P(wn|wn−1)
  § Otherwise, be very careful about floating point underflow
§ Our query may be words anywhere in a document
  § We'll start the bigram estimate of a sequence with a unigram estimate
  § Often, people instead condition on a start-of-sequence symbol, but not good here
  § Because of this, the unigram and bigram counts have different totals – not a problem
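The underflow warning is easy to demonstrate: multiplying a hundred small (illustrative) probabilities hits exact floating-point zero, while the log-space sum stays finite and comparable:

```python
import math

probs = [1e-5] * 100        # illustrative per-word probabilities

direct = 1.0
for p in probs:
    direct *= p             # 1e-500 is not representable: underflows to 0.0

logp = sum(math.log(p) for p in probs)   # finite, easy to compare and rank
```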
Using a bigram language model
§ "a stellar and versatile acress whose combination of sass and glamour…"
§ Counts from the Corpus of Contemporary American English with add-1 smoothing
§ P(actress|versatile) = .000021    P(whose|actress) = .0010
§ P(across|versatile) = .000021    P(whose|across) = .000006
§ P("versatile actress whose") = .000021 × .0010 = 210 × 10^-10
§ P("versatile across whose") = .000021 × .000006 = 1 × 10^-10
Noisy channel for real-word spell correction

[figure: candidate lattice for the sentence "two of thew" – e.g. two → {two, to, too, tao}, of → {of, off, on}, thew → {thew, the, threw, thaw}, …]
Simplification: One error per sentence
§ Out of all possible sentences with one word replaced
  § w1, w''2, w3, w4      two off thew
  § w1, w2, w'3, w4       two of the
  § w'''1, w2, w3, w4     too of thew
  § …
§ Choose the sequence W that maximizes P(W)
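The one-error-per-sentence search can be sketched as a double loop over positions and candidates. The candidate generator and sentence scorer below are toy stand-ins, not the lecture's models:

```python
def best_sentence(words, candidates, p_sentence):
    """Try every sentence with at most one word replaced by a candidate;
    return the one the sentence model scores highest."""
    best, best_p = words, p_sentence(words)
    for i, w in enumerate(words):
        for c in candidates(w):
            if c == w:
                continue
            alt = words[:i] + [c] + words[i + 1:]
            p = p_sentence(alt)
            if p > best_p:
                best, best_p = alt, p
    return best

# Toy stand-ins for the candidate generator and the sentence scorer:
cands = {"thew": ["thew", "the", "thaw"]}

def candidates(w):
    return cands.get(w, [w])

def p_sentence(ws):
    # Pretend the language model strongly prefers "two of the".
    return 1.0 if ws == ["two", "of", "the"] else 0.1

fixed = best_sentence(["two", "of", "thew"], candidates, p_sentence)
```

With n words and c candidates each, this scores only n·c sentences instead of the cⁿ sequences the full lattice would allow, which is the point of the simplification.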
Where to get the probabilities
§ Language model
  § Unigram
  § Bigram
  § etc.
§ Channel model
  § Same as for non-word spelling correction
  § Plus need probability for no error, P(w|w)
Probability of no error
§ What is the channel probability for a correctly typed word?
  § P("the"|"the")
§ If you have a big corpus, you can estimate this percent correct
§ But this value depends strongly on the application
  § .90 (1 error in 10 words)
  § .95 (1 error in 20 words)
  § .99 (1 error in 100 words)
Peter Norvig's "thew" example

x      w       x|w      P(x|w)      P(w)           10^9 · P(x|w) · P(w)
thew   the     ew|e     0.000007    0.02           144
thew   thew             0.95        0.00000009     90
thew   thaw    e|a      0.001       0.0000007      0.7
thew   threw   h|hr     0.000008    0.000004       0.03
thew   thwe    ew|we    0.000003    0.00000004     0.0001
State of the art noisy channel
§ We never just multiply the prior and the error model
  § Independence assumptions → probabilities not commensurate
§ Instead: Weight them

  ŵ = argmax_{w ∈ V} P(x|w) P(w)^λ

§ Learn λ from a development test set
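The weighted score is a one-line change from the plain noisy channel. The numbers below are illustrative; in practice λ is tuned on held-out data as the slide says:

```python
def weighted_score(x, w, channel, prior, lam):
    """P(x|w) * P(w)**lam -- lam reweights the prior against the channel."""
    return channel[(x, w)] * prior[w] ** lam

# Illustrative numbers only; lam = 1.0 recovers the unweighted model.
channel = {("thew", "the"): 0.000007, ("thew", "thaw"): 0.001}
prior = {"the": 0.02, "thaw": 0.0000007}
s_the = weighted_score("thew", "the", channel, prior, 1.0)
s_thaw = weighted_score("thew", "thaw", channel, prior, 1.0)
```

Choosing λ < 1 flattens the prior (trusting the channel model more), while λ > 1 sharpens it; a development test set decides which regime actually reduces errors.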
Improvements to channel model
§ Allow richer edits (Brill and Moore 2000)
  § ent → ant
  § ph → f
  § le → al
§ Incorporate pronunciation into channel (Toutanova and Moore 2002)
§ Incorporate device into channel
  § Not all Android phones need have the same error model
  § But spell correction may be done at the system level