Introduction to Information Retrieval: The course thus...

Post on 07-Jul-2018


Introduction to Information Retrieval

CS276: Information Retrieval and Web Search
Christopher Manning and Pandu Nayak

Spelling Correction

The course thus far …
§  Index construction
§  Index compression
§  Efficient boolean querying
§  Chapters 1, 2, 4, 5
§  Coursera lectures 1, 2, 3, 4

Spelling correction
§  Chapter 3
§  Coursera lecture 5 (mainly some parts)
§  This lecture (PA #2!)

Applications for spelling correction
§  Web search
§  Phones
§  Word processing

Rates of spelling errors
§  26%: Web queries (Wang et al. 2003)
§  13%: Retyping, no backspace (Whitelaw et al., English & German)
§  7%: Words corrected retyping on phone-sized organizer
§  2%: Words uncorrected on organizer (Soukoreff & MacKenzie 2003)
§  1-2%: Retyping (Kane and Wobbrock 2007, Gruden et al. 1983)

Depending on the application, ~1–20% error rates

Spelling Tasks
§  Spelling Error Detection
§  Spelling Error Correction:
    §  Autocorrect
        §  hte → the
    §  Suggest a correction
    §  Suggestion lists

Types of spelling errors
§  Non-word Errors
    §  graffe → giraffe
§  Real-word Errors
    §  Typographical errors
        §  three → there
    §  Cognitive Errors (homophones)
        §  piece → peace
        §  too → two
        §  your → you’re
§  Non-word correction was historically mainly context insensitive
§  Real-word correction almost needs to be context sensitive

Non-word spelling errors
§  Non-word spelling error detection:
    §  Any word not in a dictionary is an error
    §  The larger the dictionary the better … up to a point
    §  (The Web is full of mis-spellings, so the Web isn’t necessarily a great dictionary …)
§  Non-word spelling error correction:
    §  Generate candidates: real words that are similar to error
    §  Choose the one which is best:
        §  Shortest weighted edit distance
        §  Highest noisy channel probability

Real word & non-word spelling errors
§  For each word w, generate candidate set:
    §  Find candidate words with similar pronunciations
    §  Find candidate words with similar spellings
    §  Include w in candidate set
§  Choose best candidate
    §  Noisy Channel view of spell errors
    §  Context-sensitive, so have to consider whether the surrounding words “make sense”
    §  Flying form Heathrow to LAX → Flying from Heathrow to LAX

Terminology
§  These are character bigrams:
    §  st, pr, an …
§  These are word bigrams:
    §  palo alto, flying from, road repairs
§  Similarly trigrams, k-grams, etc.
§  In today’s class, we will generally deal with word bigrams
§  In the accompanying Coursera lecture, we mostly deal with character bigrams (because we cover stuff complementary to what we’re discussing here)
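To make the distinction concrete, here is a minimal Python sketch that extracts character k-grams from a word and word k-grams from a phrase. The function names `char_kgrams` and `word_kgrams` are illustrative choices, and whitespace tokenization is a simplifying assumption.

```python
def char_kgrams(word, k=2):
    """Return the overlapping character k-grams of a word, in order."""
    return [word[i:i + k] for i in range(len(word) - k + 1)]


def word_kgrams(text, k=2):
    """Return the overlapping word k-grams of a phrase.

    Tokenization is plain lowercasing plus whitespace splitting,
    which is an assumption made only for this sketch.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


print(char_kgrams("spring"))           # character bigrams
print(char_kgrams("spring", k=3))      # character trigrams
print(word_kgrams("flying from Heathrow to LAX"))  # word bigrams
```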

INDEPENDENT WORD SPELLING CORRECTION

The Noisy Channel Model of Spelling

Noisy Channel Intuition

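Noisy Channel = Bayes’ Rule
§  We see an observation x of a misspelled word
§  Find the correct word ŵ

$$
\begin{aligned}
\hat{w} &= \operatorname*{argmax}_{w \in V} P(w \mid x) \\
        &= \operatorname*{argmax}_{w \in V} \frac{P(x \mid w)\,P(w)}{P(x)} && \text{(Bayes)} \\
        &= \operatorname*{argmax}_{w \in V} P(x \mid w)\,P(w)
\end{aligned}
$$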

History: Noisy channel for spelling proposed around 1990
§  IBM
    §  Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 23(5), 517–522
§  AT&T Bell Labs
    §  Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205-210

Non-word spelling error example
§  acress

Candidate generation
§  Words with similar spelling
    §  Small edit distance to error
§  Words with similar pronunciation
    §  Small distance of pronunciation to error
§  In this class lecture we mostly won’t dwell on efficient candidate generation
§  A lot more about candidate generation in the accompanying Coursera material

Candidate Testing: Damerau-Levenshtein edit distance
§  Minimal edit distance between two strings, where edits are:
    §  Insertion
    §  Deletion
    §  Substitution
    §  Transposition of two adjacent letters
§  See IIR sec 3.3.3 for edit distance
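The four edit types can be sketched as a dynamic program. The version below is the restricted variant usually called optimal string alignment, in which no substring is edited more than once; it is a simplification, not necessarily the exact algorithm used in the course. All edits cost 1 here, whereas a weighted variant would plug in per-edit costs.

```python
def osa_distance(a, b):
    """Edit distance counting insertion, deletion, substitution and
    transposition of two adjacent letters, each at cost 1.

    This is the optimal string alignment (restricted Damerau-Levenshtein)
    variant, chosen here for brevity.
    """
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


for candidate in ["actress", "cress", "caress", "access", "across", "acres"]:
    print(candidate, osa_distance("acress", candidate))
```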

Words within 1 of acress

Error    Candidate Correction   Correct Letter   Error Letter   Type
acress   actress                t                -              deletion
acress   cress                  -                a              insertion
acress   caress                 ca               ac             transposition
acress   access                 c                r              substitution
acress   across                 o                e              substitution
acress   acres                  -                s              insertion
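The six candidates for acress can be reproduced with a short sketch that enumerates every string one edit away from the error and keeps those found in a dictionary. The tiny `DICTIONARY` below is a made-up stand-in for a real word list. Note that the code applies edits to the typed string, so a typist's deletion (actress → acress) shows up here as an insertion, and vice versa.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings reachable from `word` by one insertion, deletion,
    substitution, or transposition of two adjacent letters."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutes = [left + c + right[1:]
                   for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)


# Toy dictionary, invented for this illustration only.
DICTIONARY = {"actress", "cress", "caress", "access", "across", "acres",
              "address", "actor"}


def candidates(word, dictionary=DICTIONARY):
    """Dictionary words within one edit of `word`, sorted alphabetically."""
    return sorted(edits1(word) & dictionary)


print(candidates("acress"))
```

Words such as "address" and "actor" are in the toy dictionary but more than one edit away from "acress", so they are filtered out.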

Candidate generation
§  80% of errors are within edit distance 1
§  Almost all errors within edit distance 2
§  Also allow insertion of space or hyphen
    §  thisidea → this idea
    §  inlaw → in-law
§  Can also allow merging words
    §  data base → database
    §  For short texts like a query, can just regard whole string as one item from which to produce edits

How do you generate the candidates?
1.  Run through dictionary, check edit distance with each word
2.  Generate all words within edit distance ≤ k (e.g., k = 1 or 2) and then intersect them with dictionary
3.  Use a character k-gram index and find dictionary words that share “most” k-grams with word (e.g., by Jaccard coefficient)
    §  see IIR sec 3.3.4
4.  Compute them fast with a Levenshtein finite state transducer
5.  Have a precomputed map of words to possible corrections
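Option 3 in the list above can be sketched as follows: build an inverted index from character k-grams to dictionary words, collect the words that share at least one k-gram with the query, and keep those whose Jaccard coefficient clears a threshold. The threshold of 0.4, the choice of bigrams, and the word list are assumptions for illustration; IIR sec 3.3.4 describes the technique itself.

```python
def kgrams(word, k=2):
    """Set of character k-grams of a word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_kgram_index(dictionary, k=2):
    """Map each k-gram to the set of dictionary words that contain it."""
    index = {}
    for word in dictionary:
        for gram in kgrams(word, k):
            index.setdefault(gram, set()).add(word)
    return index


def kgram_candidates(query, index, k=2, threshold=0.4):
    """Words sharing a k-gram with `query` whose Jaccard score is at
    least `threshold`, returned as (score, word) pairs, best first."""
    query_grams = kgrams(query, k)
    pool = set()
    for gram in query_grams:
        pool |= index.get(gram, set())   # only these words can score > 0
    scored = [(jaccard(query_grams, kgrams(word, k)), word) for word in pool]
    return sorted((pair for pair in scored if pair[0] >= threshold),
                  reverse=True)


# Hypothetical word list for the demo.
words = ["actress", "cress", "caress", "access", "across", "acres", "actor"]
index = build_kgram_index(words)
for score, word in kgram_candidates("acress", index):
    print(f"{word:8s} {score:.3f}")
```

A k-gram filter like this only narrows the candidate set; the survivors would still be checked with edit distance or scored by the noisy channel model.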

A paradigm …
§  We want the best spell corrections
§  Instead of finding the very best, we
    §  Find a subset of pretty good corrections
        §  (say, edit distance at most 2)
    §  Find the best amongst them
§  These may not be the actual best
§  This is a recurring paradigm in IR including finding the best docs for a query, best answers, best ads …
    §  Find a good candidate set
    §  Find the top K amongst them and return them as the best

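Let’s say we’ve generated candidates: Now back to Bayes’ Rule
§  We see an observation x of a misspelled word
§  Find the correct word ŵ

$$
\begin{aligned}
\hat{w} &= \operatorname*{argmax}_{w \in V} P(w \mid x) \\
        &= \operatorname*{argmax}_{w \in V} \frac{P(x \mid w)\,P(w)}{P(x)} \\
        &= \operatorname*{argmax}_{w \in V} P(x \mid w)\,P(w)
\end{aligned}
$$

What’s P(w)?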

Language Model
§  Take a big supply of words (your document collection with T tokens); let C(w) = # occurrences of w

$$
P(w) = \frac{C(w)}{T}
$$

§  In other applications, you can take the supply to be typed queries (suitably filtered), when a static dictionary is inadequate

Unigram Prior probability

word      Frequency of word   P(w)
actress   9,321               .0000230573
cress     220                 .0000005442
caress    686                 .0000016969
access    37,038              .0000916207
across    120,844             .0002989314
acres     12,874              .0000318463

Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)
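The priors in the table follow directly from P(w) = C(w)/T. A minimal sketch, using the counts and corpus size listed above (the function name `unigram_probs` is an illustrative choice):

```python
def unigram_probs(counts, total_tokens):
    """Maximum-likelihood unigram prior: P(w) = C(w) / T."""
    return {word: c / total_tokens for word, c in counts.items()}


# Counts and corpus size as given in the table above.
COCA_COUNTS = {
    "actress": 9_321,
    "cress": 220,
    "caress": 686,
    "access": 37_038,
    "across": 120_844,
    "acres": 12_874,
}
T = 404_253_213

priors = unigram_probs(COCA_COUNTS, T)
for word, p in priors.items():
    print(f"{word:8s} {p:.10f}")
```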

Channel model probability
§  Error model probability, Edit probability
§  Kernighan, Church, Gale 1990
§  Misspelled word $x = x_1, x_2, x_3, \ldots, x_m$
§  Correct word $w = w_1, w_2, w_3, \ldots, w_n$
§  P(x|w) = probability of the edit
    §  (deletion/insertion/substitution/transposition)

Computing error probability: confusion “matrix”
§  del[x,y]: count(xy typed as x)
§  ins[x,y]: count(x typed as xy)
§  sub[x,y]: count(y typed as x)
§  trans[x,y]: count(xy typed as yx)

Insertion and deletion conditioned on previous character

Confusion matrix for substitution

Nearby keys

Generating the confusion matrix
§  Peter Norvig’s list of errors
§  Peter Norvig’s list of counts of single-edit errors
    §  All Peter Norvig’s ngrams data links: http://norvig.com/ngrams/

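Channel model

$$
P(x \mid w) =
\begin{cases}
\dfrac{\mathrm{del}[w_{i-1}, w_i]}{\mathrm{count}[w_{i-1} w_i]}, & \text{if deletion} \\[2ex]
\dfrac{\mathrm{ins}[w_{i-1}, x_i]}{\mathrm{count}[w_{i-1}]}, & \text{if insertion} \\[2ex]
\dfrac{\mathrm{sub}[x_i, w_i]}{\mathrm{count}[w_i]}, & \text{if substitution} \\[2ex]
\dfrac{\mathrm{trans}[w_i, w_{i+1}]}{\mathrm{count}[w_i w_{i+1}]}, & \text{if transposition}
\end{cases}
$$

Kernighan, Church, Gale 1990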

Smoothing probabilities: Add-1 smoothing
§  But if we use the confusion matrix example, unseen errors are impossible!
§  They’ll make the overall probability 0. That seems too harsh
    §  e.g., in Kernighan’s chart q → a and a → q are both 0, even though they’re adjacent on the keyboard!
§  A simple solution is to add 1 to all counts and then, if there is a |A| character alphabet, to normalize appropriately:

$$
\text{If substitution, } P(x \mid w) = \frac{\mathrm{sub}[x, w] + 1}{\mathrm{count}[w] + |A|}
$$
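The smoothed substitution formula can be sketched in a few lines. The counts below are hypothetical numbers invented for the demo, not values from Kernighan's chart, and the 26-letter alphabet is likewise an assumption.

```python
def smoothed_sub_prob(x, w, sub, count, alphabet_size=26):
    """Add-1 smoothed substitution probability:
    P(x|w) = (sub[x, w] + 1) / (count[w] + |A|).

    `sub[(x, w)]` is the number of times intended letter w was typed
    as x; `count[w]` is how often w occurs in the training text.
    """
    return (sub.get((x, w), 0) + 1) / (count.get(w, 0) + alphabet_size)


# Hypothetical toy counts, for illustration only.
sub = {("e", "o"): 93}
count = {"o": 10_000, "a": 8_000}

print(smoothed_sub_prob("e", "o", sub, count))  # seen substitution
print(smoothed_sub_prob("q", "a", sub, count))  # unseen, but no longer 0
```

Without smoothing, the unseen pair (q typed for a) would get probability 0 and eliminate any candidate that needs it.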

Channel model for acress

Candidate Correction   Correct Letter   Error Letter   x|w     P(x|w)
actress                t                -              c|ct    .000117
cress                  -                a              a|#     .00000144
caress                 ca               ac             ac|ca   .00000164
access                 c                r              r|c     .000000209
across                 o                e              e|o     .0000093
acres                  -                s              es|e    .0000321
acres                  -                s              ss|s    .0000342

Noisy channel probability for acress

Candidate Correction   Correct Letter   Error Letter   x|w     P(x|w)       P(w)         10^9 · P(x|w) · P(w)
actress                t                -              c|ct    .000117      .0000231     2.7
cress                  -                a              a|#     .00000144    .000000544   .00078
caress                 ca               ac             ac|ca   .00000164    .00000170    .0028
access                 c                r              r|c     .000000209   .0000916     .019
across                 o                e              e|o     .0000093     .000299      2.8
acres                  -                s              es|e    .0000321     .0000318     1.0
acres                  -                s              ss|s    .0000342     .0000318     1.0
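The last column is just the product of the channel probability and the prior, scaled by 10^9. A short sketch that recomputes it from the rounded values shown in the table and ranks the candidates (the names `ROWS` and `rank` are illustrative):

```python
# (candidate, edit x|w, P(x|w), P(w)), copied from the table above.
ROWS = [
    ("actress", "c|ct",  0.000117,    0.0000231),
    ("cress",   "a|#",   0.00000144,  0.000000544),
    ("caress",  "ac|ca", 0.00000164,  0.00000170),
    ("access",  "r|c",   0.000000209, 0.0000916),
    ("across",  "e|o",   0.0000093,   0.000299),
    ("acres",   "es|e",  0.0000321,   0.0000318),
    ("acres",   "ss|s",  0.0000342,   0.0000318),
]


def rank(rows):
    """Sort candidates by P(x|w) * P(w), highest first."""
    scored = [(p_x_given_w * p_w, word, edit)
              for word, edit, p_x_given_w, p_w in rows]
    return sorted(scored, reverse=True)


for score, word, edit in rank(ROWS):
    print(f"{word:8s} {edit:6s} {1e9 * score:.2g}")
```

Because the inputs are rounded, the recomputed products can differ slightly from the printed column. The ordering at the top is the point: across narrowly beats actress on these context-free scores, which is why the following slides bring in context.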


Evaluation
§  Some spelling error test sets
    §  Wikipedia’s list of common English misspelling
    §  Aspell filtered version of that list
    §  Birkbeck spelling error corpus
    §  Peter Norvig’s list of errors (includes Wikipedia and Birkbeck, for training or testing)

SPELLING CORRECTION WITH THE NOISY CHANNEL

Context-Sensitive Spelling Correction

Real-word spelling errors
§  …leaving in about fifteen minuets to go to her house.
§  The design an construction of the system…
§  Can they lave him my messages?
§  The study was conducted mainly be John Black.

§  25-40% of spelling errors are real words (Kukich 1992)

Context-sensitive spelling error fixing
§  For each word in sentence (phrase, query …)
    §  Generate candidate set
        §  the word itself
        §  all single-letter edits that are English words
        §  words that are homophones
        §  (all of this can be pre-computed!)
    §  Choose best candidates
        §  Noisy channel model

Noisy channel for real-word spell correction
§  Given a sentence $w_1, w_2, w_3, \ldots, w_n$
§  Generate a set of candidates for each word $w_i$
    §  Candidate($w_1$) = {$w_1, w'_1, w''_1, w'''_1, \ldots$}
    §  Candidate($w_2$) = {$w_2, w'_2, w''_2, w'''_2, \ldots$}
    §  Candidate($w_n$) = {$w_n, w'_n, w''_n, w'''_n, \ldots$}
§  Choose the sequence W that maximizes P(W)

Incorporating context words: Context-sensitive spelling correction
§  Determining whether actress or across is appropriate will require looking at the context of use
§  We can do this with a better language model
    §  You learned/can learn a lot about language models in CS124 or CS224N
    §  Here we present just enough to be dangerous/do the assignment
§  A bigram language model conditions the probability of a word on (just) the previous word

$$
P(w_1 \ldots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})
$$

Incorporating context words
§  For unigram counts, P(w) is always non-zero
    §  if our dictionary is derived from the document collection
§  This won’t be true of $P(w_k \mid w_{k-1})$. We need to smooth
§  We could use add-1 smoothing on this conditional distribution
§  But here’s a better way: interpolate a unigram and a bigram:

$$
P_{li}(w_k \mid w_{k-1}) = \lambda\,P_{uni}(w_k) + (1 - \lambda)\,P_{bi}(w_k \mid w_{k-1})
$$

$$
P_{bi}(w_k \mid w_{k-1}) = \frac{C(w_{k-1}, w_k)}{C(w_{k-1})}
$$
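A minimal sketch of the interpolated estimate, trained on a toy token list. The corpus, the helper names, and the value λ = 0.1 are assumptions for illustration; in practice λ would be tuned on held-out data.

```python
from collections import Counter


def train(tokens):
    """Collect unigram counts, bigram counts and the token total T."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(tokens)


def p_uni(w, unigrams, total):
    """Unigram estimate C(w) / T."""
    return unigrams[w] / total


def p_bi(w, prev, unigrams, bigrams):
    """Unsmoothed bigram estimate C(prev, w) / C(prev); 0 if unseen."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0


def p_li(w, prev, unigrams, bigrams, total, lam=0.1):
    """Linear interpolation: lam * P_uni(w) + (1 - lam) * P_bi(w | prev)."""
    return (lam * p_uni(w, unigrams, total)
            + (1 - lam) * p_bi(w, prev, unigrams, bigrams))


# Toy corpus, invented for the demo.
tokens = ("the versatile actress whose talent the actress "
          "showed across the stage").split()
unigrams, bigrams, total = train(tokens)

print(p_bi("across", "versatile", unigrams, bigrams))         # unseen bigram
print(p_li("across", "versatile", unigrams, bigrams, total))  # still non-zero
print(p_li("actress", "versatile", unigrams, bigrams, total))
```

The unseen bigram "versatile across" gets zero probability from the raw bigram estimate but a small positive value after interpolation, because every dictionary word has a non-zero unigram probability.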

All the important fine points
§  Note that we have several probability distributions for words
    §  Keep them straight!
§  You might want/need to work with log probabilities:
    §  $\log P(w_1 \ldots w_n) = \log P(w_1) + \log P(w_2 \mid w_1) + \cdots + \log P(w_n \mid w_{n-1})$
    §  Otherwise, be very careful about floating point underflow
§  Our query may be words anywhere in a document
    §  We’ll start the bigram estimate of a sequence with a unigram estimate
    §  Often, people instead condition on a start-of-sequence symbol, but not good here
    §  Because of this, the unigram and bigram counts have different totals (not a problem)
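The log-probability advice can be sketched as below. The scorer starts with a unigram estimate for the first word and adds a conditional log probability for each later word. The probability functions are passed in as parameters and are assumed to be smoothed, so they never return 0; the constant value used in the demo is an arbitrary illustration.

```python
import math


def sequence_logprob(words, p_uni, p_cond):
    """log P(w1..wn) = log P(w1) + sum over k of log P(wk | wk-1).

    `p_uni(w)` and `p_cond(w, prev)` are caller-supplied probability
    functions that must return values greater than 0.
    """
    logp = math.log(p_uni(words[0]))     # unigram start, no start symbol
    for prev, w in zip(words, words[1:]):
        logp += math.log(p_cond(w, prev))
    return logp


# Underflow demo: multiply 400 probabilities of 0.001 directly.
p = 1e-3
product = 1.0
for _ in range(400):
    product *= p
print(product)   # the true value 1e-1200 is too small for a float

# The same sequence scored in log space stays finite.
words = ["w"] * 400
print(sequence_logprob(words, lambda w: p, lambda w, prev: p))
```

Since log is monotonic, ranking candidate sequences by summed log probability gives the same winner as ranking by the raw product.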

Using a bigram language model
§  “a stellar and versatile acress whose combination of sass and glamour…”
§  Counts from the Corpus of Contemporary American English with add-1 smoothing
§  P(actress|versatile) = .000021    P(whose|actress) = .0010
§  P(across|versatile) = .000021    P(whose|across) = .000006
§  P(“versatile actress whose”) = .000021 × .0010 = 210 × 10^-10
§  P(“versatile across whose”) = .000021 × .000006 = 1 × 10^-10


Noisy channel for real-word spell correction

[Figure: candidate lattice for the input “two of thew”. Candidates per position: two → {two, to, tao, too}; of → {of, off, on}; thew → {thew, threw, thaw, the}; ...]


Simplification: One error per sentence
§  Out of all possible sentences with one word replaced
    §  $w_1, w''_2, w_3, w_4$: two off thew
    §  $w_1, w_2, w'_3, w_4$: two of the
    §  $w'''_1, w_2, w_3, w_4$: too of thew
    §  …
§  Choose the sequence W that maximizes P(W)
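Under the one-error assumption, the search space is the original sentence plus every sentence that differs from it in exactly one position. The sketch below enumerates that space and picks the highest-scoring sentence. The candidate sets come from the “two of thew” example; the scoring table `TOY_LOGP` is entirely hypothetical and stands in for a real language model and channel model.

```python
def one_edit_sentences(words, candidates):
    """Yield the original sentence, then every sentence obtained by
    replacing exactly one word with one of its candidates."""
    words = list(words)
    yield words
    for i, w in enumerate(words):
        for alt in candidates.get(w, ()):
            if alt != w:
                yield words[:i] + [alt] + words[i + 1:]


def best_sentence(words, candidates, score):
    """Return the sentence in the one-edit space with the highest score."""
    return max(one_edit_sentences(words, candidates), key=score)


# Candidate sets for the example sentence.
CANDIDATES = {
    "two": ["to", "tao", "too"],
    "of": ["off", "on"],
    "thew": ["threw", "thaw", "the"],
}

# Hypothetical log scores, invented for this demo; unlisted sentences
# fall back to a low default.
TOY_LOGP = {
    ("two", "of", "the"): -5.0,
    ("two", "of", "thew"): -9.0,
    ("too", "of", "thew"): -12.0,
}


def toy_score(sentence):
    return TOY_LOGP.get(tuple(sentence), -100.0)


sentence = ["two", "of", "thew"]
print(len(list(one_edit_sentences(sentence, CANDIDATES))))
print(best_sentence(sentence, CANDIDATES, toy_score))
```

The space grows linearly with sentence length (one plus the total number of alternatives), rather than as the product of the candidate set sizes.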

Where to get the probabilities
§  Language model
    §  Unigram
    §  Bigram
    §  etc.
§  Channel model
    §  Same as for non-word spelling correction
    §  Plus need probability for no error, P(w|w)

Probability of no error
§  What is the channel probability for a correctly typed word?
§  P(“the”|“the”)
    §  If you have a big corpus, you can estimate this percent correct
§  But this value depends strongly on the application
    §  .90 (1 error in 10 words)
    §  .95 (1 error in 20 words)
    §  .99 (1 error in 100 words)

Peter Norvig’s “thew” example

x      w       x|w     P(x|w)     P(w)         10^9 · P(x|w) · P(w)
thew   the     ew|e    0.000007   0.02         144
thew   thew            0.95       0.00000009   90
thew   thaw    e|a     0.001      0.0000007    0.7
thew   threw   h|hr    0.000008   0.000004     0.03
thew   thwe    ew|we   0.000003   0.00000004   0.0001

State of the art noisy channel
§  We never just multiply the prior and the error model
§  Independence assumptions → probabilities not commensurate
§  Instead: Weight them

$$
\hat{w} = \operatorname*{argmax}_{w \in V} P(x \mid w)\,P(w)^{\lambda}
$$

§  Learn λ from a development test set
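To see what the weight does, the sketch below scores two acress candidates in log space as log P(x|w) + λ log P(w), using the rounded probabilities from the acress table earlier in the lecture. The two λ values tried are arbitrary illustrations, not tuned settings.

```python
import math


def weighted_score(p_x_given_w, p_w, lam):
    """Log of P(x|w) * P(w)**lam."""
    return math.log(p_x_given_w) + lam * math.log(p_w)


def best(cands, lam):
    """Candidate with the highest weighted noisy channel score."""
    return max(cands,
               key=lambda w: weighted_score(cands[w][0], cands[w][1], lam))


# (P(x|w), P(w)) for two candidates, from the acress table.
CANDS = {
    "actress": (0.000117, 0.0000231),
    "across": (0.0000093, 0.000299),
}

for lam in (1.0, 0.5):
    print(lam, best(CANDS, lam))
```

With λ = 1 this is the plain product and across wins narrowly; with λ = 0.5 the prior counts for less and actress wins. In practice λ would be chosen by trying several values on a development set and keeping the one that corrects the most errors.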

Improvements to channel model
§  Allow richer edits (Brill and Moore 2000)
    §  ent → ant
    §  ph → f
    §  le → al
§  Incorporate pronunciation into channel (Toutanova and Moore 2002)
§  Incorporate device into channel
    §  Not all Android phones need have the same error model
    §  But spell correction may be done at the system level