
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Pandu Nayak

Spelling Correction

1

The course thus far …
§ Index construction, index compression, efficient boolean querying
  § Chapters 1, 2, 4, 5; Coursera lectures 1, 2, 3, 4
§ Spelling correction
  § Chapter 3; Coursera lecture 5 (mainly some parts); this lecture (PA #2!)

2

Applications for spelling correction
§ Web search
§ Phones
§ Word processing

3

Rates of spelling errors
§ 26%: Web queries (Wang et al. 2003)
§ 13%: Retyping, no backspace (Whitelaw et al., English & German)
§ 7%: Words corrected retyping on phone-sized organizer
§ 2%: Words uncorrected on organizer (Soukoreff & MacKenzie 2003)
§ 1–2%: Retyping (Kane and Wobbrock 2007, Grudin et al. 1983)

Depending on the application, ~1–20% error rates

4

Spelling Tasks
§ Spelling Error Detection
§ Spelling Error Correction:
  § Autocorrect
    § hte → the
  § Suggest a correction
  § Suggestion lists

5

Types of spelling errors
§ Non-word Errors
  § graffe → giraffe
§ Real-word Errors
  § Typographical errors: three → there
  § Cognitive Errors (homophones): piece → peace, too → two, your → you're
§ Non-word correction was historically mainly context insensitive
§ Real-word correction almost always needs to be context sensitive

6


Non-word spelling errors
§ Non-word spelling error detection:
  § Any word not in a dictionary is an error
  § The larger the dictionary the better … up to a point
  § (The Web is full of misspellings, so the Web isn't necessarily a great dictionary …)
§ Non-word spelling error correction:
  § Generate candidates: real words that are similar to the error
  § Choose the one which is best:
    § Shortest weighted edit distance
    § Highest noisy channel probability

7

Real-word & non-word spelling errors
§ For each word w, generate candidate set:
  § Find candidate words with similar pronunciations
  § Find candidate words with similar spellings
  § Include w in candidate set
§ Choose best candidate
  § Noisy Channel view of spell errors
  § Context-sensitive – so have to consider whether the surrounding words "make sense"
    § Flying form Heathrow to LAX → Flying from Heathrow to LAX

8

Terminology
§ These are character bigrams:
  § st, pr, an …
§ These are word bigrams:
  § palo alto, flying from, road repairs
§ Similarly trigrams, k-grams etc.
§ In today's class, we will generally deal with word bigrams
§ In the accompanying Coursera lecture, we mostly deal with character bigrams (because we cover stuff complementary to what we're discussing here)

9

INDEPENDENT WORD SPELLING CORRECTION
The Noisy Channel Model of Spelling

Noisy Channel Intuition

11

Noisy Channel = Bayes' Rule
§ We see an observation x of a misspelled word
§ Find the correct word ŵ

  ŵ = argmax_{w∈V} P(w | x)
    = argmax_{w∈V} P(x | w) P(w) / P(x)      [Bayes]
    = argmax_{w∈V} P(x | w) P(w)

12
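In code, this final selection is just an argmax over a candidate set, given some channel model P(x|w) and some language model P(w). A minimal Python sketch; the candidate set and the channel_prob and prior functions are placeholders for the components developed in the following slides:

    def correct(x, candidates, channel_prob, prior):
        """Noisy channel selection: the candidate w maximizing P(x|w) * P(w).

        x            : observed (possibly misspelled) word
        candidates   : iterable of dictionary words considered as corrections
        channel_prob : function (x, w) -> P(x|w), the error/edit model
        prior        : function w -> P(w), the language model
        """
        return max(candidates, key=lambda w: channel_prob(x, w) * prior(w))

    # Hypothetical usage, once the pieces below exist:
    # correct("acress", ["actress", "across", "acres"], channel_prob, prior)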


History: Noisy channel for spelling proposed around 1990
§ IBM
  § Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 23(5), 517–522
§ AT&T Bell Labs
  § Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205–210

Non-word spelling error example

acress

14

Candidate generation
§ Words with similar spelling
  § Small edit distance to error
§ Words with similar pronunciation
  § Small distance of pronunciation to error
§ In this class lecture we mostly won't dwell on efficient candidate generation
§ A lot more about candidate generation in the accompanying Coursera material

15

Candidate Testing: Damerau-Levenshtein edit distance
§ Minimal edit distance between two strings, where edits are:
  § Insertion
  § Deletion
  § Substitution
  § Transposition of two adjacent letters
§ See IIR sec 3.3.3 for edit distance

16
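For reference, a minimal dynamic-programming sketch of this distance (the restricted "optimal string alignment" variant, a common simplification of full Damerau-Levenshtein); all four edit operations cost 1:

    def dl_distance(a, b):
        """Edit distance with insertion, deletion, substitution and
        transposition of two adjacent letters, each costing 1."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]    # d[i][j] = distance(a[:i], b[:j])
        for i in range(m + 1):
            d[i][0] = i                               # delete all of a[:i]
        for j in range(n + 1):
            d[0][j] = j                               # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution / match
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
        return d[m][n]

    # Every candidate in the next slide's table is within distance 1 of "acress":
    assert dl_distance("acress", "actress") == 1      # deletion of t
    assert dl_distance("acress", "caress") == 1       # transposition of ac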

Words within 1 of acress

Error    Candidate Correction   Correct Letter   Error Letter   Type
acress   actress                t                -              deletion
acress   cress                  -                a              insertion
acress   caress                 ca               ac             transposition
acress   access                 c                r              substitution
acress   across                 o                e              substitution
acress   acres                  -                s              insertion

17

Candidate generation
§ 80% of errors are within edit distance 1
§ Almost all errors within edit distance 2
§ Also allow insertion of space or hyphen
  § thisidea → this idea
  § inlaw → in-law
§ Can also allow merging words
  § data base → database
§ For short texts like a query, can just regard the whole string as one item from which to produce edits

18


How do you generate the candidates?
1. Run through the dictionary, check edit distance with each word
2. Generate all words within edit distance ≤ k (e.g., k = 1 or 2) and then intersect them with the dictionary (sketched below)
3. Use a character k-gram index and find dictionary words that share "most" k-grams with the word (e.g., by Jaccard coefficient)
   § see IIR sec 3.3.4
4. Compute them fast with a Levenshtein finite state transducer
5. Have a precomputed map of words to possible corrections

19
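Option 2 is the approach popularized by Peter Norvig's spelling corrector: enumerate every string within one edit of the term and keep the ones that are dictionary words. A minimal sketch, assuming dictionary is a Python set of known words:

    import string

    def edits1(word):
        """All strings one edit away from word (most are not real words)."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def candidates(word, dictionary):
        """Dictionary words within edit distance <= 2 of word, plus word itself."""
        e1 = edits1(word)
        e2 = {w2 for w1 in e1 for w2 in edits1(w1)}
        return ({word} | e1 | e2) & dictionary

    # Hypothetical usage:
    # candidates("acress", {"actress", "across", "access", "acres", "caress", "cress"})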

A paradigm …
§ We want the best spell corrections
§ Instead of finding the very best, we
  § Find a subset of pretty good corrections (say, edit distance at most 2)
  § Find the best amongst them
§ These may not be the actual best
§ This is a recurring paradigm in IR, including finding the best docs for a query, best answers, best ads …
  § Find a good candidate set
  § Find the top K amongst them and return them as the best

20

Let's say we've generated candidates: Now back to Bayes' Rule
§ We see an observation x of a misspelled word
§ Find the correct word ŵ

  ŵ = argmax_{w∈V} P(w | x)
    = argmax_{w∈V} P(x | w) P(w) / P(x)
    = argmax_{w∈V} P(x | w) P(w)

What's P(w)?

21

Language Model
§ Take a big supply of words (your document collection with T tokens); let C(w) = # occurrences of w

  P(w) = C(w) / T

§ In other applications – you can take the supply to be typed queries (suitably filtered) – when a static dictionary is inadequate

22
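A minimal sketch of this unigram estimate, assuming the collection is available as a list of tokens:

    from collections import Counter

    def unigram_lm(tokens):
        """Maximum-likelihood unigram model: P(w) = C(w) / T."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return lambda w: counts[w] / total

    # Toy example:
    # P = unigram_lm("the cat sat on the mat".split())
    # P("the")   # -> 2/6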

Unigram Prior probability

word      Frequency of word   P(w)
actress   9,321               .0000230573
cress     220                 .0000005442
caress    686                 .0000016969
access    37,038              .0000916207
across    120,844             .0002989314
acres     12,874              .0000318463

Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

23

Channel model probability
§ Error model probability, Edit probability
  § Kernighan, Church, Gale 1990
§ Misspelled word x = x1, x2, x3 … xm
§ Correct word w = w1, w2, w3, …, wn
§ P(x|w) = probability of the edit
  § (deletion/insertion/substitution/transposition)

24


Computing error probability: confusion "matrix"
  del[x,y]:   count(xy typed as x)
  ins[x,y]:   count(x typed as xy)
  sub[x,y]:   count(y typed as x)
  trans[x,y]: count(xy typed as yx)
Insertion and deletion conditioned on previous character

25

Confusion matrix for substitution

Nearby keys

Generating the confusion matrix
§ Peter Norvig's list of errors
§ Peter Norvig's list of counts of single-edit errors
§ All Peter Norvig's ngrams data links: http://norvig.com/ngrams/

28
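A sketch of how such single-edit error counts can be turned into the four tables of the previous slides. It assumes the errors are already available as (intended, typed) pairs differing by exactly one edit – it does not parse Norvig's actual file formats – and uses '#' for the start-of-word context, as in the lecture's a|# entry:

    from collections import defaultdict

    # del_counts[(p, c)]   : "p c" typed as "p"     (c deleted after p)
    # ins_counts[(p, c)]   : "p"   typed as "p c"   (c inserted after p)
    # sub_counts[(t, c)]   : "c"   typed as "t"     (t substituted for c)
    # trans_counts[(c, d)] : "c d" typed as "d c"   (adjacent transposition)
    del_counts, ins_counts = defaultdict(int), defaultdict(int)
    sub_counts, trans_counts = defaultdict(int), defaultdict(int)

    def record_error(intended, typed):
        """Classify one single-edit error pair and update the matching table."""
        w, x = "#" + intended, "#" + typed          # pad with start-of-word symbol
        i = 0
        while i < min(len(w), len(x)) and w[i] == x[i]:
            i += 1                                  # first position where they differ
        if len(x) == len(w) - 1:                    # deletion: w[i] missing from x
            del_counts[(w[i - 1], w[i])] += 1
        elif len(x) == len(w) + 1:                  # insertion: x[i] is extra
            ins_counts[(w[i - 1], x[i])] += 1
        elif w[i + 1:] == x[i + 1:]:                # substitution at position i
            sub_counts[(x[i], w[i])] += 1
        else:                                       # adjacent transposition
            trans_counts[(w[i], w[i + 1])] += 1

    record_error("the", "hte")        # trans_counts[('t', 'h')] += 1
    record_error("acres", "acress")   # ins_counts[('s', 's')] += 1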

Channel model

  P(x|w) =
    del[w_{i−1}, w_i] / count[w_{i−1} w_i],      if deletion
    ins[w_{i−1}, x_i] / count[w_{i−1}],          if insertion
    sub[x_i, w_i] / count[w_i],                  if substitution
    trans[w_i, w_{i+1}] / count[w_i w_{i+1}],    if transposition

Kernighan, Church, Gale 1990

29

Smoothing probabilities: Add-1 smoothing
§ But if we use the confusion matrix example, unseen errors are impossible!
§ They'll make the overall probability 0. That seems too harsh
  § e.g., in Kernighan's chart q→a and a→q are both 0, even though they're adjacent on the keyboard!
§ A simple solution is to add 1 to all counts and then, if there is a |A|-character alphabet, to normalize appropriately:

  If substitution: P(x|w) = (sub[x,w] + 1) / (count[w] + |A|)

30
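A direct transcription of the previous slide's four cases, with the add-1 smoothed substitution from this slide. The count tables are assumed to be dicts keyed by character pairs (for instance, ones accumulated from error lists as sketched earlier); char_counts and bigram_counts are letter unigram/bigram counts from the training corpus and are assumed non-zero in this sketch:

    def p_del(w_prev, w_i, del_counts, bigram_counts):
        """P(x|w) when "w_prev w_i" was typed as just "w_prev"."""
        return del_counts.get((w_prev, w_i), 0) / bigram_counts[(w_prev, w_i)]

    def p_ins(w_prev, x_i, ins_counts, char_counts):
        """P(x|w) when "w_prev" was typed as "w_prev x_i"."""
        return ins_counts.get((w_prev, x_i), 0) / char_counts[w_prev]

    def p_sub(x_i, w_i, sub_counts, char_counts, alphabet_size=26):
        """P(x|w) when w_i was typed as x_i, add-1 smoothed as on this slide."""
        return (sub_counts.get((x_i, w_i), 0) + 1) / (char_counts[w_i] + alphabet_size)

    def p_trans(w_i, w_next, trans_counts, bigram_counts):
        """P(x|w) when "w_i w_next" was typed as "w_next w_i"."""
        return trans_counts.get((w_i, w_next), 0) / bigram_counts[(w_i, w_next)]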


Channel model for acress

Candidate Correction   Correct Letter   Error Letter   x|w     P(x|w)
actress                t                -              c|ct    .000117
cress                  -                a              a|#     .00000144
caress                 ca               ac             ac|ca   .00000164
access                 c                r              r|c     .000000209
across                 o                e              e|o     .0000093
acres                  -                s              es|e    .0000321
acres                  -                s              ss|s    .0000342

31

Noisy channel probability for acress

Candidate Correction   Correct Letter   Error Letter   x|w     P(x|w)       P(w)         10^9 × P(x|w) P(w)
actress                t                -              c|ct    .000117      .0000231     2.7
cress                  -                a              a|#     .00000144    .000000544   .00078
caress                 ca               ac             ac|ca   .00000164    .00000170    .0028
access                 c                r              r|c     .000000209   .0000916     .019
across                 o                e              e|o     .0000093     .000299      2.8
acres                  -                s              es|e    .0000321     .0000318     1.0
acres                  -                s              ss|s    .0000342     .0000318     1.0

32


Evaluation
§ Some spelling error test sets
  § Wikipedia's list of common English misspellings
  § Aspell filtered version of that list
  § Birkbeck spelling error corpus
  § Peter Norvig's list of errors (includes Wikipedia and Birkbeck, for training or testing)

34

SPELLING CORRECTION WITH THE NOISY CHANNEL
Context-Sensitive Spelling Correction

Real-word spelling errors
§ …leaving in about fifteen minuets to go to her house.
§ The design an construction of the system…
§ Can they lave him my messages?
§ The study was conducted mainly be John Black.
§ 25–40% of spelling errors are real words (Kukich 1992)

36


Context-sensitive spelling error fixing
§ For each word in sentence (phrase, query …)
  § Generate candidate set
    § the word itself
    § all single-letter edits that are English words
    § words that are homophones
    § (all of this can be pre-computed!)
  § Choose best candidates
    § Noisy channel model

37

Noisy channel for real-word spell correction
§ Given a sentence w1, w2, w3, …, wn
§ Generate a set of candidates for each word wi
  § Candidate(w1) = {w1, w'1, w''1, w'''1, …}
  § Candidate(w2) = {w2, w'2, w''2, w'''2, …}
  § Candidate(wn) = {wn, w'n, w''n, w'''n, …}
§ Choose the sequence W that maximizes P(W)

Incorporating context words: Context-sensitive spelling correction
§ Determining whether actress or across is appropriate will require looking at the context of use
§ We can do this with a better language model
  § You learned / can learn a lot about language models in CS124 or CS224N
  § Here we present just enough to be dangerous / to do the assignment
§ A bigram language model conditions the probability of a word on (just) the previous word

  P(w1 … wn) = P(w1) P(w2|w1) … P(wn|wn−1)

39

Incorporating context words
§ For unigram counts, P(w) is always non-zero
  § if our dictionary is derived from the document collection
§ This won't be true of P(wk|wk−1). We need to smooth
  § We could use add-1 smoothing on this conditional distribution
  § But here's a better way – interpolate a unigram and a bigram:

  Pli(wk|wk−1) = λ Puni(wk) + (1 − λ) Pbi(wk|wk−1)
  Pbi(wk|wk−1) = C(wk−1, wk) / C(wk−1)

40
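A minimal sketch of the interpolated estimate, collecting unigram and adjacent-pair counts from a token list; lam plays the role of λ and is a tuning choice:

    from collections import Counter

    def make_interpolated_lm(tokens, lam=0.5):
        """Return Pli(w | w_prev) = lam * Puni(w) + (1 - lam) * Pbi(w | w_prev)."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        total = sum(unigrams.values())

        def p_li(w, w_prev):
            p_uni = unigrams[w] / total
            p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
            return lam * p_uni + (1 - lam) * p_bi

        return p_li

    # Hypothetical usage:
    # p = make_interpolated_lm(corpus_tokens)
    # p("actress", "versatile")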

All the important fine points
§ Note that we have several probability distributions for words
  § Keep them straight!
§ You might want/need to work with log probabilities:
  § log P(w1 … wn) = log P(w1) + log P(w2|w1) + … + log P(wn|wn−1)
  § Otherwise, be very careful about floating point underflow
§ Our query may be words anywhere in a document
  § We'll start the bigram estimate of a sequence with a unigram estimate
  § Often, people instead condition on a start-of-sequence symbol, but that's not good here
  § Because of this, the unigram and bigram counts have different totals – not a problem

41
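In log space the product over the sequence becomes a sum, which avoids underflow on long sequences. A tiny sketch; p_first and p_next stand for whichever unigram and conditional estimates you use, and every probability is assumed non-zero (e.g. the interpolated estimate above):

    import math

    def log_prob_sequence(words, p_first, p_next):
        """log P(w1 ... wn) = log P(w1) + sum over k of log P(wk | wk-1)."""
        logp = math.log(p_first(words[0]))
        for prev, w in zip(words, words[1:]):
            logp += math.log(p_next(w, prev))
        return logp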

Using a bigram language model
§ "a stellar and versatile acress whose combination of sass and glamour…"
§ Counts from the Corpus of Contemporary American English with add-1 smoothing
§ P(actress|versatile) = .000021      P(whose|actress) = .0010
§ P(across|versatile) = .000021       P(whose|across) = .000006
§ P("versatile actress whose") = .000021 × .0010 = 210 × 10^−10
§ P("versatile across whose") = .000021 × .000006 = 1 × 10^−10

42


Noisy channel for real-word spell correction

[Figure: candidate lattice for the observed phrase "two of thew" – candidates include to / too / tao / two for "two", off / on / of for "of", and threw / thaw / the / thew for "thew", with one path through the lattice per candidate sentence.]

44–45

Simplification: One error per sentence
§ Out of all possible sentences with one word replaced
  § w1, w''2, w3, w4      two off thew
  § w1, w2, w'3, w4       two of the
  § w'''1, w2, w3, w4     too of thew
  § …
§ Choose the sequence W that maximizes P(W)

Where to get the probabilities
§ Language model
  § Unigram
  § Bigram
  § etc.
§ Channel model
  § Same as for non-word spelling correction
  § Plus need probability for no error, P(w|w)

47

Probability of no error
§ What is the channel probability for a correctly typed word?
  § P("the"|"the")
§ If you have a big corpus, you can estimate this percent correct
§ But this value depends strongly on the application
  § .90 (1 error in 10 words)
  § .95 (1 error in 20 words)
  § .99 (1 error in 100 words)

48
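Pulling the last few slides together, a sketch of the one-error-per-sentence search: every candidate sentence is scored by the language model plus the channel model for the single changed word, with the constant P(w|w) applied to every word left unchanged. The helpers candidates_for and channel_prob stand for the components sketched earlier, lm_log_prob is a sequence scorer like the one above with the language model already bound in, and p_no_error is the application-dependent choice discussed on this slide:

    import math

    def correct_sentence(words, candidates_for, channel_prob, lm_log_prob,
                         p_no_error=0.95):
        """Best sentence among those that change at most one word of `words`."""
        best, best_score = list(words), None
        for i, x in enumerate(words):
            for w in candidates_for(x):                     # includes x itself
                cand = list(words)
                cand[i] = w
                score = lm_log_prob(cand)                   # language model
                # channel model: edit probability for the changed word,
                # P(w|w) = p_no_error for every unchanged word
                score += math.log(channel_prob(x, w) if w != x else p_no_error)
                score += (len(words) - 1) * math.log(p_no_error)
                if best_score is None or score > best_score:
                    best, best_score = cand, score
        return best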


Peter Norvig's "thew" example

x      w       x|w     P(x|w)      P(w)          10^9 P(x|w) P(w)
thew   the     ew|e    0.000007    0.02          144
thew   thew            0.95        0.00000009    90
thew   thaw    e|a     0.001       0.0000007     0.7
thew   threw   h|hr    0.000008    0.000004      0.03
thew   thwe    ew|we   0.000003    0.00000004    0.0001

49

State of the art noisy channel
§ We never just multiply the prior and the error model
  § Independence assumptions → probabilities not commensurate
§ Instead: Weight them

  ŵ = argmax_{w∈V} P(x|w) P(w)^λ

§ Learn λ from a development test set

50
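A sketch of the weighted objective in log space, with λ picked by a simple grid search over a development set; the grid values and the (misspelling, correct word, candidate set) format of the dev set are illustrative assumptions:

    import math

    def weighted_score(x, w, channel_prob, prior, lam):
        """log [ P(x|w) * P(w)^lam ] = log P(x|w) + lam * log P(w)"""
        return math.log(channel_prob(x, w)) + lam * math.log(prior(w))

    def tune_lambda(dev_set, channel_prob, prior, grid=(0.5, 1.0, 1.5, 2.0, 2.5)):
        """Pick the lambda that corrects the most development examples.
        dev_set: iterable of (misspelling, correct_word, candidate_words) triples."""
        def accuracy(lam):
            return sum(
                max(cands, key=lambda w: weighted_score(x, w, channel_prob, prior, lam)) == gold
                for x, gold, cands in dev_set)
        return max(grid, key=accuracy)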

Improvements to channel model
§ Allow richer edits (Brill and Moore 2000)
  § ent → ant
  § ph → f
  § le → al
§ Incorporate pronunciation into channel (Toutanova and Moore 2002)
§ Incorporate device into channel
  § Not all Android phones need have the same error model
  § But spell correction may be done at the system level

51