Post on 25-Sep-2020
Information Retrieval
Dr. Qaiser Abbas
Department of Computer Science & IT,
University of Sargodha, Sargodha, 40100, Pakistan
qaiser.abbas@uos.edu.pk
Saturday, 27 February 16
Processing Boolean queries
• How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing the simple conjunctive query: Brutus AND Calpurnia over the inverted index partially shown in Figure 1.3. We:
– 1. Locate Brutus in the Dictionary
– 2. Retrieve its postings
– 3. Locate Calpurnia in the Dictionary
– 4. Retrieve its postings
– 5. Intersect the two postings lists, as shown in Figure 1.5.
• The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms.
– (This operation is sometimes referred to as merging postings lists. This slightly counterintuitive name (counterintuitive: contrary to intuition or to common-sense expectation) reflects using the term merge algorithm for a general family of algorithms; here we are merging the lists with a logical AND operation.)
• There is a simple and effective method of intersecting postings lists using the merge algorithm (see Figure 1.6).
• We maintain pointers into both lists and walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
• At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID.
• If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations.
• To use this algorithm, it is crucial that postings be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this.
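The pointer walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's pseudocode: postings lists are assumed to be plain Python lists of integer docIDs in ascending order, list indices stand in for the pointers, and the two sample lists are illustrative values rather than data taken from the slides.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by ascending docID."""
    answer = []
    i, j = 0, 0  # indices play the role of the two pointers
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance the pointer that points to the smaller docID.
            i += 1
        else:
            j += 1
    return answer


# Illustrative postings lists (assumed values).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each loop iteration advances at least one index, so the loop runs at most x + y times, which matches the O(x + y) bound. The result is only correct if both inputs are sorted by the same ordering.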
• Exercise 1.4 (page 12) For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
a. Brutus AND NOT Caesar
b. Brutus OR NOT Caesar
Solution a. Page 10 of the book gives the cost of intersecting two postings lists as O(x + y), where x and y are the lengths of the postings lists to be intersected. For the given condition Brutus AND NOT Caesar, consider the following postings lists.
Case 1 - when the postings list for Brutus has fewer postings than that for Caesar:
– Brutus: 1 3 10 21
– Caesar: 1 6 9 23 45 57
We have to find the set of documents that contain Brutus and do not contain Caesar. We use the following logic.
Position pointer p1 at the first posting in the postings list for Brutus and pointer p2 at the first posting in the postings list for Caesar. Compare the docIDs pointed to by each pointer, DocID(p1) and DocID(p2):
1. If DocID(p1) = DocID(p2), the document contains both Brutus and Caesar. We do not want this, so move to the next posting in both lists and return to the compare operation.
2. If DocID(p1) < DocID(p2), the document DocID(p1) contains Brutus AND NOT Caesar. This is what we want, so store DocID(p1) in an answer array, move the pointer for Brutus to the next posting in its list, and return to the compare operation.
3. If DocID(p1) > DocID(p2), move the pointer for Caesar to the next posting and return to the compare operation.
We run the compare-and-increment loop until p1 points to NULL, and then we stop. If p2 reaches NULL first, every remaining posting for Brutus goes into the answer.
We do not need to run the operation until p2 points to NULL, as we only require the docIDs that contain Brutus; e.g. we do not have to consider postings 45 and 57 in the list for Caesar.
Answer = 3, 10, 21
Thus, the complexity of querying is O(x + y1), where x is the length of the postings list for the term that has to be in the expression and y1 is the length of the postings list traversed, for the term to be excluded, by the time p1 reaches NULL. In this case, O(x + y1) <= O(x + y), since y1 <= y.
Case 2 - when the postings list for Brutus is longer than that for Caesar:
– Brutus: 1 5 11 21 45 55
– Caesar: 1 11 170
Even in this case, the entire length of the postings list for Brutus has to be traversed (x), but only part of the postings list for Caesar has to be traversed (y1), until p1 reaches NULL. Thus the complexity of querying is O(x + y1), where y1 <= y.
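The AND NOT walk from Solution a can be sketched as follows. This is an illustrative sketch under stated assumptions: postings lists are sorted Python lists of docIDs, the function name and_not is hypothetical, and the explicit check for an exhausted exclusion list is added here so that the code also works when the Caesar list runs out first.

```python
def and_not(keep, exclude):
    """Return docIDs that occur in `keep` but not in `exclude`.

    Both inputs must be sorted by ascending docID.
    """
    answer = []
    i, j = 0, 0  # i plays the role of p1, j the role of p2
    while i < len(keep):
        if j == len(exclude):
            # Exclusion list exhausted: all remaining postings qualify.
            answer.extend(keep[i:])
            break
        if keep[i] == exclude[j]:
            # Document contains both terms: skip it in both lists.
            i += 1
            j += 1
        elif keep[i] < exclude[j]:
            # Document contains the first term only: keep it.
            answer.append(keep[i])
            i += 1
        else:
            j += 1
    return answer


# Case 1 from the slides.
print(and_not([1, 3, 10, 21], [1, 6, 9, 23, 45, 57]))  # [3, 10, 21]
# Case 2 from the slides.
print(and_not([1, 5, 11, 21, 45, 55], [1, 11, 170]))   # [5, 21, 45, 55]
```

The loop stops as soon as the first list is exhausted, so postings 45 and 57 in Case 1 are never inspected, which is the O(x + y1) behaviour described above.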
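Solution b. For Brutus OR NOT Caesar, we need to find documents that either contain the term Brutus (with or without Caesar) or do not contain the term Caesar (with or without Brutus).
– Brutus: 1 5 11 21 45 55
– Caesar: 1 11 170
– Other: 2 10 11 33 34
Here the entire length (x) of the postings list for Brutus has to be traversed to find the docIDs that contain Brutus. The entire length (z) of the postings list for Other has to be traversed. Similarly, the length (y1) of the postings list for Caesar has to be traversed until x and z both reach NULL. Thus the complexity of querying is O(x + y1 + z), where y1 <= y. Since z depends on the rest of the collection rather than on x and y, the query can no longer be answered in time O(x + y) in general.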
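A small sketch can make the extra cost of OR NOT concrete. The assumptions here are illustrative only: the function name or_not is hypothetical, and the collection is taken to consist of exactly the docIDs that appear in the three lists above, passed in as a sorted list all_doc_ids.

```python
def or_not(include, exclude, all_doc_ids):
    """Return docIDs that are in `include` or are absent from `exclude`.

    All three arguments must be sorted by ascending docID.
    """
    answer = []
    i, j = 0, 0
    for doc_id in all_doc_ids:
        # Move each pointer forward to the first posting >= doc_id.
        while i < len(include) and include[i] < doc_id:
            i += 1
        while j < len(exclude) and exclude[j] < doc_id:
            j += 1
        in_include = i < len(include) and include[i] == doc_id
        in_exclude = j < len(exclude) and exclude[j] == doc_id
        if in_include or not in_exclude:
            answer.append(doc_id)
    return answer


brutus = [1, 5, 11, 21, 45, 55]
caesar = [1, 11, 170]
other = [2, 10, 11, 33, 34]
# Assumed collection: every docID mentioned in the three lists.
all_doc_ids = sorted(set(brutus) | set(caesar) | set(other))
print(or_not(brutus, caesar, all_doc_ids))
# [1, 2, 5, 10, 11, 21, 33, 34, 45, 55]
```

Only document 170 is excluded, because it contains Caesar but not Brutus. The outer loop visits every docID in the collection once, so the running time grows with the collection size and not only with the two postings lists.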
• We can extend the intersection operation to process more complicated queries like:
– (Brutus OR Caesar) AND NOT Calpurnia
• Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
– A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing?
• Consider a query that is an AND of t terms, for instance:
– Brutus AND Caesar AND Calpurnia
• For each of the t terms, we need to get its postings, then AND them together. The standard heuristic is to process terms in order of increasing document frequency:
– if we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. So, for the postings lists in Figure 1.3, we execute the above query as:
– (Calpurnia AND Brutus) AND Caesar
• This is a first justification for keeping the frequency of terms in the dictionary: it allows us to make this ordering decision based on in-memory data before accessing any postings list.
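The ordering heuristic can be sketched as below. This is an illustration under assumptions: the index is a plain Python dict from term to sorted postings list, the length of each list stands in for the document frequency that a real dictionary would store, the postings values are made up for the example, and intersect_many is a hypothetical name.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(terms, index):
    """AND together several terms, rarest term first."""
    if not terms:
        return [], []
    # Document frequency is approximated by postings list length.
    ordered = sorted(terms, key=lambda term: len(index[term]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index[term])
    return ordered, result


# Illustrative index (assumed values).
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132, 148],
    "Calpurnia": [2, 31, 54, 101],
}
order, docs = intersect_many(["Brutus", "Caesar", "Calpurnia"], index)
print(order)  # ['Calpurnia', 'Brutus', 'Caesar']
print(docs)   # [2]
```

Starting with the shortest list keeps every intermediate result no longer than that list, so later intersections touch fewer postings.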
• Consider now the optimization of more general queries, such as:
– (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)
• As before, we will get the frequencies for all terms, and we can then (conservatively) estimate the size of each OR by the sum of the frequencies of its disjuncts. We can then process the query in increasing order of the size of each disjunctive term.
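The size estimate for each OR group can be sketched as follows. The document frequencies below are hypothetical numbers chosen only to illustrate the ordering, and order_conjuncts is an assumed helper name.

```python
def order_conjuncts(conjuncts, doc_freq):
    """Order OR groups by the sum of their terms' document frequencies."""
    def estimated_size(disjuncts):
        # Upper bound on the size of the OR: sum of the frequencies.
        return sum(doc_freq[term] for term in disjuncts)

    return sorted(conjuncts, key=estimated_size)


# Hypothetical document frequencies (not from the slides).
doc_freq = {
    "madding": 10, "crowd": 500,
    "ignoble": 40, "strife": 120,
    "killed": 900, "slain": 60,
}
query = [("madding", "crowd"), ("ignoble", "strife"), ("killed", "slain")]
print(order_conjuncts(query, doc_freq))
# [('ignoble', 'strife'), ('madding', 'crowd'), ('killed', 'slain')]
```

The sum is conservative because a document containing both disjuncts is counted twice, so the true size of each OR can be smaller than the estimate.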
• Exercise 1.7 [⋆]
– Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
– given the following postings list sizes:
• Exercise 1.7 [⋆]
– (kaleidoscope OR eyes) (300,321) AND (tangerine OR trees) (363,465) AND (marmalade OR skies) (379,571)
– However, depending on the actual distribution of postings, (tangerine OR trees) may well be longer than (marmalade OR skies), because the two components of the former are more asymmetric.
Assignment No. 2
The extended Boolean model versus ranked retrieval
• The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query.
• A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, so Boolean search systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity operators.
• A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
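A word-distance proximity check can be sketched as below. The assumptions are illustrative: a document is a list of tokens, "within k words" is taken to mean that the two token positions differ by at most k, and the helper names positions and within_k are hypothetical. Sentence and paragraph units are not modelled.

```python
def positions(tokens, term):
    """Return the sorted token positions at which `term` occurs."""
    return [i for i, token in enumerate(tokens) if token == term]


def within_k(positions_a, positions_b, k):
    """True if some occurrence of A lies within k positions of some B."""
    i, j = 0, 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position to close the gap.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


tokens = "employment in a safe place of work requires a place for rest".split()
employment = positions(tokens, "employment")  # [0]
place = positions(tokens, "place")            # [4, 9]
print(within_k(employment, place, 3))  # False
print(within_k(employment, place, 4))  # True
```

To answer such queries from an index rather than from the raw text, the postings for each term would need to store these positions per document.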
• Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http://www.westlaw.com/) is the largest commercial legal search service (in terms of the number of paying subscribers), with over half a million subscribers performing millions of searches a day over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users, although ranked free text querying (called “Natural Language” by Westlaw) was added in 1992. Here are some example Boolean queries on Westlaw:
– Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
Query: "trade secret" /s disclos! /s prevent /s employe!
– Information need: Requirements for disabled people to be able to access a workplace.
Query: disab! /p access! /s work-site work-place (employment /3 place)
– Information need: Cases about a host’s responsibility for drunk guests.
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
• Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest binding operator), & is AND, and /s, /p, and /k ask for matches in the same sentence, same paragraph or within k words respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trailing wildcard query (see Section 3.2, page 51); thus liab! matches all words starting with liab. Additionally work-site matches any of worksite, work-site or work site; see Section 2.2.1 (page 22). Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user.
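The trailing wildcard can be illustrated with a prefix lookup over a sorted vocabulary. This is only a sketch of the idea and makes no claim about how Westlaw implements it; the vocabulary is made up and trailing_wildcard is a hypothetical helper name.

```python
from bisect import bisect_left


def trailing_wildcard(prefix, sorted_vocab):
    """Return all vocabulary words that start with `prefix`.

    `sorted_vocab` must be sorted in ascending order.
    """
    matches = []
    # Binary search for the first word that is >= the prefix.
    start = bisect_left(sorted_vocab, prefix)
    for word in sorted_vocab[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        matches.append(word)
    return matches


# Made-up vocabulary for illustration.
vocab = sorted(["guest", "host", "liability", "liable", "liar", "lib", "responsible"])
print(trailing_wildcard("liab", vocab))  # ['liability', 'liable']
```

Because words sharing a prefix are adjacent in sorted order, the lookup needs one binary search plus a scan over the matching words.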
• Here we just mention a few of the main additional things we would like to be able to do:
– We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words.
– It is often useful to search for compounds or phrases that denote a concept such as “operating system”. As the Westlaw examples show, we might also wish to do proximity queries such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents.
– A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents in which a term occurs more frequently. To be able to do this we need term frequency information in postings lists.
– Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or “rank”) the returned results. This requires having a mechanism for determining a document score which encapsulates how good a match a document is for a query.
• Exercise 1.12 [⋆] Write a query using Westlaw syntax which would find any of the words professor, teacher, or lecturer in the same sentence as a form of the verb explain.