TRANSCRIPT
WEB AND SOCIAL INFORMATION EXTRACTION
Lecturer: Prof. Paola Velardi
Lab lecturer and project coordinator: Dr. Giovanni Stilo
22/02/18
About this course
§ http://twiki.di.uniroma1.it/twiki/view/Estrinfo/WebHome (slides and course material)
§ PLEASE SIGN UP TO THE GOOGLE GROUP (to receive self-assessments and other course-related info)
§ Course is organized as follows:
§ 2/3 "standard" lectures
§ 1/3 Lab:
§ design of an IR system with Lucene,
§ using the Twitter API,
§ implementing crawlers,
§ presentation of the 2018 project
Lectures
§ Part I: web information retrieval
§ Architecture of an information retrieval system
§ Text processing, indexing
§ Ranking: vector space model, latent semantic indexing
§ Web information retrieval: browsing, scraping
§ Web information retrieval: link analysis (HITS, PageRank)
§ Part II: social network analysis
§ Modeling a social network: local and global measures
§ Community detection
§ Mining social networks:
§ opinion mining,
§ temporal mining,
§ user profiling and recommenders
Information Retrieval is:
§ Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from large collections (usually stored on computers).
§ "Usually" text, but more and more: images, videos, data, services, audio...
§ "Usually" unstructured (= no pre-defined model), but XML (and its dialects, e.g. VoiceXML), RDF, and HTML are "more structured" than txt or pdf
§ "Large" collections: how large? The Web! (The Indexed Web contains at least 50 billion pages)
IR vs. databases: structured vs. unstructured data
§ Structured data tends to refer to information in "tables"

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
Unstructured data
§ Typically refers to free-form text
§ Allows:
§ Keyword queries including operators, e.g. (information ∧ (retrieval ∨ extraction))
§ More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
Semi-structured data
§ In fact almost no data is "unstructured"
§ E.g., this slide has distinctly identified zones such as the Title and Bullets
§ This structure allows for "semi-structured" search such as: Title contains "data" AND Bullets contain "search"
§ Only plain txt format is truly unstructured (though even natural language does have a structure: a title, paragraphs, punctuation...)
Unstructured (text) vs. structured (database) data from 2007 to 2014 (exabytes)
[Chart: blue = unstructured, yellow = structured]
Not only text retrieval: other IR tasks
§ Clustering: Given a set of docs, group them into clusters based on their contents.
§ Classification: Given a set of topics, plus a new doc d, decide which topic(s) d belongs to (e.g. spam / no spam).
§ Information Extraction: Find all sentence "snippets" dealing with a given topic (e.g. company mergers)
§ Question Answering: deal with a wide range of question types including: facts, lists, definitions, How, Why, hypothetical, semantically constrained, and cross-lingual questions
§ Opinion Mining: Analyse/summarize sentiment in a text (e.g. TripAdvisor) (hot topic!)
§ All the above, applied to images, video, audio, social networks
Terminology
Searching: seeking specific information within a body of information. The result of a search is a set of hits (e.g. the list of web pages matching a query).
Browsing: unstructured exploration of a body of information (e.g. a web browser is software to traverse and retrieve info on the WWW).
Crawling: moving from one item to another following links, such as citations, references, etc.
Scraping: pulling specific content from web pages.
Terminology (2)
• Query: A string of text, describing the information that the user is seeking. Each word of the query is called a search term or keyword.
• A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols.
• Full text searching: Methods that compare the query with every word in the text, without distinguishing the function (meaning, part-of-speech, position) of the various words.
• Fielded searching: Methods that search on specific bibliographic or structural fields, such as author or heading.
Examples of Search Systems
Find file on a computer system (e.g., Spotlight for Mac).
Library catalog for searching bibliographic records about books and other objects (e.g., Library of Congress catalog).
Abstracting and indexing system to find research information about specific topics (e.g., Medline for medical information).
Web search service to find web pages (e.g., Google).
[Diagram: architecture of an IR system. A user need becomes a query through the User Interface; Text Operations produce the logical view of both query and documents; Query Operations and Indexing (run by the DB Manager Module over the Text Database) build an inverted file stored in the Index; Searching returns retrieved docs, Ranking returns ranked docs, refined via user feedback.]
More in detail (representation, indexing, comparison, ranking)
How many pages on the web in 2017?
how many page web 2017
How ∧ many ∧ page ∧ web ∧ 2017
Representation: a data structure describing the content of a document, e.g. tables (= vectors) or clouds.
Indexing: a data structure that improves the speed of word retrieval; it points at words in texts.
Sorting & ranking
When a user submits a query to a search system, the system returns a set of hits. With a large collection of documents, the set of hits may be very large.
The value to the user depends on the order in which the hits are presented.
Three main methods:
• Sorting the hits, e.g., by date (more recent are better... is this true?)
• Ranking the hits by similarity between query and document
• Ranking the hits by the importance of the documents
1. Document Representation
§ Objective: given a document in whatever format (txt, html, pdf...), provide a formal, structured representation of the document (e.g. a vector whose attributes are words, or a graph, or...)
§ Several steps from document downloading to the final selected representation
§ The most common representation model is "bag of words"
The bag-of-words model
di = (..., after, ..., attend, ..., both, ..., build, ..., before, ..., center, ..., college, ..., computer, ..., dinner, ..., university, ..., work)
WORD ORDER DOES NOT MATTER!
Bag of Words Model
§ This is the most common way of representing documents in information retrieval
§ Variants of this model include (more details later):
§ How to weight a word within a document (boolean, tf*idf, etc.)
§ Boolean: 1 if the word i is in document j, 0 else
§ Tf*idf and others: the weight is a function of the word's frequency in the document, and of the frequency of documents with that word
§ What is a "word":
§ single, inflected word ("going"),
§ lemmatised word (going, go, gone → go)
§ multi-words, proper nouns, numbers, dates ("board of directors", "John Wayne", "April, 2010")
§ meaning: (plan, project, design → PLAN#03)
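The weighting variants above can be sketched in a few lines (a toy illustration; the function name and plain whitespace tokenization are my own choices, not the course's):

```python
from collections import Counter

def bag_of_words(text, binary=False):
    """Toy bag-of-words: lowercase, split on whitespace, weight terms.
    Real systems apply the full processing pipeline described next."""
    counts = Counter(text.lower().split())
    if binary:
        return {term: 1 for term in counts}   # boolean weighting
    return dict(counts)                        # raw term-frequency weighting

doc = "the cat sat on the mat"
print(bag_of_words(doc))               # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
print(bag_of_words(doc, binary=True))  # word order is lost either way
```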
Phases in document processing (several steps from document downloading to the final selected representation):
1. Document parsing
2. Tokenization
3. Stopwords / Normalization
4. POS Tagging
5. Stemming
6. Deep NL Analysis
7. Indexing
Note that intermediate steps (POS tagging, stemming, deep analysis) can be skipped.
1. Document Parsing
Document parsing implies scanning a document and transforming it into a "bag of words", but: which words?
§ We need to deal with the format and language of each document.
§ What format is it in? pdf / word / excel / html?
§ What language is it in?
§ What character set is in use (latin, greek, chinese...)?
Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically...
Sec. 2.1
(Doc parsing) Complications: format/language
§ Documents being indexed can include docs from many different languages
§ A single index may have to contain terms of several languages.
§ Sometimes a document or its components can contain multiple languages/formats
§ ex: a French email with a German pdf attachment.
§ What is a "unit" document?
§ A single file? A zipped group of files?
§ An email/message?
§ An email with 5 attachments?
§ A single web page or a full web site?
2. Tokenization
§ Input: "Friends, Romans and Countrymen"
§ Output: Tokens
§ Friends
§ Romans
§ Countrymen
§ A token is an instance of a sequence of characters
§ Each such token is now a candidate for an index entry, after further processing (described later)
§ But which are valid tokens to emit?
Sec. 2.2.1
2. Tokenization (cont'd)
§ Issues in tokenization:
§ Finland's capital → Finland? Finlands? Finland's?
§ Hewlett-Packard → Hewlett and Packard as two tokens?
§ state-of-the-art: break up the hyphenated sequence?
§ co-education
§ lowercase, lower-case, lower case?
§ San Francisco: one token or two?
§ How do you decide it is one token?
§ cheap San Francisco-Los Angeles fares
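A minimal tokenizer sketch for the issues above (the regex is one possible rule set, an assumption rather than the course's: it keeps word-internal apostrophes and hyphens, so it answers the Finland's / Hewlett-Packard questions one particular way):

```python
import re

# A token = a run of alphanumerics, optionally joined by internal
# apostrophes or hyphens, so "Finland's" and "Hewlett-Packard"
# each survive as a single candidate token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['\-][A-Za-z0-9]+)*")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']
print(tokenize("Hewlett-Packard's state-of-the-art lab"))
```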
2. Tokenization: Numbers
§ 3/12/91   Mar. 12, 1991   12/3/91
§ 55 B.C.
§ B-52
§ (800) 234-2333
§ 1Z9999W99845399981 (package tracking numbers)
§ Often have embedded spaces (ex. IBAN/SWIFT)
§ Older IR systems may not index numbers, since their presence greatly expands the size of the vocabulary
§ IR systems often index document "meta-data" separately
§ Creation date, format, etc.
2. Tokenization: language issues
§ French & Italian apostrophes:
§ L'ensemble → one token or two?
§ L? L'? Le?
§ We may want l'ensemble to match with un ensemble
§ German noun compounds are not segmented:
§ Lebensversicherungsgesellschaftsangestellter
§ 'life insurance company employee'
§ German retrieval systems benefit greatly from a compound splitter module
2. Tokenization: language issues
§ Chinese and Japanese have no spaces between words:
§ 莎拉波娃现在居住在美国东南部的佛罗里达。
§ A unique tokenization is not always guaranteed
§ Further complicated in Japanese, with multiple alphabets intermingled (Katakana, Hiragana, Kanji, Romaji)
§ Dates/amounts in multiple formats:
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
2. Tokenization: language issues
§ Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
§ Words are separated, but letter forms within a word form complex ligatures
§ ← → ← → ← start
§ 'Algeria achieved its independence in 1962 after 132 years of French occupation.'
§ Bidirectionality is not a problem if text is coded in Unicode.
3.1 Stop words
§ With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
§ They have little semantic content: the, a, and, to, be
§ There are a lot of them: ~30% of postings for the top 30 words
§ Stop word elimination used to be standard in older IR systems.
§ But the trend is away from doing this:
§ Good compression techniques mean the space for including stop words in a system is very small, so removing them is not a big deal
§ Good query optimization techniques mean you pay little at query time for including stop words.
§ You need them for:
§ Phrase queries: "King of Denmark"
§ Various song/book titles, etc.: "Let it be", "To be or not to be"
§ "Relational" queries: "flights to London" vs. "flights from London"
Sec. 2.2.2
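A quick sketch of stop word elimination, showing exactly why the phrase and "relational" queries above suffer (the tiny stop list is illustrative, not a real one):

```python
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "in", "or"}   # tiny illustrative list

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stopwords(["To", "be", "or", "not", "to", "be"]))
# ['not'] : the title "To be or not to be" is no longer findable as a phrase
print(remove_stopwords(["flights", "to", "London"]))
# ['flights', 'London'] : the "to" vs. "from" distinction is lost
```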
3.2 Normalization to terms
§ We need to "normalize" words in indexed text as well as query words into the same form
§ We want to match U.S.A. and USA
§ The result is terms: a term is a (normalized) word type, which is a single entry in our IR system dictionary
§ We most commonly implicitly define equivalence classes of terms by, e.g.,
§ deleting periods to form a term
§ U.S.A., USA → USA
§ deleting hyphens to form a term
§ anti-discriminatory, antidiscriminatory → antidiscriminatory
§ Synonyms (this is rather more complex...)
§ car, automobile
Sec. 2.2.3
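The period- and hyphen-deleting equivalence classes can be sketched as follows (an illustration only; real systems use more careful, language-aware rules):

```python
def normalize(term):
    """Illustrative equivalence classing: delete periods and hyphens,
    then case-fold (anticipating the case-folding slide)."""
    return term.replace(".", "").replace("-", "").lower()

print(normalize("U.S.A."))                       # usa
print(normalize("anti-discriminatory"))          # antidiscriminatory
print(normalize("USA") == normalize("U.S.A."))   # True: same dictionary entry
```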
3.2 Normalization: other languages
§ Accents: e.g., French résumé vs. resume.
§ Umlauts: e.g., German: Tuebingen vs. Tübingen
§ Should be equivalent
§ Most important criterion:
§ How are your users likely to write their queries for these words?
§ Even in languages that standardly have accents, users often may not type them
§ Often best to normalize to a de-accented term
§ Tuebingen, Tübingen, Tubingen → Tubingen
3.2 Normalization: other languages
§ Normalization of other strings like date forms:
§ 7月30日 vs. 7/30
§ Japanese use of kana vs. Chinese characters
§ Tokenization and normalization may depend on the language and so are interwoven with language detection
§ Crucial: need to "normalize" indexed text as well as query terms into the same form
§ Example: "Morgen will ich in MIT..." — is this the German word "mit"?
3.2 Case folding
§ Reduce all letters to lowercase
§ exception: uppercase in mid-sentence
§ e.g., General Motors
§ Fed vs. fed
§ MIT vs. mit
§ Often best to lowercase everything, since users will use lowercase regardless of 'correct' capitalization...
§ This may cause different senses to be merged. Often the most relevant is simply the most frequent on the WEB, rather than the most intuitive.
3.2 Normalization: Synonyms
§ Do we handle synonyms and homonyms?
§ E.g., by hand-constructed equivalence classes:
§ car = automobile, color = colour
§ We can rewrite to form equivalence-class terms
§ When the document contains automobile, index it under car-automobile (and vice-versa)
§ Or we can expand a query
§ When the query contains automobile, look under car as well
§ What about spelling mistakes?
§ One approach is Soundex, a phonetic algorithm that encodes homophones (= same sound) to the same representation so that they can be matched despite minor differences in spelling
§ Google → Googol
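A sketch of classic Soundex (four-character codes: first letter plus three digits; implementations differ in small details, so treat this as one common variant rather than the definitive algorithm):

```python
# Letter groups 1-6 of classic Soundex; vowels, h, w, y carry no code.
SOUNDEX_CODES = {c: str(d) for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(word):
    """First letter kept; remaining letters coded 1-6; adjacent equal codes
    collapse; vowels reset the previous code while h/w are transparent;
    result padded/truncated to 4 characters."""
    word = word.lower()
    result = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]

print(soundex("Google"), soundex("Googol"))   # G240 G240 : matched despite spelling
```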
4. Stemming / Lemmatization
§ Reduce inflectional/variant forms to the base form
§ E.g.,
§ am, are, is → be
§ car, cars, car's, cars' → car
§ the boy's cars are different colors → the boy car be different color
§ Lemmatization implies doing "proper" reduction to the dictionary form (the lemma).
§ Relatively simple for English, more complex for highly inflected languages (Italian, German...)
Sec. 2.2.4
4. Stemming
§ Reduce terms to their "roots" before indexing
§ "Stemming" suggests crude affix chopping
§ language dependent
§ e.g., automate(s), automatic, automation all reduced to automat.
§ For example, compressed and compression are both accepted as equivalent to compress:
"for exampl compress and compress ar both accept as equival to compress"
Porter's algorithm
§ Commonest algorithm for stemming English
§ Results suggest it's at least as good as other stemming options
§ Conventions + 5 phases of reductions
§ phases applied sequentially
§ each phase consists of a set of commands
§ sample convention: out of multiple applying rules in a command, select the one that applies to the longest suffix.
§ E.g. caresses → caress rather than car
Typical rules (commands) in the Porter stemmer:
§ sses → ss    caresses → caress
§ ies → i      ponies → poni
§ ss → ss      caress → caress
§ s →          cats → cat
§ Weight-of-word-sensitive rules:
§ (m > 1) EMENT →
§ replacement → replac
§ cement → cement
§ It means: the root must be longer than 1 to apply the rule.
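The sses/ies/ss/s rules above can be sketched as ordered rewrite commands (a drastically simplified illustration of the idea, without Porter's measure condition m; the rule list is my own reduction, not the full algorithm):

```python
# Ordered rewrite rules (suffix, replacement); first match wins,
# with longer suffixes listed before their sub-suffixes.
RULES = [("sses", "ss"), ("ational", "ate"), ("ies", "i"), ("ss", "ss"),
         ("ion", ""), ("ing", ""), ("ed", ""), ("s", "")]

def crude_stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats", "compressed", "compression"]:
    print(w, "->", crude_stem(w))
# compressed and compression both map to compress, as on the slide
```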
Three stemmers: a comparison
Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
Porter's: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to pictur of express that is more biolog transpar and access to interpret
Lovins's: such an analys can rev featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
Paice's: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and acces to interpret
5. Deep Natural Language Analysis
§ Has to do with more detailed Natural Language Processing algorithms
§ E.g. semantic disambiguation, phrase indexing (board of directors), named entities (President Obama = Barack Obama), etc.
§ Standard search engines increasingly use deeper techniques (e.g. Google's Knowledge Graph http://www.google.com/insidesearch/features/search/knowledge.html and RankBrain)
§ More (on deep NLP techniques) in the NLP course (not here)!
Why indexing
§ The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
§ Without an index, the search engine would scan every document in the document archive, which would require considerable time and computing power (especially if the archive = the full web content).
§ For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.
Inverted index
For each term, we have a list that records which documents the term occurs in. The list is called a posting list.
What happens if the word Caesar is added to, e.g., document #14?
We need variable-size postings lists (document content is often dynamic!)
Sec. 1.2
Inverted index construction
[Diagram: documents to be indexed ("Friends, Romans, Countrymen.") pass through a Tokenizer (token stream: Friends Romans Countrymen), then Linguistic modules (modified tokens: friend roman countryman), then the Indexer, which produces the inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16.]
Indexer steps: token sequence
§ Sequence of (modified token, documentID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Indexer steps: Dictionary & Postings
§ Multiple term entries in a single document are merged.
§ Split into Dictionary and Postings
§ Document frequency information is added. (Why frequency? We will discuss this later.)
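Dictionary-and-postings construction for the two example documents can be sketched as follows (a toy in-memory index; real indexers sort (token, docID) pairs on disk, and the earlier pipeline steps are skipped here):

```python
from collections import defaultdict

docs = {1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
        2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)           # duplicate terms within a doc merge here

# Dictionary entry = term + document frequency; postings = sorted docIDs.
postings = {term: sorted(ids) for term, ids in index.items()}
print(postings["caesar"], len(postings["caesar"]))   # [1, 2] 2
print(postings["ambitious"])                          # [2]
```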
Dictionary data structures for inverted indexes
§ The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list... in what data structure?
Sec. 3.1
Array
§ As we have just shown, the simplest structure is an array:
§ Searching (lookup) for an item is O(log(n)) (binary search on a sorted array); inserting an item is O(n)
§ Can we store a dictionary in memory more efficiently?
Alternative dictionary data structures
§ Two main choices:
§ Hash tables
§ Trees
§ Some IR systems use hash tables, some trees
Hash tables (1)
1. Each vocabulary term is hashed to (mapped onto) an integer (we assume you have seen hash tables before)
§ E.g. the index for a specific keyword will be equal to the sum of the ASCII values of its characters, multiplied by their respective order in the string, after which it is taken modulo 2069 (a prime number).
Hash tables (2)
§ 2. The keyword is stored in the hash table, where it can be quickly retrieved using the hashed key.
Hash tables (4)
§ To achieve a good hashing mechanism, it is important to have a good hash function with the following basic requirements:
§ Easy to compute: it should be easy to compute and must not become an algorithm in itself.
§ Uniform distribution: it should provide a uniform distribution across the hash table and should not result in clustering.
§ Few collisions: collisions occur when pairs of elements are mapped to the same hash value. These should be avoided.
Note: irrespective of how good a hash function is, collisions occur. Therefore, to maintain the performance of a hash table, it is important to manage collisions through various collision resolution techniques.
Hash tables (5)
§ Separate chaining is one of the most commonly used collision resolution techniques.
§ It is usually implemented using linked lists.
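Putting the slide's hash function and separate chaining together (a sketch; the bucket layout and tuple format are my own choices, and Python lists stand in for linked lists):

```python
TABLE_SIZE = 2069                          # prime table size from the slide
table = [[] for _ in range(TABLE_SIZE)]    # separate chaining: one bucket per slot

def hash_term(term):
    """Slide's hash: sum of ASCII codes weighted by 1-based position, mod 2069."""
    return sum(i * ord(c) for i, c in enumerate(term, start=1)) % TABLE_SIZE

def insert(term, postings):                # O(1) average
    table[hash_term(term)].append((term, postings))

def lookup(term):                          # scans a single chain: O(1) average
    for stored_term, postings in table[hash_term(term)]:
        if stored_term == term:
            return postings
    return None

insert("brutus", [1, 2])
insert("caesar", [1, 2])
print(lookup("caesar"))   # [1, 2]
print(lookup("cesar"))    # None (never inserted)
```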
Hash tables (6)
§ Pros:
§ Lookup is faster: O(1) on average (depends on the probability of collisions and on collision handling). Insertion is also O(1) on average.
§ Can reduce storage requirements if n (the number of keywords) is much smaller than the universe of keys U
§ Cons:
§ good, general-purpose hash functions are very difficult to find
§ a static table size requires costly resizing if the indexed set is highly dynamic
§ search performance degrades considerably as the table nears its capacity (too many collisions)
Tree: B-tree (balanced trees)
A B-tree of order m is an m-way tree that satisfies the following conditions:
§ Every node has ≤ m children.
§ Every internal node (except the root) has k ≥ m/2 children.
§ The root has ≥ 2 children.
§ An internal node with k children contains (k-1) ordered keys. Its leftmost child contains keys less than or equal to the first key in the node. The second child contains keys greater than the first key but less than or equal to the second key, and so on.
[Example: a root node splitting the dictionary into the ranges a-hu, hy-m, n-z]
Inserting into a B-Tree
§ To insert key value x into a B-Tree:
§ Use the B-Tree search to determine on which leaf node to make the insertion.
§ Insert x into the appropriate position on that leaf node.
§ If the resulting number of keys on that node is within capacity, then simply output that node and return.
§ Otherwise, split the node.
B-Trees: pros and cons
§ Trees require a standard ordering of characters and hence of strings... but we typically have one
§ Pros:
§ Size not limited as for hashing
§ Cons:
§ Search is slower (w.r.t. hash tables): O(h) where h = log(n) is the depth of the B-tree [and this requires a balanced tree]
§ B-trees mitigate the rebalancing problem (w.r.t. standard binary trees)
Summary
§ Once keywords are extracted, we create posting lists
§ How do we search/insert a keyword? Array, hash tables, trees (B-trees, B+trees)
§ The next problem is: how do we process a query (= searching documents with some combination of keywords)?
Sec. 1.3
Query processing: AND
§ Consider processing the query: Brutus AND Caesar
§ Locate Brutus in the Dictionary;
§ Retrieve its postings (e.g. pointers to documents including Brutus).
§ Locate Caesar in the Dictionary;
§ Retrieve its postings.
§ "Merge" the two postings:
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
The "merge" operation
§ Walk through the two postings simultaneously from left to right, in time linear in the total number of postings entries
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Result: 2 → 8
If list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings are sorted by docID.
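The merge can be written directly from this description (linear-time intersection of two docID-sorted postings lists, using the lists from the slide):

```python
def intersect(p1, p2):
    """Linear-time merge of two docID-sorted postings lists: O(x + y)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```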
Optimization of index search
§ What is the best order of words for query processing?
§ Consider a query that is an AND of n terms.
§ For each of the n terms, get its postings, then AND them together.
Brutus:    2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:    1 → 2 → 3 → 5 → 8 → 16 → 21 → 34
Calpurnia: 13 → 16
Query: Brutus AND Calpurnia AND Caesar
Query optimization example
§ Process words in order of increasing frequency:
§ start with the smallest set (the word with the smallest list), then keep cutting further.
§ This is why we kept document frequency in the dictionary.
Execute the query as (Calpurnia AND Brutus) AND Caesar.
Brutus:    2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:    1 → 2 → 3 → 5 → 8 → 16 → 21 → 34
Calpurnia: 13 → 16
More general optimization
§ e.g., (madding OR crowd) AND (ignoble OR strife)
§ Get document frequencies for all terms.
§ Estimate the size of each OR by the sum of its doc. frequencies (conservative).
§ Process the ANDs in increasing order of OR sizes.
Example
§ Recommend a query processing order for:
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term          Freq
eyes          213312
kaleidoscope   87009
marmalade     107913
skies         271658
tangerine      46653
trees         316812

OR sizes: kaleidoscope OR eyes = 300321; tangerine OR trees = 363465; marmalade OR skies = 379571.
Answer: process as (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies).
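The recommended order can be computed mechanically from the frequency table, sorting the ORs by their summed document frequencies as the previous slide prescribes:

```python
freq = {"eyes": 213312, "kaleidoscope": 87009, "marmalade": 107913,
        "skies": 271658, "tangerine": 46653, "trees": 316812}

query = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

# Conservative size estimate for each OR = sum of document frequencies;
# then process the ANDs in increasing order of these estimates.
by_size = sorted(query, key=lambda pair: freq[pair[0]] + freq[pair[1]])
for a, b in by_size:
    print(f"({a} OR {b})", freq[a] + freq[b])
# (kaleidoscope OR eyes) 300321
# (tangerine OR trees) 363465
# (marmalade OR skies) 379571
```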
Skip pointers
§ Intersection is the most important operation when it comes to search engines.
§ This is because in web search, most queries are implicitly intersections: e.g. "car repairs", "britney spears songs" etc. translate into "car AND repairs", "britney AND spears AND songs", which means intersecting 2 or more postings lists in order to return a result.
§ Because intersection is so crucial, search engines try to speed it up in any possible way. One such way is to use skip pointers.
Augment postings with skip pointers (at indexing time)
§ Idea: move the cursor on a postings list to the first posting whose docID is equal to or larger than the one being compared.
§ Why? To avoid unnecessary comparisons.
§ Where do we place skip pointers?
List 1: 2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers to 41 and 128
List 2: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers to 11 and 31
Sec. 2.3
Example: Step 1
Some items have two pointers: one pointing to the adjacent item, the other skipping a few items ahead.
Step 2
Say we are comparing p2 = 8 and p1 = 34: since 8 < 34, we must move the p2 pointer to the right. But 8 has a skip! So we first compare the docID of the skip target (31) with the docID of p1 (34).
Another example
Example: start using the normal intersection algorithm. Continue until the match 12 and advance to the next item in each list (48, 13). At this point the "car" list is on 48 and the "repairs" list is on 13, but 13 has a skip pointer. Check the value the skip pointer points at (i.e. 29): if this value is less than the current value of the "car" list (which is 48 in our example), we follow the skip pointer and jump to this value in the list. It would be useless to compare all the elements between the current one and the skip target!
Where do we place skips?
§ Tradeoff:
§ More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
§ Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.
Placing skips
§ Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
§ This ignores the distribution of query terms.
§ Easy if the index is relatively static; harder if L keeps changing because of updates.
§ How much do skip pointers help?
§ Traditionally, CPUs were slow, so they used to help a lot.
§ But today's CPUs are fast and disk is slow, so reducing disk postings list size dominates.
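A sketch combining the √L placement heuristic with skip-aware intersection (representing skip pointers as an index-to-index map is my own choice, one of several possible layouts):

```python
import math

def skips_for(postings):
    """Roughly sqrt(L) evenly spaced skip pointers, as index -> target index."""
    step = int(math.sqrt(len(postings))) or 1
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = skips_for(p1), skips_for(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                i = s1[i]          # useful skip: jump ahead, loop re-compares
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                j = s2[j]
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))   # [2, 8]
```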
Phrase queries
§ Want to be able to answer queries such as "red brick house" as a phrase
§ red AND brick AND house matches phrases such as "red house near the brick factory", which is not what we are searching for
§ The concept of phrase queries has proven easily understood by users; it is one of the few "advanced search" ideas that works
§ About 10% of web queries are phrase queries.
§ For this, it no longer suffices to store only <term: docs> entries
A first attempt: bi-word indexes
§ Index every consecutive pair of terms in the text as a phrase
§ For example the text "Friends, Romans, Countrymen" would generate the biwords:
§ friends romans
§ romans countrymen
§ Each of these biwords is now a dictionary term
§ Two-word phrase query processing is now immediate.
Sec. 2.4.1
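Biword generation is immediate (a one-line sketch):

```python
def biwords(tokens):
    """Index every consecutive pair of terms as one phrase term."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']
```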
Longer phrase queries
§ Longer phrases are processed using bi-words:
§ stanford university palo alto can be broken into the Boolean query on biwords:
stanford university AND university palo AND palo alto
Extended biwords
§ Parse the indexed text and perform part-of-speech tagging (POS tagging).
§ Identify Nouns (N) and articles/prepositions (X).
§ Call any string of terms of the form N X* N (a regex) an extended biword (a noun, followed by zero or more articles/prepositions, followed by a noun).
§ Each such extended biword is now made a term in the dictionary.
§ Example: catcher in the rye → N X X N
§ Query processing: parse the query into N's and X's
§ Segment the query into enhanced biwords
§ Look up in the index: catcher rye (N N)
Issues for biword indexes
§ Index blowup due to a bigger dictionary
§ Infeasible for more than biwords, big even for them
§ Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy
Solution 2: positional indexes
§ Positional indexes are a more efficient alternative to biword indexes.
§ In a non-positional index, each posting is a document ID
§ In a positional index, each posting is a docID and a list of positions:
<term, number of docs containing term; doc1: position1, position2 ...; doc2: position1, position2 ...; etc.>
Sec. 2.4.2
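A positional index can be sketched as a nested mapping, term → docID → sorted positions (a toy in-memory version with 1-based positions; the builder function is my own illustration):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: docID -> token list.  Returns term -> {docID: [positions]} (1-based)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

idx = build_positional_index({1: "to be or not to be".split()})
print(idx["to"])   # {1: [1, 5]}
print(idx["be"])   # {1: [2, 6]}
```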
Proximity search
§ We just saw how to use a positional index for phrase searches (phrase: sequence of consecutive words).
§ We can also use it for proximity search.
§ For example: employment /4 place: find all documents that contain EMPLOYMENT and PLACE within 4 words of each other.
§ "Employment agencies that place health care workers are seeing growth" is a hit.
§ "Employment agencies that have learned to adapt now place health care workers" is not a hit.
Proximity search
§ Use the positional index
§ Simplest algorithm: look at the cross-product of positions of (i) EMPLOYMENT in the document and (ii) PLACE in the document
§ Very inefficient for frequent words, especially stop words
§ Note that we want to return the actual matching positions, not just a list of documents.
Proximity intersection
An algorithm for proximity intersection of postings lists p1 and p2: the algorithm finds places where the two terms appear within k words of each other and returns a list of triples giving the docID and the term positions in p1 and p2.
Example (search for a, b at max distance k = 2)
Positions: 1 2 3 4 5 6 7 8 9
Text:      a x b x x b a x b
§ I = <3> (pos(b)); output <1, 1, 3> (<docID, pos(a), pos(b)>)
§ I = <3, 6>
§ I = <6>; output <1, 7, 6>
§ etc.
Positional index size
§ Need an entry for each occurrence, not just once per document
§ Index size depends on average document size
§ The average web page has < 1000 terms
§ SEC filings, books, even some epic poems... easily 100,000 terms
§ Consider a term with frequency 0.1%:

Document size   Postings   Positional postings
1000            1          1
100,000         1          100
Positional index size
§ A positional index expands postings storage substantially
§ some rough rules of thumb are to expect a positional index to be 2 to 4 times as large as a non-positional index
§ Positional indexes are now standardly used because of the power and usefulness of phrase and proximity queries
Combined scheme
§ Biword indexes and positional indexes can be profitably combined.
§ Many biwords are extremely frequent: Michael Jackson, Britney Spears, etc.
§ For these biwords, the increased speed compared to positional postings intersection is substantial.
§ Combination scheme: include frequent biwords as vocabulary terms in the index. Do all other phrases by positional intersection.
Google indexing system
§ Google is changing the way it handles its index continuously
§ See a history at: http://moz.com/google-algorithm-change
Caffeine + Panda, Google Index
§ Major recent changes have been Caffeine & Panda
§ Caffeine:
§ The old index had several layers, some of which were refreshed at a faster rate than others (they had different indexes); the main layer would update every couple of weeks ("Google dance")
§ Caffeine analyzes the web in small portions and updates the search index on a continuous basis, globally. As new pages are found, or new information on existing pages, these are added straight to the index.
§ Panda: aims to promote high-quality content sites by dooming the rank of low-quality content sites.
§ Note: Caffeine is part of the INDEXING strategy, not searching (later in this course)