
WEB AND SOCIAL INFORMATION EXTRACTION

Lecturer: Prof. Paola Velardi
Lab lecturer and project coordinator: Dr. Giovanni Stilo

About this course
§ http://twiki.di.uniroma1.it/twiki/view/Estrinfo/WebHome (slides and course material)
§ PLEASE SIGN UP TO THE GOOGLE GROUP (to receive self-assessments and other course-related info)
§ The course is organized as follows:
  § 2/3 "standard" lectures
  § 1/3 lab:
    § design of an IR system with Lucene
    § using the Twitter API
    § implementing crawlers
    § presentation of the 2018 project

Lectures
§ Part I: web information retrieval
  § Architecture of an information retrieval system
  § Text processing, indexing
  § Ranking: vector space model, latent semantic indexing
  § Web information retrieval: browsing, scraping
  § Web information retrieval: link analysis (HITS, PageRank)
§ Part II: social network analysis
  § Modeling a social network: local and global measures
  § Community detection
  § Mining social networks:
    § opinion mining
    § temporal mining
    § user profiling and recommenders

PART I. INFORMATION RETRIEVAL: DEFINITION AND ARCHITECTURE

Information Retrieval is:
§ Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from large collections (usually stored on computers).
§ "Usually" text, but more and more: images, videos, data, services, audio...
§ "Usually" unstructured (= no pre-defined model), but XML (and its dialects, e.g. VoiceXML), RDF and HTML are "more structured" than txt or pdf
§ "Large" collections: how large? The Web! (The indexed Web contains at least 50 billion pages)

Indexed pages (Google): 50 billion web pages! Bing indexes only around 5 billion.

IR vs. databases: structured vs. unstructured data
§ Structured data tends to refer to information in "tables":

  Employee  Manager  Salary
  Smith     Jones    50000
  Chang     Smith    60000
  Ivy       Smith    50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

Unstructured data
§ Typically refers to free-form text
§ Allows:
  § keyword queries including operators: (information ∧ (retrieval ∨ extraction))
  § more sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse

Semi-structured data
§ In fact almost no data is "unstructured"
§ E.g., this slide has distinctly identified zones such as the Title and the Bullets
§ This structure allows for "semi-structured" search such as: Title contains "data" AND Bullets contain "search"
§ Only plain txt format is truly unstructured (though even natural language does have a structure: a title, paragraphs, punctuation...)

Unstructured (text) vs. structured (database) data from 2007 to 2014 (exabytes)
[Chart: blue = unstructured, yellow = structured; total (unstructured) enterprise data growth 2005-2015]
The business is now unstructured data!

Not only text retrieval: other IR tasks
§ Clustering: given a set of docs, group them into clusters based on their contents.
§ Classification: given a set of topics, plus a new doc d, decide which topic(s) d belongs to (e.g. spam / no spam).
§ Information Extraction: find all sentence "snippets" dealing with a given topic (e.g. company mergers)
§ Question Answering: deal with a wide range of question types including: facts, lists, definitions, How, Why, hypothetical, semantically constrained, and cross-lingual questions
§ Opinion Mining: analyse/summarize the sentiment in a text (e.g. TripAdvisor) (hot topic!!)
§ All of the above, applied to images, video, audio, social networks

Terminology

Searching: seeking specific information within a body of information. The result of a search is a set of hits (e.g. the list of web pages matching a query).
Browsing: unstructured exploration of a body of information (e.g. a web browser is software to traverse and retrieve info on the WWW).
Crawling: moving from one item to another by following links, such as citations, references, etc.
Scraping: pulling specific content from web pages.

Terminology (2)

•  Query: A string of text, describing the information that the user is seeking. Each word of the query is called a search term or keyword.

•  A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols.

•  Full text searching: Methods that compare the query with every word in the text, without distinguishing the function (meaning, part-of-speech, position) of the various words.

•  Fielded searching: Methods that search on specific bibliographic or structural fields, such as author or heading.


Examples of Search Systems

Find file on a computer system (e.g., Spotlight for Mac).

Library catalog for searching bibliographic records about books and other objects (e.g., Library of Congress catalog).

Abstracting and indexing system to find research information about specific topics (e.g., Medline for medical information).

Web search service to find web pages (e.g., Google).

[Screenshots: find file (Spotlight), library catalogue, abstracting & indexing (Medline), web search (Google)]

ARCHITECTURE OF AN IR SYSTEM

Inside the IR Black Box

[Diagram (the classic IR architecture): the user need enters through the User Interface as a query; Text Operations produce the logical view of both documents and queries; Query Operations feed Searching, which retrieves docs from the inverted-file Index; Ranking orders the retrieved docs and returns ranked docs to the user, possibly refined via user feedback; on the indexing side, the Indexing module and the DB Manager Module build the inverted file from the Text Database.]

More in detail (representation, indexing, comparison, ranking)

Example: the user need "How many pages on the web in 2017?" becomes the query "how many page web 2017", processed as How ∧ many ∧ page ∧ web ∧ 2017.

Representation: a data structure describing the content of a document, e.g. tables (= vectors) or word clouds.

Indexing: a data structure that improves the speed of word retrieval; it points at the words in the texts.

Sorting & Ranking: how well does a retrieved document match the user's needs? (Example query: "Eclipse".)

Sorting & ranking
When a user submits a query to a search system, the system returns a set of hits. With a large collection of documents, the set of hits may be very large. The value to the user depends on the order in which the hits are presented. Three main methods:
• Sorting the hits, e.g., by date (more recent is better... is this true?)
• Ranking the hits by the similarity between query and document
• Ranking the hits by the importance of the documents

More details on:
§ Representation
§ Indexing
§ Ranking
(next 4-6 lessons)

1. Document Representation

§ Objective: given a document in whatever format (txt, html, pdf...), provide a formal, structured representation of the document (e.g. a vector whose attributes are words, or a graph, or...)
§ Several steps from document downloading to the final selected representation
§ The most common representation model is "bag of words"

The bag-of-words model
d_i = (..., after, ..., attend, ..., both, ..., build, ..., before, ..., center, ..., college, ..., computer, ..., dinner, ..., university, ..., work)
WORD ORDER DOES NOT MATTER!!!

Bag of Words Model
§ This is the most common way of representing documents in information retrieval
§ Variants of this model include (more details later):
  § How to weight a word within a document (boolean, tf*idf, etc.):
    § Boolean: 1 if word i is in document j, 0 else
    § Tf*idf and others: the weight is a function of the word's frequency in the document, and of the frequency of documents with that word
  § What is a "word":
    § a single, inflected word ("going")
    § a lemmatised word (going, go, gone → go)
    § multi-words, proper nouns, numbers, dates ("board of directors", "John Wayne", "April, 2010")
    § meanings (plan, project, design → PLAN#03)
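As a concrete illustration of the weighting variants above, here is a minimal Python sketch (mine, not course material; the lab itself uses Lucene). It computes boolean and tf*idf weights over a toy two-document collection, using the common tf × log(N/df) formulation; the exact weighting schemes are discussed later in the course.

```python
import math
from collections import Counter

docs = ["the cat sat on the mat", "the cat ate the dog food"]  # toy collection
N = len(docs)
tokenized = [d.split() for d in docs]

# document frequency: in how many docs does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))

for j, doc in enumerate(tokenized):
    tf = Counter(doc)                                   # term frequency in doc j
    boolean = {w: 1 for w in tf}                        # 1 if word i is in doc j
    tfidf = {w: tf[w] * math.log(N / df[w]) for w in tf}
    print(j, boolean, tfidf)
```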

The Bag of Words model is also used for images ("words" are now image features).

Phases in document processing (several steps from document downloading to the final selected representation):
1. Document parsing
2. Tokenization
3. Stopwords / normalization
4. POS tagging
5. Stemming
6. Deep NL analysis
7. Indexing
Note that intermediate steps can be skipped (e.g. POS tagging, stemming and deep analysis).

1. Document Parsing (Sec. 2.1)
Document parsing implies scanning a document and transforming it into a "bag of words". But which words? We need to deal with the format and language of each document:
§ What format is it in? pdf/word/excel/html?
§ What language is it in?
§ What character set is in use (latin, greek, chinese...)?
Each of these is a classification problem, which we will study later in the course. But these tasks are often done heuristically...

(Doc parsing) Complications: format/language (Sec. 2.1)
§ Documents being indexed can include docs from many different languages
  § A single index may have to contain terms of several languages.
§ Sometimes a document or its components can contain multiple languages/formats
  § ex: a French email with a German pdf attachment.
§ What is a "unit" document?
  § A single file? A zipped group of files?
  § An email/message? An email with 5 attachments?
  § A single web page or a full web site?

2. Tokenization (Sec. 2.2.1)
§ Input: "Friends, Romans and Countrymen"
§ Output: tokens
  § Friends
  § Romans
  § Countrymen
§ A token is an instance of a sequence of characters
§ Each such token is now a candidate for an index entry, after further processing (described later)
§ But which are valid tokens to emit?
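A minimal tokenizer sketch in Python (illustrative only; it glosses over exactly the issues discussed in the next slides):

```python
import re

def tokenize(text: str) -> list[str]:
    # Crude heuristic: split on anything that is not a word character
    # or an apostrophe, and lower-case everything.
    return [t for t in re.split(r"[^\w']+", text.lower()) if t]

print(tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']
```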

2. Tokenization (cont'd) (Sec. 2.2.1)
§ Issues in tokenization:
  § Finland's capital → Finland? Finlands? Finland's?
  § Hewlett-Packard → Hewlett and Packard as two tokens?
  § state-of-the-art: break up the hyphenated sequence?
    § co-education
    § lowercase, lower-case, lower case?
  § San Francisco: one token or two?
    § How do you decide it is one token?
    § cheap San Francisco-Los Angeles fares

2. Tokenization: numbers (Sec. 2.2.1)
§ 3/12/91, Mar. 12, 1991, 12/3/91
§ 55 B.C.
§ B-52
§ (800) 234-2333
§ 1Z9999W99845399981 (a package tracking number)
§ Numbers often have embedded spaces (e.g. IBAN/SWIFT)
§ Older IR systems may not index numbers, since their presence greatly expands the size of the vocabulary
§ IR systems often index document "meta-data" separately: creation date, format, etc.

2. Tokenization: language issues (Sec. 2.2.1)
§ French & Italian apostrophes:
  § L'ensemble → one token or two?
    § L? L'? Le?
  § We may want l'ensemble to match with un ensemble
§ German noun compounds are not segmented:
  § Lebensversicherungsgesellschaftsangestellter ('life insurance company employee')
  § German retrieval systems benefit greatly from a compound splitter module

2. Tokenization: language issues (Sec. 2.2.1)
§ Chinese and Japanese have no spaces between words:
  § 莎拉波娃现在居住在美国东南部的佛罗里达。
  § A unique tokenization is not always guaranteed
§ Further complicated in Japanese, with multiple alphabets (Katakana, Hiragana, Kanji, Romaji) intermingled, and dates/amounts in multiple formats:
  § フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

2. Tokenization: language issues (Sec. 2.2.1)
§ Arabic (or Hebrew) is basically written right to left, but with certain items, like numbers, written left to right
§ Words are separated, but letter forms within a word form complex ligatures
§ 'Algeria achieved its independence in 1962 after 132 years of French occupation.' (reading direction alternates: ← → ← → ←, starting from the right)
§ Bidirectionality is not a problem if the text is coded in Unicode.

UNICODE standard

Tokenization on Google?

3.1 Stop words (Sec. 2.2.2)
§ With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  § They have little semantic content: the, a, and, to, be
  § There are a lot of them: ~30% of postings for the top 30 words
§ Stop word elimination used to be standard in older IR systems.
§ But the trend is away from doing this:
  § Good compression techniques mean the space needed to include stop words in a system is very small, so removing them is not a big deal
  § Good query optimization techniques mean you pay little at query time for including stop words.
  § You need them for:
    § Phrase queries: "King of Denmark"
    § Various song/book titles, etc.: "Let it be", "To be or not to be"
    § "Relational" queries: "flights to London" vs. "flights from London"

Stop words on Google?

3.2 Normalization to terms (Sec. 2.2.3)
§ We need to "normalize" the words in the indexed text, as well as the query words, into the same form
  § We want to match U.S.A. and USA
§ The result is terms: a term is a (normalized) word type, which is a single entry in our IR system dictionary
§ We most commonly implicitly define equivalence classes of terms by, e.g.:
  § deleting periods to form a term: U.S.A., USA → USA
  § deleting hyphens to form a term: anti-discriminatory, antidiscriminatory → antidiscriminatory
  § synonyms (this is rather more complex...): car, automobile
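A sketch of equivalence classing in Python (an illustration under the rules just listed; real normalizers are language- and collection-specific). The key point is that the same function must be applied to the indexed text and to the query terms:

```python
def normalize(token: str) -> str:
    # Map a token to the representative of its equivalence class.
    term = token.lower()           # case folding (see below)
    term = term.replace(".", "")   # delete periods:  U.S.A. -> usa
    term = term.replace("-", "")   # delete hyphens:  anti-discriminatory -> antidiscriminatory
    return term

assert normalize("U.S.A.") == normalize("USA")
assert normalize("anti-discriminatory") == normalize("antidiscriminatory")
```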

3.2 Normalization: other languages (Sec. 2.2.3)
§ Accents: e.g., French résumé vs. resume.
§ Umlauts: e.g., German Tuebingen vs. Tübingen
  § These should be equivalent
§ Most important criterion: how are your users likely to write their queries for these words?
§ Even in languages that standardly have accents, users often may not type them
  § Often best to normalize to a de-accented term: Tuebingen, Tübingen, Tubingen → Tubingen

3.2 Normalization: other languages (Sec. 2.2.3)
§ Normalization of other strings, like date forms:
  § 7月30日 vs. 7/30
  § Japanese use of kana vs. Chinese characters
§ Tokenization and normalization may depend on the language, and so are interweaved with language detection
  § "Morgen will ich in MIT ...": is this German "mit"?
§ Crucial: we need to "normalize" the indexed text as well as the query terms into the same form

Do we really want normalization?

3.2 Case folding (Sec. 2.2.3)
§ Reduce all letters to lower case
  § exception: upper case in mid-sentence
    § e.g., General Motors
    § Fed vs. fed
    § MIT vs. mit
§ Often best to lowercase everything, since users will use lower case regardless of 'correct' capitalization...
§ This may cause different senses to be merged... Often the most relevant sense is simply the most frequent one on the Web, rather than the most intuitive one

Case folding on Google? Cat, CAT and C.A.T. give the SAME RESULTS!!!

3.2 Normalization: synonyms
§ Do we handle synonyms and homonyms?
  § E.g., by hand-constructed equivalence classes: car = automobile, color = colour
  § We can rewrite to form equivalence-class terms: when the document contains automobile, index it under car-automobile (and vice-versa)
  § Or we can expand a query: when the query contains automobile, look under car as well
§ What about spelling mistakes?
  § One approach is Soundex, a phonetic algorithm that encodes homophones (= same sound) with the same representation, so that they can be matched despite minor differences in spelling
  § Google → Googol
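A sketch of a simplified variant of Soundex in Python: keep the first letter, map the remaining consonants to digit classes, collapse adjacent duplicates, drop vowels, and pad/truncate the code to 4 characters. (The full algorithm has extra rules for h/w that are omitted here.)

```python
def soundex(word: str) -> str:
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    encoded = word[0].upper()            # keep the first letter
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")         # vowels etc. map to ""
        if code and code != prev:        # collapse adjacent duplicates
            encoded += code
        prev = code
    return (encoded + "000")[:4]         # pad/truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))   # R163 R163 -> they match
```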

Synonyms on Google?

Spelling mistakes on Google?

4. Stemming / Lemmatization (Sec. 2.2.4)
§ Reduce inflectional/variant forms to a base form
§ E.g.,
  § am, are, is → be
  § car, cars, car's, cars' → car
  § the boy's cars are different colors → the boy car be different color
§ Lemmatization implies doing "proper" reduction to the dictionary form (the lemma).
§ Relatively simple for English, more complex for highly inflected languages (Italian, German...)

4. Stemming (Sec. 2.2.4)
§ Reduce terms to their "roots" before indexing
§ "Stemming" suggests crude affix chopping
  § language dependent
  § e.g., automate(s), automatic, automation all reduced to automat
§ For example, compressed and compression are both accepted as equivalent to compress:
  "for example compressed and compression are both accepted as equivalent to compress"
  → "for exampl compress and compress ar both accept as equival to compress"

Porter's algorithm (Sec. 2.2.4)
§ Commonest algorithm for stemming English
  § Results suggest it's at least as good as other stemming options
§ Conventions + 5 phases of reductions
  § phases applied sequentially
  § each phase consists of a set of commands
  § sample convention: out of the multiple rules in a command that apply, select the one that applies to the longest suffix.
    § E.g. caresses → caress rather than car

Typical rules (commands) in the Porter stemmer (Sec. 2.2.4)
§ sses → ss    caresses → caress
§ ies → i     ponies → poni
§ ss → ss     caress → caress
§ s →         cats → cat
§ Weight-of-word-sensitive rules:
  § (m>1) EMENT →
    § replacement → replac
    § cement → cement
  § It means: the root must be longer than 1 (measure m > 1) for the rule to apply.
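A sketch of step 1a of the Porter stemmer in Python (only the four rules above; the measure-sensitive rules like (m>1) EMENT require the full algorithm, available e.g. in NLTK's PorterStemmer). Note how the rule order enforces the longest-suffix convention:

```python
def porter_step_1a(word: str) -> str:
    # Rules are tried longest suffix first: sses wins over ss and s.
    for suffix, repl in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```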

Three stemmers: a comparison
Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
Porter's: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to pictur of express that is more biolog transpar and access to interpret
Lovins's: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
Paice's: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Stemming on Google?
§ rose
§ roses

5. Deep Natural Language Analysis
§ Has to do with more detailed Natural Language Processing algorithms
  § E.g. semantic disambiguation, phrase indexing (board of directors), named entities (President Obama = Barack Obama), etc.
§ Standard search engines increasingly use deeper techniques (e.g. Google's Knowledge Graph, http://www.google.com/insidesearch/features/search/knowledge.html, and RankBrain)
§ More (on deep NLP techniques) in the NLP course (not here)!

Knowledge map

2. Document Indexing

Why indexing
§ The purpose of storing an index is to optimize the speed and performance of finding relevant documents for a search query.
§ Without an index, the search engine would scan every document in the document archive, which would require considerable time and computing power (especially if the archive is the full web content).
§ For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.

Inverted index (Sec. 1.2)
§ For each term, we keep a list that records which documents the term occurs in. The list is called a posting list.
§ What happens if the word Caesar is added to, e.g., document #14?
§ We need variable-size postings lists (document content is often dynamic!!)

Inverted index construction (Sec. 1.2)
[Diagram: documents to be indexed ("Friends, Romans, Countrymen.") → Tokenizer → token stream (Friends, Romans, Countrymen) → Linguistic modules → modified tokens (friend, roman, countryman) → Indexer → inverted index (friend → 2, 4; roman → 1, 2; countryman → 13, 16)]

Indexer steps: token sequence (Sec. 1.2)
§ Produce a sequence of (modified token, document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Indexer steps: sort (Sec. 1.2)
§ Sort by terms, and then by docID
§ This is the core indexing step

Indexer steps: dictionary & postings (Sec. 1.2)
§ Multiple term entries in a single document are merged.
§ Split into a dictionary and postings.
§ Document frequency information is added. (Why frequency? We will discuss it later.)
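The three indexer steps just described fit in a few lines of Python (a sketch of mine, with the linguistic modules omitted): generate (token, docID) pairs, sort them, then merge duplicates into a dictionary with document frequencies and postings lists:

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, tuple[int, list[int]]]:
    # 1. sequence of (modified token, docID) pairs, 2. sorted by term then docID
    pairs = sorted((tok, doc_id)
                   for doc_id, text in docs.items()
                   for tok in text.lower().split())
    # 3. merge multiple entries of a term in the same document
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    # dictionary entry: (document frequency, postings list sorted by docID)
    return {t: (len(p), p) for t, p in postings.items()}

index = build_index({1: "i did enact julius caesar", 2: "so let it be with caesar"})
print(index["caesar"])   # (2, [1, 2])
```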

Where do we pay in storage? (Sec. 1.2)
[Diagram: the dictionary stores the terms and their counts; pointers lead from each dictionary entry to its postings, i.e. the lists of docIDs]

Dictionary data structures for inverted indexes (Sec. 3.1)
§ The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list... in what data structure?

Array (Sec. 3.1)
§ As we have just shown, the simplest structure is an array:
  § Searching for (looking up) an item is O(log(n)); inserting an item is O(n)
§ Can we store a dictionary in memory more efficiently?
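For instance (a sketch, using Python's bisect module to stand in for the sorted-array dictionary): lookup is a binary search, while insertion must shift elements:

```python
import bisect

terms = ["brutus", "caesar", "calpurnia"]        # sorted array dictionary

def lookup(term: str) -> bool:
    i = bisect.bisect_left(terms, term)          # binary search: O(log n)
    return i < len(terms) and terms[i] == term

def insert(term: str) -> None:
    bisect.insort(terms, term)                   # shifts elements: O(n)

insert("antony")
print(lookup("caesar"), terms)
```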

Alternative dictionary data structures (Sec. 3.1)
§ Two main choices:
  § Hash tables
  § Trees
§ Some IR systems use hash tables, some use trees

Hash tables (1) (Sec. 3.1)
1. Each vocabulary term is hashed to (mapped onto) an integer. (We assume you have seen hash tables before.)
§ E.g. the index for a specific keyword will be equal to the sum of the ASCII values of its characters, multiplied by their respective order in the string, taken modulo 2069 (a prime number).
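That hash function, transcribed into Python (the positional weights and the prime 2069 are exactly those described above; everything else, e.g. the function name, is mine):

```python
def keyword_hash(word: str, table_size: int = 2069) -> int:
    # Sum of the ASCII values of the characters, each multiplied by its
    # (1-based) position in the string, taken modulo a prime table size.
    return sum(i * ord(ch) for i, ch in enumerate(word, start=1)) % table_size

print(keyword_hash("caesar"), keyword_hash("brutus"))
```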

Hash tables (2)
2. The keyword is stored in the hash table, where it can be quickly retrieved using the hashed key.

Hash tables (4)
§ To achieve a good hashing mechanism, it is important to have a good hash function with the following basic requirements:
  § Easy to compute: it should be easy to compute and must not become an algorithm in itself.
  § Uniform distribution: it should provide a uniform distribution across the hash table and should not result in clustering.
  § Few collisions: collisions occur when pairs of elements are mapped to the same hash value. These should be avoided.
§ Note: irrespective of how good a hash function is, collisions occur. Therefore, to maintain the performance of a hash table, it is important to manage collisions through various collision resolution techniques.

Hash tables (5)
§ Separate chaining is one of the most commonly used collision resolution techniques.
§ It is usually implemented using linked lists.

Hash tables (6)
§ Pros:
  § Lookup is faster: O(1) on average (depends on the probability of collisions and on collision handling). Insertion is also faster: O(1).
  § Can reduce the storage requirement if n (the number of keywords) is much smaller than the universe of keys U
§ Cons:
  § good, general-purpose hash functions are very difficult to find
  § a static table size requires costly resizing if the indexed set is highly dynamic
  § search performance degrades considerably as the table nears its capacity (too many collisions)

Other structures: binary trees (Sec. 3.1)
[Diagram: a binary search tree over the dictionary; the root splits a-m / n-z, a-m splits into a-hu / hy-m, and n-z splits into n-sh / si-z]

Tree: B-tree (balanced trees) (Sec. 3.1)
A B-tree of order m is an m-way tree that satisfies the following conditions:
§ Every node has ≤ m children.
§ Every internal node (except the root) has k ≥ m/2 children.
§ The root has ≥ 2 children.
§ An internal node with k children contains (k-1) ordered keys. Its leftmost child contains keys less than or equal to the first key in the node. The second child contains keys greater than the first key but less than or equal to the second key, and so on.
[Example: a root splitting the dictionary into a-hu, hy-m, n-z]

B-tree example (numeric)

B-tree example (keywords)
§ Keywords are stored at the leaves of the tree, known as "buckets"

Another example (full words)

Inserting into a B-Tree
§ To insert key value x into a B-Tree:
  § Use the B-Tree search to determine on which node to make the insertion.
  § Insert x into the appropriate position on that leaf node.
  § If the resulting number of keys on that node is < k, then simply output that node and return.
  § Otherwise, split the node.

B-Trees pros and cons (Sec. 3.1)
§ Trees require a standard ordering of characters and hence of strings... but we typically have one
§ Pros:
  § Size is not limited, as it is for hashing
§ Cons:
  § Search is slower (w.r.t. hash tables): O(h), where h = log(n) is the depth of the B-tree [and this requires a balanced tree]
§ B-trees mitigate the rebalancing problem (w.r.t. standard binary trees)

Summary (Sec. 1.3)
§ Once keywords are extracted, we create posting lists
§ How do we search/insert a keyword? Arrays, hash tables, trees (B-trees, B+trees)
§ The next problem is: how do we process a query (= searching for documents with some combination of keywords)?

Query processing: AND (Sec. 1.3)
§ Consider processing the query: Brutus AND Caesar
  § Locate Brutus in the dictionary;
    § Retrieve its postings (e.g. pointers to the documents including Brutus).
  § Locate Caesar in the dictionary;
    § Retrieve its postings.
  § "Merge" the two postings:
    Brutus → 2, 4, 8, 16, 32, 64, 128
    Caesar → 1, 2, 3, 5, 8, 13, 21, 34

The "merge" operation (Sec. 1.3)
§ Walk through the two postings lists simultaneously, in time linear in the total number of postings entries:
  Brutus → 2, 4, 8, 16, 32, 64, 128
  Caesar → 1, 2, 3, 5, 8, 13, 21, 34
  Result → 2, 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings are sorted by docID.

Intersecting two postings lists (a "merge" algorithm)
[Step-by-step walkthrough: at each step the pointer on the smaller docID advances; matching docIDs (2, then 8) are appended to the intersection, and the walk stops when one list is exhausted.]
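A direct Python transcription of this merge (a sketch; the variable names are mine, the logic follows the classic INTERSECT algorithm):

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    # Both postings lists must be sorted by docID (crucial!).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer on the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```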

Optimization of index search (Sec. 1.3)
§ What is the best order of words for query processing?
§ Consider a query that is an AND of n terms.
§ For each of the n terms, get its postings, then AND them together.
  Brutus → 2, 4, 8, 16, 32, 64, 128
  Caesar → 1, 2, 3, 5, 8, 16, 21, 34
  Calpurnia → 13, 16
Query: Brutus AND Calpurnia AND Caesar

Query optimization example (Sec. 1.3)
§ Process words in order of increasing frequency:
  § start with the smallest set (the word with the smallest list), then keep cutting further.
  § This is why we kept the document frequency in the dictionary.
  Brutus → 2, 4, 8, 16, 32, 64, 128
  Caesar → 1, 2, 3, 5, 8, 16, 21, 34
  Calpurnia → 13, 16
Execute the query as (Calpurnia AND Brutus) AND Caesar.

More general optimization (Sec. 1.3)
§ e.g., (madding OR crowd) AND (ignoble OR strife)
§ Get the doc. freq.'s for all terms.
§ Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
§ Process the ANDs in increasing order of OR sizes.

Example
§ Recommend a query processing order for:
  (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

  Term          Freq
  eyes          213312
  kaleidoscope   87009
  marmalade     107913
  skies         271658
  tangerine      46653
  trees         316812

OR sizes: kaleidoscope OR eyes = 300321; tangerine OR trees = 363465; marmalade OR skies = 379571.
Answer: process as (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies).
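The recommendation can be computed mechanically (a small sketch using the frequencies from the table):

```python
freq = {"eyes": 213312, "kaleidoscope": 87009, "marmalade": 107913,
        "skies": 271658, "tangerine": 46653, "trees": 316812}
query = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

# Estimate each OR by the sum of its doc frequencies (conservative),
# then process the ANDs in increasing order of these estimates.
for a, b in sorted(query, key=lambda pair: freq[pair[0]] + freq[pair[1]]):
    print(f"({a} OR {b})", freq[a] + freq[b])
# (kaleidoscope OR eyes) 300321
# (tangerine OR trees) 363465
# (marmalade OR skies) 379571
```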

Skip pointers
§ Intersection is the most important operation when it comes to search engines.
§ This is because in web search most queries are implicitly intersections: e.g. "car repairs", "britney spears songs" etc. translate into "car AND repairs", "britney AND spears AND songs", which means intersecting 2 or more postings lists in order to return a result.
§ Because intersection is so crucial, search engines try to speed it up in any possible way. One such way is to use skip pointers.

Augment postings with skip pointers (at indexing time) (Sec. 2.3)
§ Idea: move the cursor on a postings list directly to the first posting whose DocID is equal to or larger than the one it is compared with.
§ Why? To avoid unnecessary comparisons.
§ Where do we place skip pointers?
[Example: list 2, 4, 8, 41, 48, 64, 128 with skip pointers 2→41 and 41→128; list 1, 2, 3, 8, 11, 17, 21, 31 with skip pointers 1→11 and 11→31]

Example: Step 1
§ Some items have two pointers: one pointing to the adjacent item, the other skipping a few items ahead.
Step 2
§ Say we are comparing p2 = 8 and p1 = 34: since 8 < 34, we must move the p2 pointer to the right. But 8 has a skip! So we first compare the docID of the skip target (31) with the docID of p1 (34).
Step 3
§ Since 31 < 34, we skip from 8 to 31, avoiding useless comparisons with 17 and 21!

Another example
Start using the normal intersection algorithm. Continue until the match 12, then advance to the next item in each list (48, 13). At this point the "car" list is on 48 and the "repairs" list is on 13, but 13 has a skip pointer. Check the value the skip pointer points at (i.e. 29): since this value is less than the current value of the "car" list (48 in our example), we follow the skip pointer and jump to this value in the list. It would be useless to compare all the elements between the current one and the skip target!!

Where do we place skips?
§ Tradeoff:
  § More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
  § Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.

Placing skips
§ Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
§ This ignores the distribution of query terms.
§ Easy if the index is relatively static; harder if L keeps changing because of updates.
§ How much do skip pointers help?
  § Traditionally, CPUs were slow, so they used to help a lot.
  § But today's CPUs are fast and disk is slow, so reducing disk postings list size dominates.

Algorithm INTERSECT with skip pointers
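The algorithm itself is not reproduced on the slide; the sketch below (mine) simulates it in Python, with skip pointers modeled as the √L evenly spaced jumps suggested by the heuristic above:

```python
def intersect_with_skips(p1: list[int], p2: list[int]) -> list[int]:
    s1 = max(1, int(len(p1) ** 0.5))   # skip span ~ sqrt(L)
    s2 = max(1, int(len(p2) ** 0.5))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            # take p1's skip only while its target is still <= p2[j]
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

car = [2, 4, 8, 12, 48, 64, 128]
repairs = [1, 2, 3, 12, 13, 29, 48, 99]
print(intersect_with_skips(car, repairs))   # [2, 12, 48]
```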

Phrase queries
§ We want to be able to answer queries such as "red brick house", as a phrase
§ red AND brick AND house matches phrases such as "red house near the brick factory", which is not what we are searching for
§ The concept of phrase queries has proven easily understood by users; it is one of the few "advanced search" ideas that works
§ About 10% of web queries are phrase queries.
§ For this, it no longer suffices to store only <term: docs> entries

A first attempt: bi-word indexes (Sec. 2.4.1)
§ Index every consecutive pair of terms in the text as a phrase
§ For example the text "Friends, Romans, Countrymen" would generate the biwords:
  § friends romans
  § romans countrymen
§ Each of these biwords is now a dictionary term
§ Two-word phrase query processing is now immediate.

Longer phrase queries (Sec. 2.4.1)
§ Longer phrases are processed using bi-words:
§ "stanford university palo alto" can be broken into the Boolean query on biwords:
  stanford university AND university palo AND palo alto
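A sketch of both steps (biword generation, and the decomposition of a longer phrase into an AND of biwords):

```python
def biwords(text: str) -> list[str]:
    toks = text.lower().replace(",", "").split()
    # every consecutive pair of terms becomes a dictionary term
    return [f"{a} {b}" for a, b in zip(toks, toks[1:])]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']

print(" AND ".join(f'"{bw}"' for bw in biwords("stanford university palo alto")))
# "stanford university" AND "university palo" AND "palo alto"
```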

Extended biwords (Sec. 2.4.1)
§ Parse the indexed text and perform part-of-speech tagging (POS tagging).
§ Identify nouns (N) and articles/prepositions (X).
§ Call any string of terms of the form N X* N (a regex: a noun, followed by any number of articles/prepositions, followed by a noun) an extended biword.
§ Each such extended biword is now made a term in the dictionary.
§ Example: "catcher in the rye" → N X X N
§ Query processing: parse the query into N's and X's
  § Segment the query into enhanced biwords
  § Look up in the index: catcher rye (N N)

Issues for biword indexes (Sec. 2.4.1)
§ Index blow-up due to a bigger dictionary
  § Infeasible for more than biwords, big even for them
§ Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Solution 2: positional indexes (Sec. 2.4.2)
§ Positional indexes are a more efficient alternative to biword indexes.
§ In a non-positional index, each posting is a document ID.
§ In a positional index, each posting is a docID and a list of positions:
  <term, number of docs containing term; doc1: position1, position2 ...; doc2: position1, position2 ...; etc.>

Example: search for "to be"
[Walkthrough on the positional postings of "to" and "be" (the number attached to each docID is the term's frequency in that doc):
§ Step 1: "to" and "be" co-occur in DOC 1
§ Step 2: but the positions are not consecutive!
§ Step 3: move the pointer of "to" in DOC 1 (not a match: 18 is after 17)
§ Step 4: move the pointer of "be" in DOC 1
§ Step 5: no matches in DOC 1!!
§ Step 6: start looking in DOC 4
§ After a number of steps: 429→430 and 433→434: DOC 4 has 2 matches!]
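In code, the per-document test reduces to finding positions of the first term that are immediately followed by a position of the second (a sketch; production code walks the two sorted position lists with two pointers, as in the steps above, rather than building a set):

```python
def phrase_matches(pos1: list[int], pos2: list[int]) -> list[int]:
    # positions of term1 that are immediately followed by term2
    following = set(pos2)
    return [p for p in pos1 if p + 1 in following]

# DOC 4 from the walkthrough: "to" at 429 and 433, "be" at 430 and 434
print(phrase_matches([429, 433], [430, 434]))   # [429, 433]: 2 matches
```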

Proximity search (Sec. 2.4.2)
§ We just saw how to use a positional index for phrase searches (phrase: a sequence of consecutive words).
§ We can also use it for proximity search.
§ For example, employment /4 place: find all documents that contain EMPLOYMENT and PLACE within 4 words of each other.
  § "Employment agencies that place health care workers are seeing growth" is a hit.
  § "Employment agencies that have learned to adapt now place health care workers" is not a hit.

Proximity search
§ Use the positional index
§ Simplest algorithm: look at the cross-product of the positions of (i) EMPLOYMENT in the document and (ii) PLACE in the document
§ Very inefficient for frequent words, especially stop words
§ Note that we want to return the actual matching positions, not just a list of documents.

Proximity intersection
An algorithm for the proximity intersection of postings lists p1 and p2: it finds the places where the two terms appear within k words of each other, and returns a list of triples giving the docID and the term positions in p1 and p2.

Example (search for a, b at max distance k = 2):
  positions: 1 2 3 4 5 6 7 8 9
  text:      a x b x x b a x b
§ Scanning a = 1: the window of b-positions is l = <3>, so output <1, 1, 3> (format: <docID, pos(a), pos(b)>)
§ Scanning a = 7: the window first extends to l = <3, 6>, then positions too far to the left are pruned, giving l = <6>, so output <1, 7, 6>
§ etc.

Positional index size (Sec. 2.4.2)
§ We need an entry for each occurrence, not just one per document
§ Index size depends on the average document size
  § The average web page has < 1000 terms
  § SEC filings, books, even some epic poems... easily 100,000 terms
§ Consider a term with frequency 0.1%:

  Document size  Postings  Positional postings
  1000           1         1
  100,000        1         100

Positional index size (Sec. 2.4.2)
§ A positional index expands postings storage substantially
  § some rough rules of thumb are to expect a positional index to be 2 to 4 times as large as a non-positional index
§ Positional indexes are now standardly used because of the power and usefulness of phrase and proximity queries

Combined scheme
§ Biword indexes and positional indexes can be profitably combined.
§ Many biwords are extremely frequent: Michael Jackson, Britney Spears etc.
§ For these biwords, the increase in speed compared to positional postings intersection is substantial.
§ Combination scheme: include frequent biwords as vocabulary terms in the index; do all other phrases by positional intersection.

Google indexing system
§ Google is continuously changing the way it handles its index
§ See a history at: http://moz.com/google-algorithm-change

Caffeine + Panda, Google Index
§ Major recent changes have been Caffeine & Panda
§ Caffeine:
  § The old index had several layers, some of which were refreshed at a faster rate than others (they had different indexes); the main layer would update every couple of weeks (the "Google dance")
  § Caffeine analyzes the web in small portions and updates the search index on a continuous basis, globally. As new pages are found, or new information about existing pages, these are added straight to the index.
§ Panda: aims to promote high-quality content sites by dooming the rank of low-quality content sites.
§ Note: Caffeine is part of the INDEXING strategy, not searching (more later in this course)

Suggested reading: https://arxiv.org/pdf/1712.01208.pdf