Inverted Indexing for Text Retrieval

Chapter 4 Lin and Dyer

Introduction

• Web search is a quintessential large-data problem.
• So are any number of problems in genomics.
– Google and Amazon (AWS) are both involved in research and discovery in this area

• Web search, or full-text search, depends on a data structure called an inverted index.

• The web search problem breaks down into three major components:
– Gathering the web content (crawling) (like project 1)
– Construction of the inverted index (indexing)
– Ranking the documents given a query (retrieval) (exam 1)

Issues with these components

• Crawling and indexing have similar characteristics: resource consumption is high

• Both typically run as offline batch processing, except, of course, in the Twitter model

• There are many requirements for a web crawler or, in general, a data aggregator:
– Etiquette, bandwidth resources, multilingual content, duplicate content, frequency of changes…
– How often to collect: too rarely may miss important updates, too often may gather too much info

04/19/2023 4

Web Crawling

• Start with a “seed” URL, say a Wikipedia page, and collect content by following the links in the seed page; the depth of traversal is also specified as input

• What are the issues?
• See page 67
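The seed-and-depth idea above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `TOY_WEB` is a hypothetical in-memory link graph standing in for real pages, `crawl` and `fetch_links` are made-up names, and none of the real-world issues (etiquette, bandwidth, duplicates) are handled.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the links found on that page.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}

def crawl(seed, max_depth, fetch_links=TOY_WEB.get):
    """Breadth-first traversal from the seed, visiting each URL at most once."""
    visited = [seed]
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        url, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the requested depth
        for link in fetch_links(url) or []:
            if link not in seen:
                seen.add(link)
                visited.append(link)
                frontier.append((link, depth + 1))
    return visited
```

With this toy graph, a depth of 1 reaches only the pages linked directly from the seed, and each extra level of depth adds the pages one more link away.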

Retrieval

• Retrieval is an online problem that demands stringent timing: sub-second response times.
– Concurrent queries
– Query latency
– Load on the servers
– Other circumstances: time of day
– Resource consumption can be spiky or highly variable

• Resource requirement for indexing is more predictable

Indexes

• Regular index: document → terms
• Inverted index: term → documents
• Example:
term1 → {d1, p}, {d2, p}, {d23, p}
term2 → {d2, p}, {d34, p}
term3 → {d6, p}, {d56, p}, {d345, p}
where d is the doc id and p is the payload (example of a payload: term frequency; it can be blank too)
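As a small illustration of the two directions, the sketch below turns a regular (document → terms) index into an inverted (term → documents) one, using term frequency as the payload. The function name `invert` and the sample documents are assumptions made for this example.

```python
from collections import Counter, defaultdict

def invert(forward_index):
    """Turn {docid: [terms]} into {term: [(docid, tf), ...]} sorted by docid."""
    inverted = defaultdict(list)
    for docid in sorted(forward_index):
        # Counter gives the term frequency used here as the payload.
        for term, tf in Counter(forward_index[docid]).items():
            inverted[term].append((docid, tf))
    return dict(inverted)

# Hypothetical three-document corpus.
forward = {1: ["fish", "fish", "one"], 2: ["fish", "red"], 3: ["red", "bird", "red"]}
inverted = invert(forward)
```

Here `inverted["fish"]` holds one posting per document containing "fish", each carrying that document's term frequency.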

Inverted Index

• An inverted index consists of postings lists, one associated with each term that appears in the corpus.

• <t, posting>n

• <t, <docid, tf> >n

• <t, <docid, tf, other info>>n

• Key, value pair where the key is the term (word) and the value is the docid, followed by a “payload”

• The payload can be empty for a simple index
• The payload can be complex: it can provide such details as co-occurrences, additional linguistic processing, page rank of the doc, etc.
• <t2, <d1, d4, d67, d89>>
• <t3, <d4, d6, d7, d9, d22>>
• Document numbers typically do not have semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on page ranks.

Retrieval

• Once the inverted index is built, retrieval for an incoming query involves fetching the appropriate docs.

• The docs are ranked and the top k docs are listed.
• It is good to have the inverted index in memory.
• If not, some queries may involve random disk access for decoding of postings.
• Solution: organize the disk accesses so that random seeks are minimized.
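A minimal sketch of in-memory retrieval follows, assuming the index is a dictionary of postings lists. The scoring (summing term frequencies over the query terms) is a stand-in chosen for brevity, not the ranking method of the source; `INDEX` and `retrieve` are hypothetical names.

```python
import heapq
from collections import defaultdict

# Hypothetical in-memory index: term -> [(docid, tf), ...]
INDEX = {
    "fish": [(1, 2), (2, 1)],
    "red": [(2, 1), (3, 2)],
    "bird": [(3, 1)],
}

def retrieve(index, query_terms, k):
    """Score docs by summed tf over the query terms and return the top k."""
    scores = defaultdict(int)
    for term in query_terms:
        for docid, tf in index.get(term, []):
            scores[docid] += tf
    # Highest score first; ties are broken by the smaller docid.
    return heapq.nsmallest(k, scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, `retrieve(INDEX, ["red", "bird"], 2)` ranks doc 3 ahead of doc 2 because doc 3 matches both terms.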

Pseudo Code

• Baseline implementation
• Value-key conversion pattern implementation…

Inverted Index: Baseline Implementation using MR

• Input to the mapper consists of the docid and the actual content.
• Each document is analyzed and broken down into terms.
• Processing pipeline, assuming HTML docs:
– Strip HTML tags
– Strip JavaScript code
– Tokenize using a set of delimiters
– Case fold
– Remove stop words (a, an, the…)
– Remove domain-specific stop words
– Stem different forms (..ing, ..ed…, dogs → dog)
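The processing pipeline above might look roughly like the sketch below. It is deliberately naive: the stop-word set is illustrative, `naive_stem` is a crude suffix stripper rather than a real stemmer, and script blocks are removed before other tags so that their contents do not survive as text. All names are assumptions for this example.

```python
import re

STOP_WORDS = {"a", "an", "the", "and", "of", "in", "is"}  # illustrative only

def naive_stem(token):
    """Crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(html):
    """Turn an HTML document into a list of normalized terms."""
    text = re.sub(r"(?is)<script.*?>.*?</script>", " ", html)  # strip JavaScript
    text = re.sub(r"(?s)<[^>]+>", " ", text)                   # strip remaining tags
    tokens = re.split(r"[^A-Za-z0-9]+", text)                  # tokenize on delimiters
    tokens = [t.lower() for t in tokens if t]                  # case fold
    tokens = [t for t in tokens if t not in STOP_WORDS]        # remove stop words
    return [naive_stem(t) for t in tokens]                     # stem
```

A domain-specific stop list would simply be a second set filtered in the same way as `STOP_WORDS`.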

Baseline implementation

procedure Map(docid n, doc d)
    H ← new AssociativeArray
    for all term t in doc d do
        H{t} ← H{t} + 1
    for all term t in H do
        Emit(term t, posting <n, H{t}>)

Reducer for baseline implementation

procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, …])
    P ← new List
    for all posting <a, f> in postings do
        Append(P, <a, f>)
    Sort(P)    // sorted by docid
    Emit(term t, postings P)
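To make the baseline concrete, here is a single-process simulation of the mapper and reducer above. It is a sketch, not a real MapReduce job: a dictionary stands in for the shuffle and sort phase, documents are tokenized by whitespace only, and the names `map_doc`, `reduce_term`, and `run_baseline` are hypothetical.

```python
from collections import Counter, defaultdict

def map_doc(docid, doc):
    """Emit (term, (docid, tf)) for every distinct term in the document."""
    for term, tf in Counter(doc.split()).items():
        yield term, (docid, tf)

def reduce_term(term, postings):
    """Buffer all postings for a term, sort them by docid, emit one list."""
    return term, sorted(postings)

def run_baseline(docs):
    groups = defaultdict(list)  # stands in for shuffle and sort (group by term)
    for docid, doc in docs.items():
        for term, posting in map_doc(docid, doc):
            groups[term].append(posting)
    return dict(reduce_term(t, groups[t]) for t in sorted(groups))

# Hypothetical corpus, deliberately listed out of docid order.
docs = {3: "one red bird", 1: "one fish two fish", 2: "red fish blue fish"}
index = run_baseline(docs)
```

Because the documents arrive out of docid order, the postings for each term only come out ordered thanks to the explicit sort in the reducer.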

Shuffle and sort phase

• It is a very large group-by-term of the postings
• Let's look at a toy example
• See Fig. 4.3 (some items in the figure are incorrect)

Baseline MR for II

class Mapper
    procedure Map(docid n, doc d)
        H ← new AssociativeArray
        for all term t in doc d do
            H{t} ← H{t} + 1
        for all term t in H do
            Emit(term t, posting <n, H{t}>)

class Reducer
    procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, …])
        P ← new List
        for all posting <n, f> in postings [<n1, f1>, <n2, f2>, …] do
            Append(P, <n, f>)
        Sort(P)
        Emit(term t, postings P)

Revised Implementation

• Issue: MR does not guarantee the sort order of the values, only of the keys

• So the sort in the reducer is an expensive operation, esp. if the postings cannot be held in memory.

• Let's check a revised solution: change
• (term t, posting<docid, f>) to
• (term<t,docid>, tf f)

Inverted Index: Revised implementation

• From the baseline to an improved version
• Observe the sort done by the Reducer. Is there any way to push this into the MR runtime?
• Instead of
– (term t, posting<docid, f>)
• Emit
– (tuple<t, docid>, tf f)
• This is our previously studied value-key conversion design pattern
• This switch ensures the keys arrive in order at the reducer
• Small memory footprint; less buffer space needed at the reducer
• See Fig. 4.4

Modified mapper

Map(docid n, doc d)
    H ← new AssociativeArray
    for all terms t in doc d
        H{t} ← H{t} + 1
    for all terms t in H
        emit(tuple <t, n>, H{t})

Modified Reducer

Initialize
    tprev ← ∅
    P ← new PostingsList

method reduce(tuple <t, n>, tf [f])
    if t ≠ tprev ∧ tprev ≠ ∅ {
        emit(term tprev, postings P)
        reset P
    }
    P.add(<n, f>)
    tprev ← t

Close
    emit(term tprev, postings P)

Improved MR for II

class Mapper
    method Map(docid n, doc d)
        H ← new AssociativeArray
        for all term t in doc d do
            H[t] ← H[t] + 1
        for all term t in H do
            Emit(tuple <t, n>, tf H[t])

class Reducer
    method Initialize
        tprev ← ∅
        P ← new PostingsList
    method Reduce(tuple <t, n>, tf [f])
        if t ≠ tprev ∧ tprev ≠ ∅ then
            Emit(term tprev, postings P)
            P.Reset()
        P.Add(<n, f>)
        tprev ← t
    method Close
        Emit(term tprev, postings P)
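The revised design can be simulated in one process as follows. This sketch assumes that sorting the list of pairs by key plays the role of the framework's sort, and that all keys reach a single reducer; `map_doc`, `Reducer`, and `run_revised` are hypothetical names.

```python
from collections import Counter

def map_doc(docid, doc):
    """Emit ((term, docid), tf): the docid moves from the value into the key."""
    for term, tf in Counter(doc.split()).items():
        yield (term, docid), tf

class Reducer:
    def __init__(self):
        self.prev_term = None
        self.postings = []
        self.output = []

    def reduce(self, key, tf):
        term, docid = key
        # A new term means the previous term's postings list is complete.
        if self.prev_term is not None and term != self.prev_term:
            self.output.append((self.prev_term, self.postings))
            self.postings = []
        self.postings.append((docid, tf))
        self.prev_term = term

    def close(self):
        # Flush the postings list of the last term seen.
        if self.prev_term is not None:
            self.output.append((self.prev_term, self.postings))

def run_revised(docs):
    pairs = [kv for docid, doc in docs.items() for kv in map_doc(docid, doc)]
    pairs.sort(key=lambda kv: kv[0])  # stands in for the framework's sort by key
    reducer = Reducer()
    for key, tf in pairs:
        reducer.reduce(key, tf)
    reducer.close()
    return dict(reducer.output)

docs = {3: "one red bird", 1: "one fish two fish", 2: "red fish blue fish"}
revised_index = run_revised(docs)
```

The reducer never sorts and holds only one term's postings at a time, which is the memory saving the value-key conversion is meant to buy.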

Other modifications

• The partitioner and shuffle have to deliver all related <key, value> pairs to the same reducer

• A custom partitioner is needed so that all tuples <t, n> with the same term t go to the same reducer.

• Let's go through a numerical example
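A custom partitioner for the <t, n> keys can be sketched as below: it hashes only the term, so every key with the same term lands on the same reducer regardless of docid. The function name `partition` is hypothetical, and `zlib.crc32` is used only because it is a hash that stays stable across Python processes.

```python
import zlib

def partition(key, num_reducers):
    """Choose a reducer from the term alone, ignoring the docid in the key."""
    term, _docid = key
    return zlib.crc32(term.encode("utf-8")) % num_reducers
```

Partitioning on the whole tuple instead would scatter one term's postings across reducers, and each reducer would then emit only a fragment of that term's postings list.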

What about retrieval?

• While MR is great for indexing, it is not great for retrieval.

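Index compression for space

• Section 4.5
• Postings as (docid, tf) pairs: (5,2), (7,3), (12,1), (49,1), (51,2)…
• The same postings with each docid replaced by its gap from the previous docid: (5,2), (2,3), (5,1), (37,1), (2,2)…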
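The two posting sequences on this slide are related by gap encoding: each docid is replaced by its difference from the previous docid, which keeps the numbers small when the list is sorted. The sketch below shows the transformation and its inverse; the function names are hypothetical, and the actual bit-level coding of the gaps is not shown.

```python
def to_gaps(postings):
    """Replace each docid with its difference from the previous docid."""
    gaps, prev = [], 0
    for docid, tf in postings:
        gaps.append((docid - prev, tf))
        prev = docid
    return gaps

def from_gaps(gaps):
    """Recover the original docids by summing the gaps."""
    postings, docid = [], 0
    for gap, tf in gaps:
        docid += gap
        postings.append((docid, tf))
    return postings
```

The encoding is only valid when the postings are sorted by docid, which is one more reason the sort order matters during indexing.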

Miscellaneous Stuff

• How do we apply MR to the spam filtering (Naïve Bayes) solution discussed in Ch. 4 of DDS? In training the model.

• Write the solution in the form of your main workflow configuration.

• The prior answers the question: what is the probability of x occurring at random? E.g., what is the probability that the next person who walks into the class is female?

NIH Solicitation in Big Data (2014)

• ..
• This opportunity targets four topic areas of high need for researchers working with biomedical Big Data:
1. Data Compression/Reduction
2. Data Provenance
3. Data Visualization
4. Data Wrangling

Odds Ratio Example from 4/16/2014 news article

• Woods is still favored to win the U.S. Open. He and Rory McIlroy are each 10/1 favorites on the online betting site Bovada. Adam Scott has the next best odds at 12/1…..

• How to interpret this?

• Woods is also the favorite to win the Open Championship at Hoylake in July. He's 7/1 there.
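One common reading of such quotes, assumed here, is that "10/1" means odds of 10 to 1 against the outcome, so the implied probability of a win is 1/(10 + 1). The sketch below computes that value; the function name is hypothetical, and the bookmaker's margin is ignored.

```python
from fractions import Fraction

def implied_probability(against, for_=1):
    """Fractional odds 'against/for_' (e.g. 10/1) -> implied win probability."""
    return Fraction(for_, against + for_)
```

Under this reading, 10/1 corresponds to a probability of 1/11 (about 9%), 12/1 to 1/13, and 7/1 to 1/8.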

http://sports.bovada.lv/sports-betting/golf-futures.jsp

http://msn.foxsports.com/golf/player/rory-mcilroy/1276151

http://msn.foxsports.com/golf/player/adam-scott/1277146
