Web Search and Information Retrieval
1
Web Search and Information Retrieval
2
Definition of information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need within large collections (usually stored on computers)
3
Structured vs unstructured data
Structured data: information in “tables”
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
4
Unstructured data
Typically refers to free text
Allows keyword-based queries including operators
More sophisticated “concept” queries, e.g.,
find all web pages dealing with drug abuse
5
Ultimate Focus of IR
Satisfying user information need
Emphasis is on retrieval of information (not data)
Predicting which documents are relevant, and then linearly ranking them.
6
SIGIR 2005
Basic assumptions of Information Retrieval
Collection: Fixed set of documents
Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
7
The classic search model
[Diagram: the classic search model. A user task (“get rid of mice in a politically correct way”) gives rise to an info need (“info about removing mice without killing them”), which is verbalized (“How do I trap mice alive?”) and then formulated as a query (“mouse trap”) posed to a search engine over a corpus; results feed back into query refinement. Errors can creep in at each step: mis-conception (task to info need), mis-translation (info need to verbal form), and mis-formulation (verbal form to query).]
8
Boolean Queries
Some simple query examples:
Documents containing the word “Java”
Documents containing the word “Java” but not the word “coffee”
Documents containing the phrase “Java beans” or the term “API”
Documents where “Java” and “island” occur in the same sentence
The last two queries are called proximity queries
9
Before processing the queries…
Documents in the collection should be tokenized in a suitable manner
We need to decide what terms should be put in the index
10
Tokens and Terms
11
Tokenization
Input: “Friends, Romans and Countrymen”
Output: Tokens
Friends  Romans  Countrymen
Each such token is now a candidate for an index entry, after further processing (described below)
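As a minimal illustration of this step (a sketch, not from the slides), a regex-based tokenizer in Python; real tokenizers must also handle the apostrophe, hyphen, and language issues discussed next:

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace.

    Deliberately simple: \w+ keeps runs of letters/digits/underscore,
    so "O'Neill" would come out as two tokens -- see the next slides.
    """
    return re.findall(r"\w+", text)

print(tokenize("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']
```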
12
Why tokenization is difficult – even in English
Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
Tokenize this sentence
13
One word or two? (or several)
fault-finder
co-education
state-of-the-art
data base
San Francisco
cheap San Francisco-Los Angeles fares
14
Tokenization: language issues
Chinese and Japanese have no spaces between words: 莎拉波娃現在居住在美國東南部的佛羅里達。 (“Sharapova now lives in Florida, in the southeastern United States.”)
Not always guaranteed a unique tokenization
15
Ambiguous segmentation in Chinese
The two characters can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.
16
Normalization
Need to “normalize” terms in indexed text as well as query terms into the same form.
Example: We want to match U.S.A. and USA
Two general solutions:
We most commonly implicitly define equivalence classes of terms.
Alternatively: do asymmetric expansion
window → window, windows
windows → Windows, windows
Windows → Windows (no expansion)
More powerful, but less efficient
17
Case folding
Reduce all letters to lower case
Exception: upper case in mid-sentence? Fed vs. fed
Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
18
Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form
19
Stemming
Definition of stemming: a crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge
Reduce terms to their “roots” before indexing
“Stemming” suggests crude affix chopping; language dependent
e.g., automate(s), automatic, automation all reduced to automat.
For example, compressed and compression are both accepted as equivalent to compress:
for exampl compress and compress ar both accept as equival to compress
20
Porter algorithm
Most common algorithm for stemming English
Results suggest that it is at least as good as other stemming options
Phases are applied sequentially
Each phase consists of a set of commands.
Sample command: Delete final “ement” if what remains is longer than 1 character
replacement → replac
cement → cement
Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.
21
Porter stemmer: A few rules
Rule        Example
SSES → SS   caresses → caress
IES → I     ponies → poni
SS → SS     caress → caress
S →         cats → cat
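The four rules above can be sketched directly in Python. The longest-suffix convention is implemented simply by testing longer suffixes first (a toy sketch of Porter step 1a, not the full stemmer):

```python
def step_1a(word):
    """Apply the Porter step-1a rules shown above.

    Rules are tried longest suffix first, per the Porter convention:
    SSES -> SS, IES -> I, SS -> SS, S -> (drop).
    """
    if word.endswith("sses"):
        return word[:-4] + "ss"   # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"    # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
```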
22
Other stemmers
Other stemmers exist, e.g., the Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Single-pass, longest suffix removal (about 250 rules)
Full morphological analysis – at most modest benefits for retrieval
Do stemming and other normalizations help?
English: very mixed results. Helps recall for some queries but harms precision on others
E.g., the Porter stemmer equivalence class oper contains all of operate operating operates operation operative operatives operational
Definitely useful for Spanish, German, Finnish, …
23
Thesauri
Handle synonyms and homonyms
Hand-constructed equivalence classes, e.g., car = automobile, color = colour
Rewrite to form equivalence classes
Index such equivalences: when the document contains automobile, index it under car as well (usually, also vice-versa)
Or expand the query: when the query contains automobile, look under car as well
24
Stop words (1)
Stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
They have little semantic content
Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
Without suitable compression techniques, indexing stop words needs a lot of space.
Stop word elimination used to be standard in older IR systems.
25
Stop words (2)
But the trend is away from doing this:
Good compression techniques mean the space for including stop words in a system is very small
Good query optimization techniques mean you pay little at query time for including stop words
You need them for:
Phrase queries: “King of Denmark”
Various song titles, etc.: “Let it be”, “To be or not to be”
‘can’ as a verb is not very useful for keyword queries, but ‘can’ as a noun could be central to a query
Most web search engines index stop words
26
The information contained in Doc 1 and Doc 2 can be represented in the following table.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Start to process Boolean queries(1)
tid      did  pos
I        1    1
did      1    2
enact    1    3
julius   1    4
caesar   1    5
I        1    6
was      1    7
killed   1    8
i'       1    9
the      1    10
capitol  1    11
brutus   1    12
killed   1    13
me       1    14
so       2    1
let      2    2
it       2    3
be       2    4
with     2    5
caesar   2    6
the      2    7
noble    2    8
brutus   2    9
hath     2    10
told     2    11
you      2    12
caesar   2    13
was      2    14
27
Start to process Boolean queries (2)
The table above is called POSTING
By using a table like this, it is simple to answer the queries using SQL
Documents containing the word “Java”:
select did from POSTING where tid = 'java'
Documents containing the word “Java” but not the word “coffee”:
(select did from POSTING where tid = 'java') except (select did from POSTING where tid = 'coffee')
28
Start to process Boolean queries (3)
Documents containing the phrase “Java beans” or the term “API”:
with D_JAVA(did, pos) as
  (select did, pos from POSTING where tid = 'java'),
D_BEANS(did, pos) as
  (select did, pos from POSTING where tid = 'beans'),
D_JAVABEANS(did) as
  (select D_JAVA.did from D_JAVA, D_BEANS
   where D_JAVA.did = D_BEANS.did and D_JAVA.pos + 1 = D_BEANS.pos),
D_API(did) as
  (select did from POSTING where tid = 'api')
(select did from D_JAVABEANS) union (select did from D_API)
Documents where “Java” and “island” occur in the same sentence:
If sentence terminators are well defined, one can keep a sentence counter and maintain sentence positions as well as token positions in the POSTING table.
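The query above can be run as-is on any SQL engine that supports common table expressions. A runnable sketch using SQLite, with a toy POSTING table whose doc IDs and positions are invented for illustration:

```python
import sqlite3

# Toy POSTING table: (tid, did, pos) rows invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE POSTING (tid TEXT, did INTEGER, pos INTEGER)")
conn.executemany("INSERT INTO POSTING VALUES (?, ?, ?)", [
    ("java", 1, 1), ("beans", 1, 2),   # doc 1 contains the phrase "java beans"
    ("api", 2, 1),                     # doc 2 contains "api"
    ("java", 3, 5), ("coffee", 3, 6),  # doc 3 has "java" but not the phrase
])

# The phrase/term query from the slide, expressed with CTEs.
query = """
WITH D_JAVA(did, pos) AS (SELECT did, pos FROM POSTING WHERE tid = 'java'),
     D_BEANS(did, pos) AS (SELECT did, pos FROM POSTING WHERE tid = 'beans'),
     D_JAVABEANS(did) AS (SELECT D_JAVA.did FROM D_JAVA, D_BEANS
                          WHERE D_JAVA.did = D_BEANS.did
                            AND D_JAVA.pos + 1 = D_BEANS.pos),
     D_API(did) AS (SELECT did FROM POSTING WHERE tid = 'api')
SELECT did FROM D_JAVABEANS UNION SELECT did FROM D_API
"""
print(sorted(row[0] for row in conn.execute(query)))  # [1, 2]
```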
29
Is it efficient?
Although the three-column table makes it easy to write keyword queries, it wastes a great deal of space.
To reduce the storage space:
Document-term matrix → term-document matrix
Inverted index
For each term T, we must store a list of all documents that contain T.
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Calpurnia → 13 16
30
Inverted index: the basic concept
31
Inverted index
Linked lists generally preferred to arrays
Dynamic space allocation
Insertion of terms into documents easy
Space overhead of pointers
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Calpurnia → 13 16
(The terms on the left form the Dictionary; each points to a Postings list, sorted by docID; each entry in a list is a Posting.)
32
Query processing: AND
Consider processing the query: Brutus AND Caesar
Locate Brutus in the Dictionary; retrieve its postings.
Locate Caesar in the Dictionary; retrieve its postings.
“Merge” the two postings:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
33
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
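The merge can be sketched as a standard two-pointer intersection over the Brutus and Caesar lists from the slide:

```python
def intersect(p1, p2):
    """Merge two postings lists sorted by docID in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```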
34
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Term      Doc #
I         1
did       1
enact     1
julius    1
caesar    1
I         1
was       1
killed    1
i'        1
the       1
capitol   1
brutus    1
killed    1
me        1
so        2
let       2
it        2
be        2
with      2
caesar    2
the       2
noble     2
brutus    2
hath      2
told      2
you       2
caesar    2
was       2
ambitious 2
Index construction
35
Sort by terms.
External sort is used: N-way merge sort (large-scale indexer)
Term      Doc #
ambitious 2
be        2
brutus    1
brutus    2
capitol   1
caesar    1
caesar    2
caesar    2
did       1
enact     1
hath      2
I         1
I         1
i'        1
it        2
julius    1
killed    1
killed    1
let       2
me        1
noble     2
so        2
the       1
the       2
told      2
you       2
was       1
was       2
with      2
(The slide also shows the unsorted sequence of pairs from the previous slide, in document order.)
Core indexing step.
36
Multiple term entries in a single document are merged.
Frequency information is added.
Term      Doc #  Term freq
ambitious 2      1
be        2      1
brutus    1      1
brutus    2      1
capitol   1      1
caesar    1      1
caesar    2      2
did       1      1
enact     1      1
hath      2      1
I         1      2
i'        1      1
it        2      1
julius    1      1
killed    1      2
let       2      1
me        1      1
noble     2      1
so        2      1
the       1      1
the       2      1
told      2      1
you       2      1
was       1      1
was       2      1
with      2      1
(The slide also repeats the sorted pairs from the previous step.)
Why frequency? Will discuss later.
37
The result is split into a Dictionary file and a Postings file.
Dictionary file (Term, N docs, Coll freq):
ambitious 1  1
be        1  1
brutus    2  2
capitol   1  1
caesar    2  3
did       1  1
enact     1  1
hath      1  1
I         1  2
i'        1  1
it        1  1
julius    1  1
killed    1  2
let       1  1
me        1  1
noble     1  1
so        1  1
the       2  2
told      1  1
you       1  1
was       2  2
with      1  1
Postings file: the corresponding (Doc #, Freq) entries, pointed to from the dictionary, e.g., brutus → (1, 1), (2, 1); caesar → (1, 1), (2, 2).
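The core indexing steps above (emit pairs, sort by term, merge duplicates within a document, record frequencies) can be sketched in a few lines of Python, using a simplified lowercased version of the two example documents:

```python
from collections import defaultdict

def build_index(docs):
    """Sketch of the core indexing steps: emit (term, docID) pairs,
    sort, merge per-document duplicates, and add frequency information,
    yielding term -> [(docID, freq), ...] postings."""
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    pairs.sort()                      # sort by term, then docID
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if postings and postings[-1][0] == doc_id:
            # merge multiple entries for the same document
            postings[-1] = (doc_id, postings[-1][1] + 1)
        else:
            postings.append((doc_id, 1))
    return dict(index)

docs = {1: "i did enact julius caesar i was killed",
        2: "so let it be with caesar caesar was ambitious"}
index = build_index(docs)
print(index["caesar"])  # [(1, 1), (2, 2)]
```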
38
Distributed indexing
For web-scale indexing (don’t try this at home!): must use a distributed computing cluster
Individual machines are fault-prone Can unpredictably slow down or fail
How do we exploit such a pool of machines?
39
Google data centers Google data centers mainly contain
commodity machines. Data centers are distributed around the
world. Estimate: a total of 1 million servers, 3 million
processors/cores (Gartner 2007) Estimate: Google installs 100,000 servers
each quarter. Based on expenditures of 200–250 million dollars
per year
40
Distributed indexing
Maintain a master machine directing the indexing job – considered “safe”.
Break up indexing into sets of (parallel) tasks.
Master machine assigns each task to an idle machine from a pool.
41
Parallel tasks
We will use two sets of parallel tasks:
Parsers
Inverters
Break the input document corpus into splits Each split is a subset of documents
42
Parsers
Master assigns a split to an idle parser machine
Parser reads a document at a time and emits (term, doc) pairs
Parser writes pairs into j partitions
Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j=3.
Now to complete the index inversion
43
Inverters
An inverter collects all (term, doc) pairs (= postings) for one term-partition.
Sorts and writes to postings lists
44
Data flow
[Diagram: data flow. The Master assigns splits to Parsers and partitions to Inverters. Map phase: each parser writes its (term, doc) pairs into segment files partitioned by term range (a-f, g-p, q-z). Reduce phase: each inverter collects the segment files for one term range and writes the corresponding postings.]
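A single-process sketch of this data flow (function and variable names are invented for illustration; real parsers and inverters run on separate machines):

```python
from collections import defaultdict

def parse(split):
    """Map phase: a parser emits (term, doc) pairs for its split."""
    return [(term, doc_id) for doc_id, text in split for term in text.split()]

def partition_for(term):
    """Route a pair to one of j=3 partitions by first letter (a-f, g-p, q-z)."""
    return 0 if term[0] <= "f" else (1 if term[0] <= "p" else 2)

def invert(pairs):
    """Reduce phase: an inverter sorts one partition into postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if doc_id not in postings[term]:
            postings[term].append(doc_id)
    return dict(postings)

splits = [[(1, "caesar was killed")], [(2, "caesar was ambitious")]]
segments = [[], [], []]               # segment files, one per partition
for split in splits:                  # each parser handles one split
    for pair in parse(split):
        segments[partition_for(pair[0])].append(pair)
index = {}
for seg in segments:                  # each inverter handles one partition
    index.update(invert(seg))
print(index["caesar"])  # [1, 2]
```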
45
MapReduce
The index construction algorithm we just described is an instance of MapReduce.
MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing …
… without having to write code for the distribution part.
46
MapReduce
Index construction was just one phase.
Another phase: transforming a term-partitioned index into a document-partitioned index.
Term-partitioned: one machine handles a subrange of terms
Document-partitioned: one machine handles a subrange of documents
(As we discuss in the web part of the course, most search engines use a document-partitioned index … better load balancing, etc.)
47
Dynamic indexing
Up to now, we have assumed that collections are static. They rarely are:
Documents come in over time and need to be inserted.
Documents are deleted and modified.
This means that the dictionary and postings lists have to be modified:
Postings updates for terms already in dictionary
New terms added to dictionary
48
Simplest approach
Maintain “big” main index
Insertions:
New docs go into “small” auxiliary index
Search across both, merge results
Deletions:
Invalidation bit-vector for deleted docs
Filter docs output on a search result by this invalidation bit-vector
Periodically, re-index into one main index
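A minimal sketch of this main-plus-auxiliary scheme (class and method names are invented; a Python set stands in for the invalidation bit-vector):

```python
class DynamicIndex:
    def __init__(self):
        self.main = {}        # "big" main index: term -> sorted docIDs
        self.aux = {}         # "small" auxiliary index for new docs
        self.deleted = set()  # invalidation set (stands in for the bit-vector)

    def add(self, doc_id, text):
        # New documents go into the auxiliary index only.
        for term in text.split():
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        # Deletion just marks the document invalid.
        self.deleted.add(doc_id)

    def search(self, term):
        # Search both indexes, merge, and filter by the invalidation set.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(hits) if d not in self.deleted)

idx = DynamicIndex()
idx.main = {"caesar": [1, 2]}
idx.add(3, "caesar returns")
idx.delete(2)
print(idx.search("caesar"))  # [1, 3]
```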
49
Dynamic indexing at search engines
All the large search engines now do dynamic indexing
Their indices have frequent incremental changes
News items, new topical web pages
But (sometimes/typically) they also periodically reconstruct the index from scratch
Query processing is then switched to the new index, and the old index is then deleted
50
Something about dictionary
51
A naïve dictionary
An array of struct:
char[20]  int        Postings *
20 bytes  4/8 bytes  4/8 bytes
How do we quickly look up elements at query time?
How do we store a dictionary in memory efficiently?
52
Dictionary data structures
Two main choices:
Hash table Tree
Some IR systems use hashes, some trees
53
Hashes
Each vocabulary term is hashed to an integer
(We assume you’ve seen hashtables before)
Pros:
Lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants: judgment/judgement
No prefix search [tolerant retrieval]
If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
54
Trees
Simplest: binary tree
More usual: B+-tree
Pros:
Solves the prefix problem (terms starting with hyp)
Cons:
Slower: O(log M) [and this requires a balanced tree]
Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem
55
Other issues
Wild-card query
Example mon*: find all docs containing any word beginning with “mon”
Spell correction
Two main flavors:
Isolated word: check each word on its own for misspelling
Will not catch typos resulting in correctly spelled words, e.g., from → form
Context-sensitive: look at surrounding words, e.g., I flew form Heathrow to Narita.
56
Why compress the dictionary
Must keep in memory
Search begins with the dictionary
Embedded/mobile devices
57
Dictionary storage - first cut
Array of fixed-width entries
~400,000 terms; 28 bytes/term = 11.2 MB.
Term     Freq.    Postings ptr.
a        656,265  →
aachen   65       →
…        …        …
zulu     221      →
(Terms: 20 bytes; Freq. and Postings ptr.: 4 bytes each. A dictionary search structure sits on top.)
58
Fixed-width terms are wasteful
Most of the bytes in the Term column are wasted – we allot 20 bytes for 1-letter terms. And we still can’t handle supercalifragilisticexpialidocious.
Avg. dictionary word in English: ~8 characters
How do we use ~8 characters per dictionary term?
59
Compressing the term list: Dictionary-as-a-String
Store dictionary as a (long) string of characters:
…systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…
Each dictionary entry keeps its Freq., a Postings ptr., and a Term ptr. into the string (e.g., freqs 33, 29, 44, 126, …).
The pointer to the next word shows the end of the current word.
Total string length = 400K × 8B = 3.2 MB.
Term pointers resolve 3.2M positions: log₂(3.2M) ≈ 22 bits = 3 bytes.
Hope to save up to 60% of dictionary space.
60
Blocking
Store pointers to every kth term string. Example below: k=4. Need to store term lengths (1 extra byte):
…7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo…
(Each entry still keeps Freq. and a Postings ptr.; only every 4th entry keeps a Term ptr.)
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
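A sketch of blocked storage with length-prefixed terms (the last two dictionary terms here are invented to fill out a second block of k=4):

```python
terms = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomolnokite", "szymonite"]
k = 4
big_string = ""
block_ptrs = []                       # one term pointer per block of k terms
for i, t in enumerate(terms):
    if i % k == 0:
        block_ptrs.append(len(big_string))
    big_string += str(len(t)) + t     # length prefix replaces per-term pointer

def term_at(block, offset_in_block):
    """Jump to a block pointer, then walk the length prefixes to one term."""
    pos = block_ptrs[block]
    for _ in range(offset_in_block + 1):
        j = pos
        while big_string[j].isdigit():  # read the (possibly multi-digit) length
            j += 1
        length = int(big_string[pos:j])
        term, pos = big_string[j:j + length], j + length
    return term

print(term_at(1, 2))  # third term of block 1: szomolnokite
```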
61
Net
Where we used 3 bytes/pointer without blocking, 3 × 4 = 12 bytes for k=4 pointers, now we use 3+4 = 7 bytes for 4 pointers.
Shaved another ~0.5MB; can save more with larger k.
Why not go with larger k?
62
Dictionary search without blocking
Assuming each dictionary term is equally likely in a query (not really so in practice!), average number of comparisons = (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6
63
Dictionary search with blocking
Binary search down to 4-term block; then linear search through terms in block.
Blocks of 4 (binary tree), avg. = (1+2*2+2*3+2*4+5)/8 = 3 compares
64
Front coding
Front-coding: sorted words commonly have a long common prefix – store differences only (for the last k-1 terms in a block of k)
8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion
(Encodes automat once; each following number gives the extra length beyond automat.)
Begins to resemble general string compression.
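Front coding of the block above can be sketched as follows; a lozenge (◊) separates the continuation entries, and `os.path.commonprefix` finds the shared prefix:

```python
import os

def front_code(block):
    """Front-code a sorted block: write the first word's length, the
    shared prefix once (ended by *), then for each later word only the
    extra length and the differing tail, separated by a lozenge."""
    prefix = os.path.commonprefix(block)
    out = str(len(block[0])) + prefix + "*" + block[0][len(prefix):]
    for word in block[1:]:
        extra = word[len(prefix):]
        out += str(len(extra)) + "\u25ca" + extra   # ◊ marks a continuation
    return out

block = ["automata", "automate", "automatic", "automation"]
print(front_code(block))
# 8automat*a1◊e2◊ic3◊ion
```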
65
Appendix
66
B+-tree
Records must be ordered over an attribute
Queries: exact match and range queries over the indexed attribute: “find the name of the student with ID=087-34-7892” or “find all students with gpa between 3.00 and 3.5”
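At the leaf level, both query types reduce to binary search plus a forward scan over keys kept in sorted order. A sketch using Python's bisect over a flat sorted list standing in for the chained data nodes (keys and student names invented):

```python
import bisect

# Sorted (key, record) pairs stand in for the B+-tree's data nodes,
# which are chained together in key order.
students = [(100, "Ann"), (120, "Bob"), (150, "Cho"), (180, "Dee")]
keys = [k for k, _ in students]

def exact_match(key):
    """Exact-match query: descend (here, binary search) to the key."""
    i = bisect.bisect_left(keys, key)
    return students[i][1] if i < len(keys) and keys[i] == key else None

def range_query(lo, hi):
    """Range query: find the first key >= lo, then scan the leaf
    sequence forward until a key exceeds hi."""
    i = bisect.bisect_left(keys, lo)
    out = []
    while i < len(keys) and keys[i] <= hi:
        out.append(students[i])
        i += 1
    return out

print(exact_match(150))        # Cho
print(range_query(110, 160))   # [(120, 'Bob'), (150, 'Cho')]
```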
67
B+-tree: properties
Insert/delete at log_F(N/B) cost; keep the tree height-balanced. (F = fanout)
Two types of nodes: index nodes and data nodes; each node is 1 page (disk-based method)
68
[Diagram: an index node with keys 57, 81, 95 and four child pointers, leading to keys k < 57, 57 ≤ k < 81, 81 ≤ k < 95, and k ≥ 95.]
69
[Diagram: a data node (leaf) holding keys 57, 81, 95, each with a pointer to the record with that key; the node is reached from a non-leaf node and has a pointer to the next leaf in sequence.]
70
[Diagram: example B+-tree of order 3. (a) Initial tree. Index level: a root with keys 60, 80 and an index node with keys 20, 40. Data level: leaves holding 5,10 | 20 | 40,50 | 60 | 80,100.]
71
Query Example
[Diagram: range query for [32, 160]. Root keys: 100, 120, 150, 180; a lower index node with key 30; leaves: 3,5,11 | 30,35 | 100,101,110 | 120,130 | 150,156,179 | 180,200. The search descends to the leaf where 32 would appear, then follows the leaf sequence until a key exceeds 160.]