Web Search and Information Retrieval
1
Web Search and Information Retrieval
2
Definition of information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need within large collections (usually stored on computers)
3
Structured vs unstructured data
Structured data: information in “tables”
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
4
Unstructured data
Typically refers to free text
Allows keyword-based queries including operators
More sophisticated “concept” queries, e.g.,
find all web pages dealing with drug abuse
5
Ultimate Focus of IR
Satisfying user information need
Emphasis is on retrieval of information (not data)
Predicting which documents are relevant, and then linearly ranking them.
6
SIGIR 2005
Basic assumptions of Information Retrieval
Collection: Fixed set of documents
Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
7
The classic search model
[Diagram: the classic search model. A user task (“get rid of mice in a politically correct way”) gives rise to an info need (“info about removing mice without killing them”), which is verbalized (“How do I trap mice alive?”) and then formulated as a query (“mouse trap”) posed to a search engine over a corpus; results feed back into query refinement. Errors can creep in at each step: mis-conception (task to info need), mis-translation (info need to verbal form), and mis-formulation (verbal form to query).]
8
Boolean Queries
Some simple query examples:
Documents containing the word “Java”
Documents containing the word “Java” but not the word “coffee”
Documents containing the phrase “Java beans” or the term “API”
Documents where “Java” and “island” occur in the same sentence
The last two queries are called proximity queries
9
Before processing the queries…
Documents in the collection should be tokenized in a suitable manner
We need to decide what terms should be put in the index
10
Tokens and Terms
11
Tokenization
Input: “Friends, Romans and Countrymen”
Output: Tokens
Friends  Romans  Countrymen
Each such token is now a candidate for an index entry, after further processing (described below)
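As a minimal illustration of this step (a sketch, not from the slides), a regex-based tokenizer in Python; real tokenizers must also handle the apostrophe, hyphen, and language issues discussed next:

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace.

    Deliberately simple: \w+ keeps runs of letters/digits/underscore,
    so "O'Neill" would come out as two tokens -- see the next slides.
    """
    return re.findall(r"\w+", text)

print(tokenize("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']
```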
12
Why tokenization is difficult – even in English
Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
Tokenize this sentence
13
One word or two? (or several)
fault-finder
co-education
state-of-the-art
data base
San Francisco
cheap San Francisco-Los Angeles fares
14
Tokenization: language issues
Chinese and Japanese have no spaces between words: 莎拉波娃現在居住在美國東南部的佛羅里達。 (“Sharapova now lives in Florida, in the southeastern United States.”)
Not always guaranteed a unique tokenization
15
Ambiguous segmentation in Chinese
The two characters can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.
16
Normalization
Need to “normalize” terms in indexed text as well as query terms into the same form.
Example: We want to match U.S.A. and USA
Two general solutions:
We most commonly implicitly define equivalence classes of terms.
Alternatively: do asymmetric expansion
window → window, windows
windows → Windows, windows
Windows → Windows (no expansion)
More powerful, but less efficient
17
Case folding
Reduce all letters to lower case
Exception: upper case in mid-sentence? Fed vs. fed
Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
18
Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form
19
Stemming
Definition of stemming: a crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge
Reduce terms to their “roots” before indexing
“Stemming” suggests crude affix chopping; language dependent
e.g., automate(s), automatic, automation all reduced to automat.
For example, compressed and compression are both accepted as equivalent to compress:
for exampl compress and compress ar both accept as equival to compress
20
Porter algorithm
Most common algorithm for stemming English
Results suggest that it is at least as good as other stemming options
Phases are applied sequentially
Each phase consists of a set of commands.
Sample command: Delete final “ement” if what remains is longer than 1 character
replacement → replac
cement → cement
Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.
21
Porter stemmer: A few rules
Rule        Example
SSES → SS   caresses → caress
IES → I     ponies → poni
SS → SS     caress → caress
S →         cats → cat
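The four rules above can be sketched directly in Python. The longest-suffix convention is implemented simply by testing longer suffixes first (a toy sketch of Porter step 1a, not the full stemmer):

```python
def step_1a(word):
    """Apply the Porter step-1a rules shown above.

    Rules are tried longest suffix first, per the Porter convention:
    SSES -> SS, IES -> I, SS -> SS, S -> (drop).
    """
    if word.endswith("sses"):
        return word[:-4] + "ss"   # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"    # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
```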
22
Other stemmers
Other stemmers exist, e.g., the Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Single-pass, longest suffix removal (about 250 rules)
Full morphological analysis – at most modest benefits for retrieval
Do stemming and other normalizations help?
English: very mixed results. Helps recall for some queries but harms precision on others
E.g., the Porter stemmer equivalence class oper contains all of operate operating operates operation operative operatives operational
Definitely useful for Spanish, German, Finnish, …
23
Thesauri
Handle synonyms and homonyms
Hand-constructed equivalence classes, e.g., car = automobile, color = colour
Rewrite to form equivalence classes
Index such equivalences: when the document contains automobile, index it under car as well (usually, also vice-versa)
Or expand the query: when the query contains automobile, look under car as well
24
Stop words (1)
Stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
They have little semantic content
Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
Without suitable compression techniques, indexing stop words needs a lot of space.
Stop word elimination used to be standard in older IR systems.
25
Stop words (2)
But the trend is away from doing this:
Good compression techniques mean the space for including stop words in a system is very small
Good query optimization techniques mean you pay little at query time for including stop words
You need them for:
Phrase queries: “King of Denmark”
Various song titles, etc.: “Let it be”, “To be or not to be”
‘can’ as a verb is not very useful for keyword queries, but ‘can’ as a noun could be central to a query
Most web search engines index stop words
26
The information contained in Doc 1 and Doc 2 can be represented in the following table.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Start to process Boolean queries(1)
tid      did  pos
I        1    1
did      1    2
enact    1    3
julius   1    4
caesar   1    5
I        1    6
was      1    7
killed   1    8
i'       1    9
the      1    10
capitol  1    11
brutus   1    12
killed   1    13
me       1    14
so       2    1
let      2    2
it       2    3
be       2    4
with     2    5
caesar   2    6
the      2    7
noble    2    8
brutus   2    9
hath     2    10
told     2    11
you      2    12
caesar   2    13
was      2    14
27
Start to process Boolean queries (2)
The table above is called POSTING
By using a table like this, it is simple to answer the queries using SQL
Documents containing the word “Java”:
select did from POSTING where tid = 'java'
Documents containing the word “Java” but not the word “coffee”:
(select did from POSTING where tid = 'java') except (select did from POSTING where tid = 'coffee')
28
Start to process Boolean queries (3)
Documents containing the phrase “Java beans” or the term “API”:
with D_JAVA(did, pos) as
  (select did, pos from POSTING where tid = 'java'),
D_BEANS(did, pos) as
  (select did, pos from POSTING where tid = 'beans'),
D_JAVABEANS(did) as
  (select D_JAVA.did from D_JAVA, D_BEANS
   where D_JAVA.did = D_BEANS.did and D_JAVA.pos + 1 = D_BEANS.pos),
D_API(did) as
  (select did from POSTING where tid = 'api')
(select did from D_JAVABEANS) union (select did from D_API)
Documents where “Java” and “island” occur in the same sentence:
If sentence terminators are well defined, one can keep a sentence counter and maintain sentence positions as well as token positions in the POSTING table.
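The query above can be run as-is on any SQL engine that supports common table expressions. A runnable sketch using SQLite, with a toy POSTING table whose doc IDs and positions are invented for illustration:

```python
import sqlite3

# Toy POSTING table: (tid, did, pos) rows invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE POSTING (tid TEXT, did INTEGER, pos INTEGER)")
conn.executemany("INSERT INTO POSTING VALUES (?, ?, ?)", [
    ("java", 1, 1), ("beans", 1, 2),   # doc 1 contains the phrase "java beans"
    ("api", 2, 1),                     # doc 2 contains "api"
    ("java", 3, 5), ("coffee", 3, 6),  # doc 3 has "java" but not the phrase
])

# The phrase/term query from the slide, expressed with CTEs.
query = """
WITH D_JAVA(did, pos) AS (SELECT did, pos FROM POSTING WHERE tid = 'java'),
     D_BEANS(did, pos) AS (SELECT did, pos FROM POSTING WHERE tid = 'beans'),
     D_JAVABEANS(did) AS (SELECT D_JAVA.did FROM D_JAVA, D_BEANS
                          WHERE D_JAVA.did = D_BEANS.did
                            AND D_JAVA.pos + 1 = D_BEANS.pos),
     D_API(did) AS (SELECT did FROM POSTING WHERE tid = 'api')
SELECT did FROM D_JAVABEANS UNION SELECT did FROM D_API
"""
print(sorted(row[0] for row in conn.execute(query)))  # [1, 2]
```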
29
Is it efficient?
Although the three-column table makes it easy to write keyword queries, it wastes a great deal of space.
To reduce the storage space:
Document-term matrix → term-document matrix
Inverted index
For each term T, we must store a list of all documents that contain T.
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Calpurnia → 13 16
30
Inverted index: the basic concept
31
Inverted index
Linked lists generally preferred to arrays
Dynamic space allocation
Insertion of terms into documents easy
Space overhead of pointers
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Calpurnia → 13 16
(The terms on the left form the Dictionary; each points to a Postings list, sorted by docID; each entry in a list is a Posting.)
32
Query processing: AND
Consider processing the query: Brutus AND Caesar
Locate Brutus in the Dictionary; retrieve its postings.
Locate Caesar in the Dictionary; retrieve its postings.
“Merge” the two postings:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
33
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
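The merge can be sketched as a standard two-pointer intersection over the Brutus and Caesar lists from the slide:

```python
def intersect(p1, p2):
    """Merge two postings lists sorted by docID in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```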
34
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Term      Doc #
I         1
did       1
enact     1
julius    1
caesar    1
I         1
was       1
killed    1
i'        1
the       1
capitol   1
brutus    1
killed    1
me        1
so        2
let       2
it        2
be        2
with      2
caesar    2
the       2
noble     2
brutus    2
hath      2
told      2
you       2
caesar    2
was       2
ambitious 2
Index construction
35
Sort by terms.
External sort is used: N-way merge sort (large-scale indexer)
Term      Doc #
ambitious 2
be        2
brutus    1
brutus    2
capitol   1
caesar    1
caesar    2
caesar    2
did       1
enact     1
hath      2
I         1
I         1
i'        1
it        2
julius    1
killed    1
killed    1
let       2
me        1
noble     2
so        2
the       1
the       2
told      2
you       2
was       1
was       2
with      2
(The slide also shows the unsorted sequence of pairs from the previous slide, in document order.)
Core indexing step.
36
Multiple term entries in a single document are merged.
Frequency information is added.
Term      Doc #  Term freq
ambitious 2      1
be        2      1
brutus    1      1
brutus    2      1
capitol   1      1
caesar    1      1
caesar    2      2
did       1      1
enact     1      1
hath      2      1
I         1      2
i'        1      1
it        2      1
julius    1      1
killed    1      2
let       2      1
me        1      1
noble     2      1
so        2      1
the       1      1
the       2      1
told      2      1
you       2      1
was       1      1
was       2      1
with      2      1
(The slide also repeats the sorted pairs from the previous step.)
Why frequency? Will discuss later.
37
The result is split into a Dictionary file and a Postings file.
Dictionary file (Term, N docs, Coll freq):
ambitious 1  1
be        1  1
brutus    2  2
capitol   1  1
caesar    2  3
did       1  1
enact     1  1
hath      1  1
I         1  2
i'        1  1
it        1  1
julius    1  1
killed    1  2
let       1  1
me        1  1
noble     1  1
so        1  1
the       2  2
told      1  1
you       1  1
was       2  2
with      1  1
Postings file: the corresponding (Doc #, Freq) entries, pointed to from the dictionary, e.g., brutus → (1, 1), (2, 1); caesar → (1, 1), (2, 2).
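The core indexing steps above (emit pairs, sort by term, merge duplicates within a document, record frequencies) can be sketched in a few lines of Python, using a simplified lowercased version of the two example documents:

```python
from collections import defaultdict

def build_index(docs):
    """Sketch of the core indexing steps: emit (term, docID) pairs,
    sort, merge per-document duplicates, and add frequency information,
    yielding term -> [(docID, freq), ...] postings."""
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    pairs.sort()                      # sort by term, then docID
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if postings and postings[-1][0] == doc_id:
            # merge multiple entries for the same document
            postings[-1] = (doc_id, postings[-1][1] + 1)
        else:
            postings.append((doc_id, 1))
    return dict(index)

docs = {1: "i did enact julius caesar i was killed",
        2: "so let it be with caesar caesar was ambitious"}
index = build_index(docs)
print(index["caesar"])  # [(1, 1), (2, 2)]
```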
38
Distributed indexing
For web-scale indexing (don’t try this at home!): must use a distributed computing cluster
Individual machines are fault-prone Can unpredictably slow down or fail
How do we exploit such a pool of machines?
39
Google data centers Google data centers mainly contain
commodity machines. Data centers are distributed around the
world. Estimate: a total of 1 million servers, 3 million
processors/cores (Gartner 2007) Estimate: Google installs 100,000 servers
each quarter. Based on expenditures of 200–250 million dollars
per year
40
Distributed indexing
Maintain a master machine directing the indexing job – considered “safe”.
Break up indexing into sets of (parallel) tasks.
Master machine assigns each task to an idle machine from a pool.
41
Parallel tasks
We will use two sets of parallel tasks:
Parsers
Inverters
Break the input document corpus into splits Each split is a subset of documents
42
Parsers
Master assigns a split to an idle parser machine
Parser reads a document at a time and emits (term, doc) pairs
Parser writes pairs into j partitions
Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j=3.
Now to complete the index inversion
43
Inverters
An inverter collects all (term, doc) pairs (= postings) for one term-partition.
Sorts and writes to postings lists
44
Data flow
[Diagram: data flow. The Master assigns splits to Parsers and partitions to Inverters. Map phase: each parser writes its (term, doc) pairs into segment files partitioned by term range (a-f, g-p, q-z). Reduce phase: each inverter collects the segment files for one term range and writes the corresponding postings.]
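A single-process sketch of this data flow (function and variable names are invented for illustration; real parsers and inverters run on separate machines):

```python
from collections import defaultdict

def parse(split):
    """Map phase: a parser emits (term, doc) pairs for its split."""
    return [(term, doc_id) for doc_id, text in split for term in text.split()]

def partition_for(term):
    """Route a pair to one of j=3 partitions by first letter (a-f, g-p, q-z)."""
    return 0 if term[0] <= "f" else (1 if term[0] <= "p" else 2)

def invert(pairs):
    """Reduce phase: an inverter sorts one partition into postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if doc_id not in postings[term]:
            postings[term].append(doc_id)
    return dict(postings)

splits = [[(1, "caesar was killed")], [(2, "caesar was ambitious")]]
segments = [[], [], []]               # segment files, one per partition
for split in splits:                  # each parser handles one split
    for pair in parse(split):
        segments[partition_for(pair[0])].append(pair)
index = {}
for seg in segments:                  # each inverter handles one partition
    index.update(invert(seg))
print(index["caesar"])  # [1, 2]
```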
45
MapReduce
The index construction algorithm we just described is an instance of MapReduce.
MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing …
… without having to write code for the distribution part.
46
MapReduce
Index construction was just one phase.
Another phase: transforming a term-partitioned index into a document-partitioned index.
Term-partitioned: one machine handles a subrange of terms
Document-partitioned: one machine handles a subrange of documents
(As we discuss in the web part of the course, most search engines use a document-partitioned index … better load balancing, etc.)
47
Dynamic indexing
Up to now, we have assumed that collections are static. They rarely are:
Documents come in over time and need to be inserted.
Documents are deleted and modified.
This means that the dictionary and postings lists have to be modified:
Postings updates for terms already in dictionary
New terms added to dictionary
48
Simplest approach
Maintain “big” main index
Insertions:
New docs go into “small” auxiliary index
Search across both, merge results
Deletions:
Invalidation bit-vector for deleted docs
Filter docs output on a search result by this invalidation bit-vector
Periodically, re-index into one main index
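A minimal sketch of this main-plus-auxiliary scheme (class and method names are invented; a Python set stands in for the invalidation bit-vector):

```python
class DynamicIndex:
    def __init__(self):
        self.main = {}        # "big" main index: term -> sorted docIDs
        self.aux = {}         # "small" auxiliary index for new docs
        self.deleted = set()  # invalidation set (stands in for the bit-vector)

    def add(self, doc_id, text):
        # New documents go into the auxiliary index only.
        for term in text.split():
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        # Deletion just marks the document invalid.
        self.deleted.add(doc_id)

    def search(self, term):
        # Search both indexes, merge, and filter by the invalidation set.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(hits) if d not in self.deleted)

idx = DynamicIndex()
idx.main = {"caesar": [1, 2]}
idx.add(3, "caesar returns")
idx.delete(2)
print(idx.search("caesar"))  # [1, 3]
```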
49
Dynamic indexing at search engines
All the large search engines now do dynamic indexing
Their indices have frequent incremental changes
News items, new topical web pages
But (sometimes/typically) they also periodically reconstruct the index from scratch
Query processing is then switched to the new index, and the old index is then deleted
50
Something about dictionary
51
A naïve dictionary
An array of struct:
char[20]  int        Postings *
20 bytes  4/8 bytes  4/8 bytes
How do we quickly look up elements at query time?
How do we store a dictionary in memory efficiently?
52
Dictionary data structures
Two main choices:
Hash table Tree
Some IR systems use hashes, some trees
53
Hashes
Each vocabulary term is hashed to an integer
(We assume you’ve seen hashtables before)
Pros:
Lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants: judgment/judgement
No prefix search [tolerant retrieval]
If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
54
Trees
Simplest: binary tree
More usual: B+-tree
Pros:
Solves the prefix problem (terms starting with hyp)
Cons:
Slower: O(log M) [and this requires a balanced tree]
Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem
55
Other issues
Wild-card query
Example mon*: find all docs containing any word beginning with “mon”
Spell correction
Two main flavors:
Isolated word: check each word on its own for misspelling
Will not catch typos resulting in correctly spelled words, e.g., from → form
Context-sensitive: look at surrounding words, e.g., I flew form Heathrow to Narita.
56
Why compress the dictionary
Must keep in memory
Search begins with the dictionary
Embedded/mobile devices
57
Dictionary storage - first cut
Array of fixed-width entries
~400,000 terms; 28 bytes/term = 11.2 MB.
Term     Freq.    Postings ptr.
a        656,265  →
aachen   65       →
…        …        …
zulu     221      →
(Terms: 20 bytes; Freq. and Postings ptr.: 4 bytes each. A dictionary search structure sits on top.)
58
Fixed-width terms are wasteful
Most of the bytes in the Term column are wasted – we allot 20 bytes for 1-letter terms. And we still can’t handle supercalifragilisticexpialidocious.
Avg. dictionary word in English: ~8 characters
How do we use ~8 characters per dictionary term?
59
Compressing the term list: Dictionary-as-a-String
Store dictionary as a (long) string of characters:
…systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…
Each dictionary entry keeps its Freq., a Postings ptr., and a Term ptr. into the string (e.g., freqs 33, 29, 44, 126, …).
The pointer to the next word shows the end of the current word.
Total string length = 400K × 8B = 3.2 MB.
Term pointers resolve 3.2M positions: log₂(3.2M) ≈ 22 bits = 3 bytes.
Hope to save up to 60% of dictionary space.
60
Blocking
Store pointers to every kth term string. Example below: k=4. Need to store term lengths (1 extra byte):
…7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo…
(Each entry still keeps Freq. and a Postings ptr.; only every 4th entry keeps a Term ptr.)
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
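A sketch of blocked storage with length-prefixed terms (the last two dictionary terms here are invented to fill out a second block of k=4):

```python
terms = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomolnokite", "szymonite"]
k = 4
big_string = ""
block_ptrs = []                       # one term pointer per block of k terms
for i, t in enumerate(terms):
    if i % k == 0:
        block_ptrs.append(len(big_string))
    big_string += str(len(t)) + t     # length prefix replaces per-term pointer

def term_at(block, offset_in_block):
    """Jump to a block pointer, then walk the length prefixes to one term."""
    pos = block_ptrs[block]
    for _ in range(offset_in_block + 1):
        j = pos
        while big_string[j].isdigit():  # read the (possibly multi-digit) length
            j += 1
        length = int(big_string[pos:j])
        term, pos = big_string[j:j + length], j + length
    return term

print(term_at(1, 2))  # third term of block 1: szomolnokite
```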
61
Net
Where we used 3 bytes/pointer without blocking, 3 × 4 = 12 bytes for k=4 pointers, now we use 3+4 = 7 bytes for 4 pointers.
Shaved another ~0.5MB; can save more with larger k.
Why not go with larger k?
62
Dictionary search without blocking
Assuming each dictionary term is equally likely in a query (not really so in practice!), average number of comparisons = (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6
63
Dictionary search with blocking
Binary search down to 4-term block; then linear search through terms in block.
Blocks of 4 (binary tree), avg. = (1+2*2+2*3+2*4+5)/8 = 3 compares
64
Front coding
Front-coding: sorted words commonly have a long common prefix – store differences only (for the last k-1 terms in a block of k)
8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion
(Encodes automat once; each following number gives the extra length beyond automat.)
Begins to resemble general string compression.
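Front coding of the block above can be sketched as follows; a lozenge (◊) separates the continuation entries, and `os.path.commonprefix` finds the shared prefix:

```python
import os

def front_code(block):
    """Front-code a sorted block: write the first word's length, the
    shared prefix once (ended by *), then for each later word only the
    extra length and the differing tail, separated by a lozenge."""
    prefix = os.path.commonprefix(block)
    out = str(len(block[0])) + prefix + "*" + block[0][len(prefix):]
    for word in block[1:]:
        extra = word[len(prefix):]
        out += str(len(extra)) + "\u25ca" + extra   # ◊ marks a continuation
    return out

block = ["automata", "automate", "automatic", "automation"]
print(front_code(block))
# 8automat*a1◊e2◊ic3◊ion
```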
65
Appendix
66
B+-tree
Records must be ordered over an attribute
Queries: exact match and range queries over the indexed attribute: “find the name of the student with ID=087-34-7892” or “find all students with gpa between 3.00 and 3.5”
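At the leaf level, both query types reduce to binary search plus a forward scan over keys kept in sorted order. A sketch using Python's bisect over a flat sorted list standing in for the chained data nodes (keys and student names invented):

```python
import bisect

# Sorted (key, record) pairs stand in for the B+-tree's data nodes,
# which are chained together in key order.
students = [(100, "Ann"), (120, "Bob"), (150, "Cho"), (180, "Dee")]
keys = [k for k, _ in students]

def exact_match(key):
    """Exact-match query: descend (here, binary search) to the key."""
    i = bisect.bisect_left(keys, key)
    return students[i][1] if i < len(keys) and keys[i] == key else None

def range_query(lo, hi):
    """Range query: find the first key >= lo, then scan the leaf
    sequence forward until a key exceeds hi."""
    i = bisect.bisect_left(keys, lo)
    out = []
    while i < len(keys) and keys[i] <= hi:
        out.append(students[i])
        i += 1
    return out

print(exact_match(150))        # Cho
print(range_query(110, 160))   # [(120, 'Bob'), (150, 'Cho')]
```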
67
B+-tree: properties
Insert/delete at log_F(N/B) cost; keep the tree height-balanced. (F = fanout)
Two types of nodes: index nodes and data nodes; each node is 1 page (disk-based method)
68
[Diagram: an index node with keys 57, 81, 95 and four child pointers, leading to keys k < 57, 57 ≤ k < 81, 81 ≤ k < 95, and k ≥ 95.]
69
[Diagram: a data node (leaf) holding keys 57, 81, 95, each with a pointer to the record with that key; the node is reached from a non-leaf node and has a pointer to the next leaf in sequence.]
70
[Diagram: example B+-tree of order 3. (a) Initial tree. Index level: a root with keys 60, 80 and an index node with keys 20, 40. Data level: leaves holding 5,10 | 20 | 40,50 | 60 | 80,100.]
71
Query Example
[Diagram: range query for [32, 160]. Root keys: 100, 120, 150, 180; a lower index node with key 30; leaves: 3,5,11 | 30,35 | 100,101,110 | 120,130 | 150,156,179 | 180,200. The search descends to the leaf where 32 would appear, then follows the leaf sequence until a key exceeds 160.]