Document Preprocessing and Indexing (SI650: Information Retrieval)


Page 1: Document Preprocessing and Indexing

SI650: Information Retrieval
Winter 2010
School of Information, University of Michigan
© 2010 University of Michigan

Page 2: Typical IR system architecture

[Figure: documents flow through INDEXING into a document representation (Doc Rep, sketched as a sparse term-document count matrix); the user's query passes through the INTERFACE into a query representation (Query Rep); SEARCHING matches the two and Ranking produces results; the user's relevance judgments feed back through QUERY MODIFICATION (Feedback).]

- From ChengXiang Zhai's slides

Page 3: Overload of text content

Content type                Amount / day
Published content           3-4 GB
Professional web content    ~2 GB
User-generated content      8-10 GB
Private text content        ~3 TB

- Ramakrishnan and Tomkins 2007

Page 4: Data volume behind online information systems

[Figure: per-day inflows (~150k to ~3M new items/day, e.g., ~750k/day) and total collection sizes (from 1M through 6M and 10B up to ~100B items) for several online information systems; the systems themselves were identified only by logos in the original slide.]

Page 5: IR Winter 2010

Automated indexing/labeling

Storing, indexing and searching text.

Inverted indexes.

Page 6: Handling large collections

• Life is good when every document is mapped into a vector of words, but …

• Consider N = 1 million documents, each with about 1000 words.

• Avg 6 bytes/word including spaces/punctuation – 6GB of data in the documents.

• Say there are M = 500K distinct terms among these.

Sec. 1.1


Page 7: Storage issue

• A 500K x 1M matrix has half a trillion elements.
  – 4 bytes per integer
  – 500K x 1M x 4 = 2 TB (your laptop would fail)
  – 500K x 100G x 4 = 2x10^5 TB (challenging even for Google)
• But it has no more than one billion positive entries.
  – The matrix is extremely sparse.
  – 1000 x 1M x 4 = 4 GB
• What's a better representation?

Sec. 1.1

Page 8: Indexing

• Indexing = converting documents to data structures that enable fast search
• The inverted index is the dominant indexing method (used by all search engines)
• Other indices (e.g., a document index) may be needed for feedback

Page 9: Inverted index

• Instead of an incidence vector, use a posting table:
  – CLEVELAND: D1, D2, D6
  – OHIO: D1, D5, D6, D7
• Use linked lists to be able to insert new document postings in order and to remove existing postings.
• More efficient than scanning docs (why?)
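As a concrete illustration, here is a minimal in-memory posting table in Python (a sketch of ours, not code from the slides; the function name and toy documents are assumptions, and sorted Python lists stand in for the linked lists mentioned above):

```python
# Minimal in-memory inverted index: term -> sorted list of docIDs.
from collections import defaultdict

def index_docs(docs: dict[int, str]) -> dict[str, list[int]]:
    postings = defaultdict(list)
    for doc_id in sorted(docs):                    # visit docs in docID order...
        for term in set(docs[doc_id].upper().split()):
            postings[term].append(doc_id)          # ...so each list stays sorted
    return dict(postings)

docs = {1: "Cleveland Ohio", 2: "Cleveland", 5: "Ohio",
        6: "Cleveland Ohio", 7: "Ohio"}
index = index_docs(docs)
print(index["CLEVELAND"])   # [1, 2, 6]
print(index["OHIO"])        # [1, 5, 6, 7]
```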

Page 10: Inverted index

• Fast access to all docs containing a given term (along with frequency and position information)
• For each term, we get a list of tuples: (docID, freq, pos).
• Given a query, we can fetch the lists for all query terms and work on the involved documents.
  – Boolean query: set operations
  – Natural language query: term weight summing
• Keep everything sorted! This gives you a logarithmic improvement in access.

Page 11: Inverted index - example

• For each term t, we must store a list of all documents that contain t.
  – Identify each by a docID, a document serial number

Sec. 1.2

Dictionary     Postings
Brutus     ->  1  2  4  11  31  45  173  174
Caesar     ->  1  2  4  5   6   16  57   132
Calpurnia  ->  2  31 54 101

- From Chris Manning's slides

Page 12: Inverted index - example

Doc 1: "This is a sample document with one sample sentence"
Doc 2: "This is another sample document"

Dictionary                     Postings (Doc id, Freq, Pos)
Term      # docs  Total freq
This      2       2            (1, 1, 1), (2, 1, 1)
is        2       2            (1, 1, 2), (2, 1, 2)
sample    2       3            (1, 2, {4, 8}), (2, 1, 4)
another   1       1            (2, 1, 3)
...       ...     ...          ...

- From ChengXiang Zhai's slides

Page 13: Basic operations on inverted indexes

• Conjunction (AND): iterative merge of the two postings lists in O(x+y) (see the sketch below)
• Disjunction (OR): very similar
• Negation (NOT): can we still do it in O(x+y)?
  – Example: MICHIGAN AND NOT OHIO
  – Example: MICHIGAN OR NOT OHIO
• Recursive operations
• Optimization: start with the smallest sets
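A minimal sketch of the O(x+y) AND merge (illustrative, not the slides' code), run on the Brutus/Caesar lists from the earlier example:

```python
# O(x+y) intersection (AND) of two sorted postings lists.
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:               # docID appears in both lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:              # advance the list with the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))         # [1, 2, 4]
```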

Page 14: Data structures for inverted index

• Dictionary: modest size
  – Needs fast random access
  – Preferred to be in memory
  – Hash table, B-tree, trie, ...
• Postings: huge
  – Sequential access is expected
  – Can stay on disk
  – May contain docID, term freq., term pos., etc.
  – Compression is desirable

Page 15: Constructing inverted index

• The main difficulty is to build a huge index with limited memory
• Memory-based methods: not usable for large collections
• Sort-based methods (see the sketch below):
  – Step 1: collect local (termID, docID, freq) tuples
  – Step 2: sort local tuples (to make "runs")
  – Step 3: pair-wise merge runs
  – Step 4: output the inverted file
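A toy, in-memory rendition of these four steps (a sketch under the assumption of a tiny corpus; real indexers write the sorted runs to disk and merge them pair-wise, whereas this uses a single k-way merge):

```python
# Toy sort-based inversion over a tiny in-memory corpus.
import heapq
from collections import Counter

docs = {1: "the cold days", 2: "a cold day", 3: "in the cold"}
term_ids, runs = {}, []

# Steps 1-2: collect per-document (termID, docID, freq) tuples, sorted by termID.
for doc_id, text in docs.items():
    counts = Counter(text.split())
    for term in counts:
        term_ids.setdefault(term, len(term_ids) + 1)
    runs.append(sorted((term_ids[t], doc_id, f) for t, f in counts.items()))

# Step 3: merge the sorted runs (k-way here; pair-wise on the slides).
merged = heapq.merge(*runs)

# Step 4: output the inverted file -- all postings for a term are contiguous.
for term_id, doc_id, freq in merged:
    print(term_id, doc_id, freq)
```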

Page 16: Sort-based inversion

[Figure: the sort-based inversion pipeline. Parse & count emits <termID, docID, freq> tuples per document (e.g., <1,1,3>, <1,2,2>, <2,1,2>, <2,4,3>, ...), using a term lexicon (the -> 1, cold -> 2, days -> 3, a -> 4, ...) and a docID lexicon (doc1 -> 1, doc2 -> 2, ...). The tuples, initially grouped by docID, are sorted locally by termID into runs; a merge sort over the runs produces the final file, in which all information about term 1 (and every other term) is contiguous.]

Page 17: IR Winter 2010

Document preprocessing.

Tokenization. Stemming.

The Porter algorithm.

Page 18: Can we make it even better?

• Index term selection/normalization
  – Reduces the size of the vocabulary
• Index compression
  – Reduces the storage space

Page 19: Should we index every term?

• How big is English?
  – Dictionary marketing
  – Education (testing of vocabulary size)
  – Psychology
  – Statistics
  – Linguistics
• Two very different answers:
  – Chomsky: language is infinite
  – Shannon: 1.25 bits per character
• Should we care about a term if nobody uses it as a query?

Page 20: What is a good indexing term?

• Specific (phrases) or general (single words)?
• Luhn found that words with middle frequency are most useful
  – Not too specific (low utility, but still useful!)
  – Not too general (lack of discrimination; stop words)
  – Stop word removal is common, but rare words are kept
• All words or a (controlled) subset? When term weighting is used, it is a matter of weighting, not of selecting indexing terms (more later)

Page 21: Term selection for indexing

• Manual: e.g., Library of Congress subject headings, MeSH

• Automatic: e.g., TF*IDF based


Page 22: LOC subject headings

http://www.loc.gov/catdir/cpso/lcco/lcco.html

A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)

Page 23: Medicine

CLASS R - MEDICINE

Subclass R

R5-920 Medicine (General)

R5-130.5 General works

R131-687 History of medicine. Medical expeditions

R690-697 Medicine as a profession. Physicians

R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.

R711-713.97 Directories

R722-722.32 Missionary medicine. Medical missionaries

R723-726 Medical philosophy. Medical ethics

R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying

R727-727.5 Medical personnel and the public. Physician and the public

R728-733 Practice of medicine. Medical practice economics

R735-854 Medical education. Medical schools. Research

R855-855.5 Medical technology

R856-857 Biomedical engineering. Electronics. Instrumentation

R858-859.7 Computer applications to medicine. Medical informatics

R864 Medical records

R895-920 Medical physics. Medical radiology. Nuclear medicine

Page 24: Automatic term selection methods

• TF*IDF: pick terms with the highest TF*IDF scores (see the sketch below)
• Centroid-based: pick terms that appear in the centroid with high scores
• The maximal marginal relevance principle (MMR)
• Related to summarization and snippet generation
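A minimal sketch of TF*IDF-based term selection (illustrative only; the toy corpus and the particular weighting variant, raw tf times log(N/df), are our assumptions, not the slides'):

```python
# Pick each document's top-k terms by TF*IDF score.
import math
from collections import Counter

docs = [d.split() for d in ["the cold days of winter",
                            "cold information retrieval days",
                            "retrieval of information"]]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequency

def top_terms(doc: list[str], k: int = 3) -> list[str]:
    tf = Counter(doc)                                     # term frequency
    score = {t: tf[t] * math.log(N / df[t]) for t in tf}  # tf * idf
    return sorted(score, key=score.get, reverse=True)[:k]

for doc in docs:
    print(top_terms(doc))
```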

Page 25: Non-English languages

– Arabic: كتاب ("book")
– Japanese: この本は重い。("This book is heavy.")
– Chinese: 信息檢索 ("information retrieval")
– German: Lebensversicherungsgesellschaftsangestellter ("life insurance company employee")

Page 26: Document preprocessing

• What should we use to index?
• Dealing with formatting and encoding issues
• Hyphenation, accents, stemming, capitalization
• Tokenization:
  – Paul's, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc, can't
  – Example: "The New York-Los Angeles flight"

Page 27: Document preprocessing

• Normalization:
  – Casing (cat vs. CAT)
  – Stemming (computer, computation)
  – String matching
  – labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi
• Index reduction:
  – Dropping stop words ("and", "of", "to")
  – Problematic for "to be or not to be"

Page 28: Tokenization

• Normalize lexical units: words with similar meanings should be mapped to the same indexing term
• Stemming: map all inflectional forms of words to the same root form, e.g.
  – computer -> compute
  – computation -> compute
  – computing -> compute (but king -> k?)
• Porter's stemmer is popular for English (see the example below)
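For instance, with NLTK's implementation of the Porter stemmer (assumes NLTK is installed; note that it produces the stem "comput" rather than the slide's schematic "compute", and leaves "king" intact):

```python
# Stemming a few words with NLTK's Porter stemmer.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computation", "computing", "king"]:
    print(word, "->", stemmer.stem(word))
# computer -> comput, computation -> comput, computing -> comput, king -> king
```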

Page 29: Porter's algorithm

Example: the word "duplicatable"

duplicat     rule from step 4
duplicate    rule from step 1b1
duplic       rule from step 3

The application of another rule in step 4, removing "ic," cannot be applied, since only one rule from each step is allowed to be applied.

Page 30: Porter's algorithm

Computable      -> Comput
Intervention    -> Intervent
Retrieval       -> Retriev
Document        -> Docum
Representing    -> Repres
Representative  -> Repres

Page 31: Links

• http://maya.cs.depaul.edu/~classes/ds575/porter.html

• http://www.tartarus.org/~martin/PorterStemmer/def.txt


Page 32: IR Winter 2010

…Approximate string matching

Page 33: Approximate string matching

• The Soundex algorithm (Odell and Russell)
• Uses:
  – spelling correction
  – hash function
  – non-recoverable (the original string cannot be reconstructed from the code)

Page 34: The Soundex algorithm

1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions.
2. Assign the following numbers to the remaining letters after the first:
   b, f, p, v             : 1
   c, g, j, k, q, s, x, z : 2
   d, t                   : 3
   l                      : 4
   m, n                   : 5
   r                      : 6

Page 35: The Soundex algorithm

3. If two or more letters with the same code were adjacent in the original name, omit all but the first.
4. Convert to the form "LDDD" by adding terminal zeros or by dropping rightmost digits.

Examples:
Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300
(same as Ellery, Ghosh, Heilbronn, Kant, and Ladd)

Some problems: Rogers and Rodgers, Sinclair and StClair
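A compact sketch of steps 1-4 in Python (the function name and structure are ours, not from the slides):

```python
# Soundex, following steps 1-4 above.
def soundex(name: str) -> str:
    table = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqszx", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            table[ch] = digit            # a, e, h, i, o, u, w, y get no digit
    name = name.lower()
    result = name[0].upper()             # step 1: retain the first letter
    prev = table.get(name[0], "")
    for ch in name[1:]:
        code = table.get(ch, "")
        if code and code != prev:        # steps 1-3: drop uncoded letters,
            result += code               # collapse adjacent equal codes
        prev = code
    return (result + "000")[:4]          # step 4: pad/truncate to LDDD

for name in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd"]:
    print(name, soundex(name))           # E460 G200 H416 K530 L300
```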

Page 36: Levenshtein edit distance

• Examples:
  – theatre -> theater
  – Ghaddafi -> Qadafi
  – computer -> counter
• Edit distance (inserts, deletes, substitutions)
  – Edit transcript
• Computed through dynamic programming

Page 37: Recurrence relation

• Three dependencies:
  – D(i, 0) = i
  – D(0, j) = j
  – D(i, j) = min[ D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j) ]
• Simple edit distance:
  – t(i, j) = 0 if S1(i) = S2(j), and 1 otherwise
• Target: D(l1, l2), for strings of lengths l1 and l2
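The recurrence translates directly into a dynamic program; a minimal sketch (ours, not the slides' code), run on the VINTNER/WRITERS example from the next slides:

```python
# Edit distance via the recurrence above.
def edit_distance(s1: str, s2: str) -> int:
    l1, l2 = len(s1), len(s2)
    D = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    for i in range(l1 + 1):
        D[i][0] = i                               # D(i, 0) = i
    for j in range(l2 + 1):
        D[0][j] = j                               # D(0, j) = j
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # delete
                          D[i][j - 1] + 1,        # insert
                          D[i - 1][j - 1] + t)    # match / substitute
    return D[l1][l2]                              # target: D(l1, l2)

print(edit_distance("vintner", "writers"))        # 5
```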

Page 38: Example

Gusfield 1997

         W   R   I   T   E   R   S
     0   1   2   3   4   5   6   7
 V   1
 I   2
 N   3
 T   4
 N   5
 E   6
 R   7

Page 39: Example (cont'd)

Gusfield 1997

         W   R   I   T   E   R   S
     0   1   2   3   4   5   6   7
 V   1   1   2   3   4   5   6   7
 I   2   2   2   2   3   4   5   6
 N   3   3   3   3   3   4   5   6
 T   4   4   4   4   *
 N   5
 E   6
 R   7

Page 40: Tracebacks

Gusfield 1997

[The original slide repeats the table from the previous slide, overlaid with traceback pointers recording which of the three dependencies produced each cell's value.]

Page 41: Weighted edit distance

• Used to emphasize the relative cost of different edit operations
• Useful in bioinformatics
  – Homology information
  – BLAST
  – Blosum
  – http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html

Page 42: Links

• Web sites:
  – http://www.merriampark.com/ld.htm
  – http://odur.let.rug.nl/~kleiweg/lev/
• Demo:
  – http://nayana.ece.ucsb.edu/imsearch/imsearch.html

Page 43: IR Winter 2010

… Index Compression

IR Toolkits

Page 44: Inverted index compression

• Compress the postings
• Observations:
  – Each inverted list is sorted (e.g., by docID or term freq.)
  – Small numbers tend to occur more frequently
• Implications:
  – "d-gaps" (store differences): d1, d2-d1, d3-d2, ...
  – Exploit the skewed frequency distribution: fewer bits for small (high-frequency) integers
• Binary code, unary code, γ-code, δ-code

Page 45: Integer compression

• In general, the goal is to exploit a skewed distribution
• Binary: equal-length coding
• Unary: x >= 1 is coded as (x-1) one bits followed by a zero, e.g., 3 => 110; 5 => 11110
• γ-code: x => the unary code for 1 + floor(log2 x), followed by x - 2^floor(log2 x) in binary, using floor(log2 x) bits, e.g., 3 => 101; 5 => 11001
• δ-code: same as the γ-code, but the unary prefix is replaced by its γ-code, e.g., 3 => 1001; 5 => 10101
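A sketch of unary and γ encoders matching the definitions above (function names are ours):

```python
# Unary and gamma codes for positive integers.
def unary(x: int) -> str:
    return "1" * (x - 1) + "0"             # x >= 1: (x-1) ones, then a zero

def gamma(x: int) -> str:
    n = x.bit_length() - 1                 # n = floor(log2 x)
    if n == 0:
        return unary(1)                    # gamma(1) = "0"
    offset = x - (1 << n)                  # x - 2^n, coded in n binary bits
    return unary(n + 1) + format(offset, f"0{n}b")

for x in [3, 5]:
    print(x, unary(x), gamma(x))           # 3: 110 101    5: 11110 11001
```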

Page 46: Text compression

• Compress the dictionaries
• Methods:
  – Fixed-length codes
  – Huffman coding
  – Ziv-Lempel codes

Page 47: Fixed-length codes

• Binary representations
  – ASCII
  – Representational power: 2^k symbols, where k is the number of bits

Page 48: Variable-length codes

• Alphabet (Morse code):
  A .-     N -.     0 -----
  B -...   O ---    1 .----
  C -.-.   P .--.   2 ..---
  D -..    Q --.-   3 ...--
  E .      R .-.    4 ....-
  F ..-.   S ...    5 .....
  G --.    T -      6 -....
  H ....   U ..-    7 --...
  I ..     V ...-   8 ---..
  J .---   W .--    9 ----.
  K -.-    X -..-
  L .-..   Y -.--
  M --     Z --..
• Demo:
  – http://www.scphillips.com/morse/

Page 49: Most frequent letters in English

• Some letters are used more frequently than others...
• Most frequent letters:
  – E T A O I N S H R D L U
• Demo:
  – http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
• Also, the most frequent bigrams:
  – TH HE IN ER AN RE ND AT ON NT

Page 50: Huffman coding

• Developed by David Huffman (1952)
• Averages 5 bits per character (37.5% compression relative to 8-bit fixed-length codes)
• Based on the frequency distributions of symbols
• Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols

Page 51: Huffman coding - example

Symbol   Frequency
A        7
B        4
C        10
D        5
E        2
F        11
G        15
H        3
I        7
J        8
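A sketch of the tree-building algorithm over these frequencies, using a priority queue (ours, not the slides' code; tie-breaking among equal frequencies is arbitrary, so the resulting codewords may not match the slides' bit-for-bit even though the code lengths remain optimal):

```python
# Build a Huffman code from the frequency table above.
import heapq

freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}

# Heap entries: (frequency, tiebreak, {symbol: code-so-far}).
heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
heapq.heapify(heap)
tiebreak = len(heap)
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
    f2, _, right = heapq.heappop(heap)
    merged = {s: "0" + c for s, c in left.items()}          # left branch: 0
    merged.update({s: "1" + c for s, c in right.items()})   # right branch: 1
    heapq.heappush(heap, (f1 + f2, tiebreak, merged))
    tiebreak += 1

codes = heap[0][2]
for symbol in sorted(codes):
    print(symbol, codes[symbol])   # frequent symbols (G, F) get short codes
```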

Page 52: The Huffman tree

[Figure: the Huffman tree built from these frequencies, with each branch labeled 0 or 1 and the symbols a-j at the leaves.]

Page 53: Huffman coding - example

Symbol   Code
A        0110
B        0010
C        000
D        0011
E        01110
F        010
G        10
H        01111
I        110
J        111

Page 54: Exercise

• Consider the bit string: 01101101111000100110001110100111000110101101011101
• Use the Huffman code from the example to decode it.
• Why does this work: no codeword is a prefix of any other.
• Try inserting, deleting, and switching some bits at random locations and try decoding again.
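A sketch of a prefix-code decoder for the exercise (ours, not from the slides; it silently drops any trailing bits that do not complete a codeword):

```python
# Decode a bit string with the Huffman code from the example: read bits
# until the buffer matches a codeword, emit that symbol, and repeat.
codes = {"A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
         "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111"}
decode_table = {c: s for s, c in codes.items()}

def decode(bits: str) -> str:
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_table:      # prefix-freeness makes this unambiguous
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

print(decode("01101101111000100110001110100111000110101101011101"))
```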

Page 55: Extensions

• Word-based

• Domain/genre dependent models


Page 56: Links on text compression

• Data compression:
  – http://www.data-compression.info/
• Calgary corpus:
  – http://links.uwaterloo.ca/calgary.corpus.html
• Huffman coding:
  – http://www.compressconsult.com/huffman/
  – http://en.wikipedia.org/wiki/Huffman_coding
• LZ:
  – http://en.wikipedia.org/wiki/LZ77

Page 57: Open Source IR Toolkits

• Smart (Cornell)
• MG (RMIT & Melbourne, Australia; Waikato, New Zealand)
• Lemur (CMU / Univ. of Massachusetts)
• Terrier (Glasgow)
• Clair (University of Michigan)
• Lucene (open source)
• Ivory (University of Maryland; cloud computing)

Page 58: Smart

• The most influential IR system/toolkit
• Developed at Cornell since the 1960s
• Vector space model with lots of weighting options
• Written in C
• The Cornell/AT&T groups have used the Smart system to achieve top TREC performance

Page 59: MG

• A highly efficient toolkit for retrieval of text and images
• Developed by people at Univ. of Waikato, Univ. of Melbourne, and RMIT in the 1990s
• Written in C, running on Unix
• Vector space model with lots of compression and speed-up tricks
• People have used it to achieve good TREC performance

Page 60: Lemur/Indri

• An IR toolkit emphasizing language models
• Developed at CMU and Univ. of Massachusetts in the 2000s
• Written in C++, highly extensible
• Vector space and probabilistic models, including language models
• Achieves good TREC performance with a simple language model

Page 61: Terrier

• A large-scale retrieval toolkit with lots of applications (e.g., desktop search) and TREC support
• Developed at the University of Glasgow, UK
• Written in Java, open source
• "Divergence from randomness" retrieval model and other modern retrieval formulas

Page 62: Lucene

• Open-source IR toolkit
• Initially developed by Doug Cutting in Java
• Has since been ported to some other languages
• Good for building IR/Web applications
• Many applications have been built using Lucene (e.g., the Nutch search engine)
• As of these slides (2010), its retrieval algorithms have poor accuracy

Page 63: What You Should Know

• What an inverted index is
• Why an inverted index helps make search fast
• How to construct a large inverted index
• How to preprocess documents to reduce the index terms
• How to compress an index
• IR toolkits