(C) 2003, The University of Michigan 1

Information Retrieval

Handout #4

January 28, 2005


Course Information

• Instructor: Dragomir R. Radev

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: M 11-12 & Th 12-1 or via email

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Fridays, 2:10-4:55 PM in 409 West Hall


Arithmetic coding


Arithmetic coding

• Uses probabilities

• Achieves about 2.5 bits per character – close to optimal

• (Rissanen and Langdon 1979; Witten, Neal, and Cleary 1987)


Symbol       Initial  After a  After ab  After aba  After abac  After abacu  After abacus
a            1/5      2/6      2/7       3/8        3/9         3/10         3/11
b            1/5      1/6      2/7       2/8        2/9         2/10         2/11
c            1/5      1/6      1/7       1/8        2/9         2/10         2/11
s            1/5      1/6      1/7       1/8        1/9         1/10         2/11
u            1/5      1/6      1/7       1/8        1/9         2/10         2/11
Upper bound  1.000    0.200    0.1000    0.076190   0.073809    0.073809     0.073795
Lower bound  0.000    0.000    0.0666    0.066666   0.072619    0.073767     0.073781
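The interval narrowing in the table can be sketched in a few lines of Python. This is a minimal illustration, not the handout's own code: it assumes an adaptive model in which every symbol starts with count 1, the symbols are laid out in the order a, b, c, s, u from the bottom of the interval upward, and a symbol's count is incremented after it is encoded. The function name `adaptive_intervals` is hypothetical. Under these assumptions the bounds through "After abac" agree with the table up to rounding; the last two columns depend on details the table does not state, so values there may differ.

```python
from fractions import Fraction


def adaptive_intervals(message, alphabet):
    """Return the [low, high) interval after each encoded symbol.

    Assumption: adaptive counts start at 1 and are updated after each symbol;
    symbols are stacked in `alphabet` order starting from the lower bound.
    """
    counts = {symbol: 1 for symbol in alphabet}
    low, high = Fraction(0), Fraction(1)
    history = [(low, high)]
    for ch in message:
        total = sum(counts.values())
        width = high - low
        # Total count of the symbols that sit below `ch` in the interval.
        below = sum(counts[s] for s in alphabet[:alphabet.index(ch)])
        high = low + width * Fraction(below + counts[ch], total)
        low = low + width * Fraction(below, total)
        counts[ch] += 1  # adapt the model only after encoding
        history.append((low, high))
    return history


if __name__ == "__main__":
    for low, high in adaptive_intervals("abacus", "abcsu"):
        print(f"{float(low):.6f}  {float(high):.6f}")
```

Exact fractions are used so that the sketch does not depend on floating-point rounding; a practical coder would use fixed-precision integer arithmetic instead.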


Exercise

• Assuming the alphabet consists of a, b, and c, develop arithmetic encodings for the following strings:

aaa aababa baaabc cabcba bac


Stemming


Goals

• Motivation:
– Computer, computers, computerize, computational, computerization
– User, users, using, used

• Representing related words as one token

• Simplify matching

• Reduce storage and computation

• Also known as: term conflation


Methods

• Manual (tables)
– Achievement -> achiev
– Achiever -> achiev
– Etc.

• Affix removal (Harman 1991, Frakes 1992)
– If a word ends in “ies” but not “eies” or “aies”, then “ies” -> “y”
– If a word ends in “es” but not “aes”, “ees”, or “oes”, then “es” -> “e”
– If a word ends in “s” but not “us” or “ss”, then “s” -> NULL
– (apply only the first applicable rule)
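The three affix-removal rules translate almost directly into code. The sketch below is one possible reading, not a reference implementation: the function name `s_stem` is hypothetical, and it assumes "first applicable rule" means the first rule whose suffix matches and whose exceptions do not.

```python
def s_stem(word):
    """Apply the first applicable plural-removal rule (assumed reading)."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees", or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


if __name__ == "__main__":
    for word in ["ponies", "houses", "cats", "glass", "corpus"]:
        print(word, "->", s_stem(word))
```

Note that under this reading a word such as "trees" falls through rule 2 (because of the "ees" exception) and is handled by rule 3.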


Porter’s algorithm (Porter 1980)

• Home page:
– http://www.tartarus.org/~martin/PorterStemmer

• Reading assignment:
– http://www.tartarus.org/~martin/PorterStemmer/def.txt

• Consonant-vowel sequences:
– CVCV ... C
– CVCV ... V
– VCVC ... C
– VCVC ... V
– Shorthand: [C]VCVC ... [V]


Porter’s algorithm (cont’d)

• [C](VC){m}[V]

• {m} indicates that VC is repeated m times (m is the measure of the word)

• Examples:
– m=0: TR, EE, TREE, Y, BY
– m=1: TROUBLE, OATS, TREES, IVY
– m=2: TROUBLES, PRIVATE, OATEN

• Conditions:
– *S - the stem ends with S (and similarly for the other letters).
– *v* - the stem contains a vowel.
– *d - the stem ends with a double consonant (e.g. -TT, -SS).
– *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
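The measure m and the conditions above can be computed from a word's consonant/vowel pattern. The following is a small sketch under one stated assumption taken from Porter's definition: a, e, i, o, u are vowels, and y counts as a vowel only when it follows a consonant. The helper names (`cv_pattern`, `measure`, and so on) are illustrative, not from the handout.

```python
def cv_pattern(stem):
    """Map each letter to 'C' or 'V' (y is a vowel only after a consonant)."""
    pattern = ""
    for i, ch in enumerate(stem.lower()):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "C")
        pattern += "V" if is_vowel else "C"
    return pattern


def measure(stem):
    """m in [C](VC){m}[V]: the number of vowel-run-to-consonant-run transitions."""
    p = cv_pattern(stem)
    return sum(1 for a, b in zip(p, p[1:]) if a == "V" and b == "C")


def contains_vowel(stem):
    """Condition *v*."""
    return "V" in cv_pattern(stem)


def ends_double_consonant(stem):
    """Condition *d."""
    s = stem.lower()
    return len(s) >= 2 and s[-1] == s[-2] and cv_pattern(s)[-1] == "C"


def ends_cvc(stem):
    """Condition *o: ends consonant-vowel-consonant, last consonant not w, x, y."""
    s = stem.lower()
    return len(s) >= 3 and cv_pattern(s)[-3:] == "CVC" and s[-1] not in "wxy"


if __name__ == "__main__":
    for word in ["TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "OATEN"]:
        print(word, measure(word))
```

Running it on the example words should reproduce the m values listed above.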


Step 1a
SSES -> SS        caresses -> caress
IES  -> I         ponies -> poni
                  ties -> ti
SS   -> SS        caress -> caress
S    ->           cats -> cat

Step 1b
(m>0) EED -> EE   feed -> feed
                  agreed -> agree
(*v*) ED  ->      plastered -> plaster
                  bled -> bled
(*v*) ING ->      motoring -> motor
                  sing -> sing

Step 1b1
If the second or third of the rules in Step 1b is successful, the following is done:
AT -> ATE         conflat(ed) -> conflate
BL -> BLE         troubl(ed) -> trouble
IZ -> IZE         siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter
                  hopp(ing) -> hop
                  tann(ed) -> tan
                  fall(ing) -> fall
                  hiss(ing) -> hiss
                  fizz(ed) -> fizz
(m=1 and *o) -> E
                  fail(ing) -> fail
                  fil(ing) -> file
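Steps 1a, 1b, and 1b1 can be sketched as follows. This is an illustrative reading of the rules listed above rather than a complete or reference Porter stemmer; it assumes lowercase input and that, within a step, only the longest matching suffix is considered. All function names are hypothetical.

```python
def cv_pattern(stem):
    """'C'/'V' per letter; y is a vowel only after a consonant."""
    pattern = ""
    for i, ch in enumerate(stem):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "C")
        pattern += "V" if is_vowel else "C"
    return pattern


def measure(stem):
    p = cv_pattern(stem)
    return sum(1 for a, b in zip(p, p[1:]) if a == "V" and b == "C")


def step1a(word):
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]          # IES -> I
    if word.endswith("ss"):
        return word               # SS -> SS
    if word.endswith("s"):
        return word[:-1]          # S -> (nothing)
    return word


def step1b1(stem):
    """Clean-up applied after ED or ING has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    p = cv_pattern(stem)
    if len(stem) >= 2 and stem[-1] == stem[-2] and p[-1] == "C" and stem[-1] not in "lsz":
        return stem[:-1]          # double consonant -> single letter
    if measure(stem) == 1 and p[-3:] == "CVC" and stem[-1] not in "wxy":
        return stem + "e"         # (m=1 and *o) -> E
    return stem


def step1b(word):
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # The suffix is removed only if the remaining stem contains a vowel.
            return step1b1(stem) if "V" in cv_pattern(stem) else word
    return word


if __name__ == "__main__":
    for word in ["caresses", "ponies", "cats"]:
        print(word, "->", step1a(word))
    for word in ["agreed", "plastered", "hopping", "filing"]:
        print(word, "->", step1b(word))
```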


Step 1c
(*v*) Y -> I              happy -> happi
                          sky -> sky

Step 2
(m>0) ATIONAL -> ATE      relational -> relate
(m>0) TIONAL  -> TION     conditional -> condition
                          rational -> rational
(m>0) ENCI    -> ENCE     valenci -> valence
(m>0) ANCI    -> ANCE     hesitanci -> hesitance
(m>0) IZER    -> IZE      digitizer -> digitize
(m>0) ABLI    -> ABLE     conformabli -> conformable
(m>0) ALLI    -> AL       radicalli -> radical
(m>0) ENTLI   -> ENT      differentli -> different
(m>0) ELI     -> E        vileli -> vile
(m>0) OUSLI   -> OUS      analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION   -> ATE      predication -> predicate
(m>0) ATOR    -> ATE      operator -> operate
(m>0) ALISM   -> AL       feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
(m>0) ALITI   -> AL       formaliti -> formal
(m>0) IVITI   -> IVE      sensitiviti -> sensitive
(m>0) BILITI  -> BLE      sensibiliti -> sensible
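Step 2 is naturally table-driven: find the longest suffix in the table that matches, then rewrite it only if the remaining stem satisfies m>0. The sketch below illustrates that pattern with the Step 2 rules listed above; the names `STEP2_RULES` and `apply_suffix_rules` are hypothetical, and the longest-match convention is an assumption taken from Porter's definition rather than from this slide.

```python
def measure(stem):
    """Porter's m, with y treated as a vowel only after a consonant."""
    pattern = ""
    for i, ch in enumerate(stem):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "C")
        pattern += "V" if is_vowel else "C"
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")


STEP2_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "alli": "al", "entli": "ent",
    "eli": "e", "ousli": "ous", "ization": "ize", "ation": "ate",
    "ator": "ate", "alism": "al", "iveness": "ive", "fulness": "ful",
    "ousness": "ous", "aliti": "al", "iviti": "ive", "biliti": "ble",
}


def apply_suffix_rules(word, rules, min_measure=0):
    """Rewrite the longest matching suffix if measure(stem) > min_measure."""
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > min_measure:
                return stem + rules[suffix]
            return word  # the longest match decides; shorter suffixes are not tried
    return word


if __name__ == "__main__":
    for word in ["relational", "conditional", "rational", "vietnamization"]:
        print(word, "->", apply_suffix_rules(word, STEP2_RULES))
```

The same helper could be reused for other table-driven steps by passing a different rule table and threshold, for instance `min_measure=1` for rules conditioned on m>1.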


Step 3
(m>0) ICATE -> IC         triplicate -> triplic
(m>0) ATIVE ->            formative -> form
(m>0) ALIZE -> AL         formalize -> formal
(m>0) ICITI -> IC         electriciti -> electric
(m>0) ICAL  -> IC         electrical -> electric
(m>0) FUL   ->            hopeful -> hope
(m>0) NESS  ->            goodness -> good

Step 4
(m>1) AL    ->            revival -> reviv
(m>1) ANCE  ->            allowance -> allow
(m>1) ENCE  ->            inference -> infer
(m>1) ER    ->            airliner -> airlin
(m>1) IC    ->            gyroscopic -> gyroscop
(m>1) ABLE  ->            adjustable -> adjust
(m>1) IBLE  ->            defensible -> defens
(m>1) ANT   ->            irritant -> irrit
(m>1) EMENT ->            replacement -> replac
(m>1) MENT  ->            adjustment -> adjust
(m>1) ENT   ->            dependent -> depend
(m>1 and (*S or *T)) ION ->   adoption -> adopt
(m>1) OU    ->            homologou -> homolog
(m>1) ISM   ->            communism -> commun
(m>1) ATE   ->            activate -> activ
(m>1) ITI   ->            angulariti -> angular
(m>1) OUS   ->            homologous -> homolog
(m>1) IVE   ->            effective -> effect
(m>1) IZE   ->            bowdlerize -> bowdler


Step 5a
(m>1) E ->                    probate -> probat
                              rate -> rate
(m=1 and not *o) E ->         cease -> ceas

Step 5b
(m > 1 and *d and *L) -> single letter
                              controll -> control
                              roll -> roll


Porter’s algorithm (cont’d)

Example: the word “duplicatable”

duplicat     rule 4
duplicate    rule 1b1
duplic       rule 3

Another rule in step 4, which would remove “ic”, cannot be applied, since only one rule from each step is allowed to be applied.

% cd /clair4/class/ir-w03/tf-idf
% ./stem.pl computers
computers comput


Porter’s algorithm

Computable     -> Comput
Intervention   -> Intervent
Retrieval      -> Retriev
Document       -> Docum
Representing   -> Repres
Representative -> Repres


Stemming

• Not always appropriate (e.g., proper names, titles)

• The same applies to casing (e.g., CAT vs. cat)


String matching


String matching methods

• Index-based

• Full or approximate
– E.g., theater = theatre


Index-based matching

• Inverted files

• Position-based inverted files

• Block-based inverted files

1 6 9 11 17 19 24 28 33 40 46 50 55 60

This is a text. A text has many words. Words are made from letters.

Text: 11, 19

Words: 33, 40

From: 55
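A position-based inverted file like the one above can be built in a single pass over the text. This is a minimal sketch, assuming words are maximal runs of ASCII letters, terms are lowercased, and positions are 1-based character offsets as in the example; the function name `build_positional_index` is hypothetical.

```python
import re
from collections import defaultdict


def build_positional_index(text):
    """Map each lowercased word to the 1-based offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        index[match.group().lower()].append(match.start() + 1)
    return dict(index)


if __name__ == "__main__":
    text = "This is a text. A text has many words. Words are made from letters."
    index = build_positional_index(text)
    for term in ("text", "words", "from"):
        print(term, index[term])
```

A block-based inverted file would store a block number (for example, `offset // block_size`) instead of the exact offset, trading precision for a smaller index.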


Inverted index (trie)

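[Figure: trie over the vocabulary; the root branches on l, m, t, w, and the m branch continues through a to d and n]

l: Letters: 60
m, a, d: Made: 50
m, a, n: Many: 28
t: Text: 11, 19
w: Words: 33, 40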
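A trie-based vocabulary can be sketched with nested dictionaries. This is only an illustration: unlike the figure, which appears to stop at the shortest distinguishing prefix, the sketch below stores every letter of each word for simplicity, and the "$" key used to hold the position list is an arbitrary choice. The names `trie_insert` and `trie_lookup` are hypothetical.

```python
def trie_insert(root, word, positions):
    """Walk or create one node per letter, then attach the position list."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node.setdefault("$", []).extend(positions)


def trie_lookup(root, word):
    """Return the positions stored for `word`, or [] if it is not indexed."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("$", [])


if __name__ == "__main__":
    vocabulary = {"letters": [60], "text": [11, 19], "words": [33, 40],
                  "made": [50], "many": [28]}
    root = {}
    for term, positions in vocabulary.items():
        trie_insert(root, term, positions)
    print(sorted(root))              # first-level branches
    print(trie_lookup(root, "many"))
```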


Sequential searching

• No indexing structure given

• Given: database d and search pattern p.

– Example: find “words” in the earlier example

• Brute force method– try all possible starting positions

– O(n) positions in the database and O(m) characters in the pattern so the total worst-case runtime is O(mn)

– Typical runtime is actually O(n) given that mismatches are easy to notice
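The brute-force method is short enough to write out in full. A minimal sketch follows; the function name `brute_force_search` is hypothetical, and it returns 0-based offsets (add 1 to compare with the positions used in the inverted-file example).

```python
def brute_force_search(database, pattern):
    """Try every starting position; return the 0-based offsets of all matches."""
    n, m = len(database), len(pattern)
    hits = []
    for start in range(n - m + 1):
        k = 0
        # Compare left to right and stop at the first mismatch, which is why
        # the typical cost per starting position is small.
        while k < m and database[start + k] == pattern[k]:
            k += 1
        if k == m:
            hits.append(start)
    return hits


if __name__ == "__main__":
    text = "This is a text. A text has many words. Words are made from letters."
    print(brute_force_search(text, "words"))
```

The O(mn) worst case arises on inputs such as a database of all "a" characters and a pattern of "a"s ending in "b", where every starting position matches almost the whole pattern before failing.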


Knuth-Morris-Pratt

• Average runtime similar to BF

• Worst-case runtime is linear: O(n) for the scan, plus O(m) to preprocess the pattern

• Idea: reuse knowledge

• Need preprocessing of the pattern


Knuth-Morris-Pratt (cont’d)

• Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)

database: ABC ABC ABC ABDAB ABCDABCDABDE

pattern: ABCDABD

index   0   1   2   3   4   5   6   7
char    A   B   C   D   A   B   D   –
pos    -1   0   0   0   0   1   2   0

1234567
ABCDABD
ABCDABD


Knuth-Morris-Pratt (cont’d)

[Figure: five successive alignments of the pattern ABCDABD under the database ABC ABC ABC ABDAB ABCDABCDABDE, each with a ^ marking the current comparison position]
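The preprocessing and the scan can be sketched as below. This is one common formulation, not necessarily the one used in class: it assumes the table convention of the example (entry 0 is -1, and entry i is the length of the longest proper prefix of the first i pattern characters that is also a suffix of them). The names `failure_table` and `kmp_search` are hypothetical, and offsets are 0-based.

```python
def failure_table(pattern):
    """table[i] = length of the longest proper border of pattern[:i]; table[0] = -1."""
    m = len(pattern)
    table = [0] * (m + 1)
    table[0] = -1
    cnd = 0  # length of the border currently being extended
    for pos in range(2, m + 1):
        while cnd > 0 and pattern[pos - 1] != pattern[cnd]:
            cnd = table[cnd]          # fall back to a shorter border
        if pattern[pos - 1] == pattern[cnd]:
            cnd += 1
        table[pos] = cnd
    return table


def kmp_search(text, pattern):
    """Return the 0-based offsets of all occurrences of pattern in text."""
    if not pattern:
        return []
    table = failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k]              # reuse what is already known to match
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k]              # allow overlapping occurrences
    return matches


if __name__ == "__main__":
    print(failure_table("ABCDABD"))
    print(kmp_search("ABC ABC ABC ABDAB ABCDABCDABDE", "ABCDABD"))
```

The text index `i` never moves backwards, which is what gives the linear worst-case bound.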


Boyer-Moore

• Used in text editors

• Demos
– http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html

– http://www.blarg.com/~doyle/pages/bmi.html


Other methods

• The Soundex algorithm (Odell and Russell)

• Uses:
– spelling correction
– hash function
– non-recoverable


Word similarity

• Hamming distance - when words are of the same length

• Levenshtein distance - number of edits (insertions, deletions, replacements)
– color --> colour (1)
– survey --> surgery (2)
– com puter --> computer ?

• Longest common subsequence (LCS)
– lcs (survey, surgery) = surey
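The three measures can be sketched with standard dynamic programming. These are textbook formulations written for illustration, with hypothetical function names; `hamming` assumes equal-length inputs and raises an error otherwise.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # replace or keep
        prev = cur
    return prev[-1]


def lcs(a, b):
    """One longest common subsequence of a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    out, i, j = [], n, m
    while i > 0 and j > 0:          # walk back through the table
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(levenshtein("color", "colour"), levenshtein("survey", "surgery"))
    print(lcs("survey", "surgery"))
```

With this definition, "com puter" and "computer" are at distance 1 (one deletion of the space).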


The Soundex algorithm

1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions

2. Assign the following numbers to the remaining letters after the first:
b, f, p, v : 1
c, g, j, k, q, s, x, z : 2
d, t : 3
l : 4
m, n : 5
r : 6


The Soundex algorithm (cont’d)

3. If two or more letters with the same code were adjacent in the original name, omit all but the first

4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits

Examples:

Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300

These are the same codes as for Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively

Some problems: similar-sounding names can receive different codes, e.g. Rogers and Rodgers, Sinclair and StClair
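The four Soundex steps can be sketched as follows. This follows the rules as stated on these slides, in particular that duplicates are dropped only when the letters were adjacent in the original name (so the first letter's code counts, and any uncoded letter in between breaks the run); other Soundex variants differ on this point. The names `CODES` and `soundex` are hypothetical.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the letter-plus-three-digits code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev_code = CODES.get(letters[0])   # the first letter takes part in rule 3
    for ch in letters[1:]:
        code = CODES.get(ch)            # None for a, e, h, i, o, u, w, y
        if code is not None and code != prev_code:
            digits.append(code)
        prev_code = code
    # Pad with zeros or truncate to the form LDDD.
    return (letters[0].upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for name in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd",
                 "Rogers", "Rodgers"]:
        print(name, soundex(name))
```

Under these rules Rogers and Rodgers come out as R262 and R326, and Sinclair and StClair as S524 and S324, which illustrates the problem noted above.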