(C) 2003, The University of Michigan 1 Information Retrieval Handout #2 January 14, 2005


Information Retrieval

Handout #2

January 14, 2005


Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: M 11-12 & Th 12-1 or via email

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Fridays, 2:10-4:55 PM in 409 West Hall


Evaluation


Relevance

• Difficult to judge: fuzzy, inconsistent

• Methods: exhaustive, sampling, pooling, search-based


Contingency table

                retrieved     not retrieved
relevant        w             x                n1 = w + x
not relevant    y             z
                n2 = w + y                     N


Precision and Recall

Recall: $R = \frac{w}{w+x}$

Precision: $P = \frac{w}{w+y}$
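
A minimal sketch of the two measures in terms of the contingency-table cells (w: relevant and retrieved, x: relevant but not retrieved, y: retrieved but not relevant). The function name and the counts in the usage line are hypothetical, chosen only for illustration.

```python
def precision_recall(w, x, y):
    """Return (precision, recall) from contingency-table counts.

    w: relevant and retrieved
    x: relevant but not retrieved
    y: retrieved but not relevant
    """
    # Guard against empty denominators (nothing retrieved / nothing relevant).
    precision = w / (w + y) if (w + y) else 0.0
    recall = w / (w + x) if (w + x) else 0.0
    return precision, recall


# Hypothetical counts: 30 relevant retrieved, 20 relevant missed, 10 false hits.
p, r = precision_recall(w=30, x=20, y=10)
print(f"precision={p:.2f} recall={r:.2f}")
```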


Exercise

Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Tolkien”, +“JRR Tolkien” +“Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by Google.

Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.

Later, try different queries.


n    Doc. no   Relevant?   Recall   Precision
1    588       x           0.2      1.00
2    589       x           0.4      1.00
3    576                   0.4      0.67
4    590       x           0.6      0.75
5    986                   0.6      0.60
6    592       x           0.8      0.67
7    984                   0.8      0.57
8    988                   0.8      0.50
9    578                   0.8      0.44
10   985                   0.8      0.40
11   103                   0.8      0.36
12   591                   0.8      0.33
13   772       x           1.0      0.38
14   990                   1.0      0.36

[From Salton’s book]
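
The recall and precision columns of the ranked list can be recomputed from the relevance marks alone. A small sketch, assuming five relevant documents in total (the recall column reaches 1.0 at the fifth relevant hit); the function name is hypothetical.

```python
def pr_at_each_rank(relevant_flags, total_relevant):
    """Return (rank, recall, precision) after each position of a ranked list."""
    rows = []
    hits = 0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        hits += is_relevant
        rows.append((rank, hits / total_relevant, hits / rank))
    return rows


# 1 marks a relevant document ("x" in the table), 0 a non-relevant one.
flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
for rank, recall, precision in pr_at_each_rank(flags, total_relevant=5):
    print(f"{rank:2d}  R={recall:.1f}  P={precision:.2f}")
```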


P/R graph

[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1)]

• Interpolated average precision (e.g., 11pt)

• Interpolation – what is precision at recall=0.5?
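
One way to answer the interpolation question is sketched below. It assumes the common convention that interpolated precision at recall level r is the highest precision observed at any recall of at least r; the slide does not say which rule is intended. The relevance marks are those of the ranked list from Salton's book, and the helper names are hypothetical.

```python
def interpolated_precision(points, recall_level):
    """Highest precision at any observed recall >= recall_level (0.0 if none)."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)


# Relevance marks of the 14 ranked documents; 5 relevant documents in total.
flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
total_relevant = 5

# Build the observed (recall, precision) points, one per rank.
points = []
hits = 0
for rank, is_relevant in enumerate(flags, start=1):
    hits += is_relevant
    points.append((hits / total_relevant, hits / rank))

p_at_half = interpolated_precision(points, 0.5)

# 11-point average: recall levels 0.0, 0.1, ..., 1.0.
levels = [i / 10 for i in range(11)]
eleven_pt = sum(interpolated_precision(points, r) for r in levels) / len(levels)

print(f"P at R=0.5: {p_at_half:.2f}")
print(f"11-pt interpolated average precision: {eleven_pt:.3f}")
```

Under this convention, recall 0.5 falls between two observed recall values (0.4 and 0.6), so the precision reported is the one first reached at recall 0.6.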


Issues

• Why not use accuracy A = (w+z)/N?

• Average precision

• Average P at given “document cutoff values”

• Report when P = R

• F measure: $F = \frac{(\beta^2+1)PR}{\beta^2 P + R}$

• F1 measure: $F_1 = \frac{2}{1/R + 1/P}$: harmonic mean of P and R

• When do F and F1 report the wrong results?
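
A short sketch of the F measure with a weighting parameter beta, where beta = 1 gives the harmonic mean F1. The function name and the example values of P and R are hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted F measure; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Hypothetical values: P = 0.4, R = 0.8.
print(f"F1 = {f_measure(0.4, 0.8):.3f}")
print(f"F2 = {f_measure(0.4, 0.8, beta=2.0):.3f}")
```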


Kappa

• N: number of items (index i)

• n: number of categories (index j)

• k: number of annotators

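$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$

$P(A) = \frac{1}{N k (k-1)} \sum_{i=1}^{N} \sum_{j=1}^{n} m_{ij}^{2} - \frac{1}{k-1}$

$P(E) = \sum_{j=1}^{n} \left( \frac{1}{N k} \sum_{i=1}^{N} m_{ij} \right)^{2}$

where $m_{ij}$ is the number of annotators who assign item i to category j.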


Kappa example (from Manning, Schuetze, Raghavan)

        J1+    J1-
J2+     300    10
J2-     20     70


Kappa (cont’d)

• P(A) = 370/400 = 0.925

• P(-) = (10+20+70+70)/800 = 0.2125

• P(+) = (10+20+300+300)/800 = 0.7875

• P(E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665

• K = (0.925-0.665)/(1-0.665) = 0.776

• Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
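
The worked example can be checked with a few lines of code. This is a sketch for the two-judge, two-category case only, using marginals pooled over both judges as in the example; the function and argument names are hypothetical.

```python
def kappa_two_judges(both_yes, j1_yes_j2_no, j1_no_j2_yes, both_no):
    """Kappa for two judges and two categories, with pooled marginals."""
    n = both_yes + j1_yes_j2_no + j1_no_j2_yes + both_no
    # Observed agreement: both judges gave the same label.
    p_agree = (both_yes + both_no) / n
    # Pooled probability of a "+" judgment over all 2n judgments.
    p_yes = (2 * both_yes + j1_yes_j2_no + j1_no_j2_yes) / (2 * n)
    p_no = 1 - p_yes
    # Agreement expected by chance.
    p_chance = p_yes ** 2 + p_no ** 2
    return (p_agree - p_chance) / (1 - p_chance)


# Counts from the example table: J1+/J2+ = 300, J1+/J2- = 20, J1-/J2+ = 10, J1-/J2- = 70.
print(f"kappa = {kappa_two_judges(300, 20, 10, 70):.3f}")
```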


Relevance collections

• TREC ad hoc collections, 2-6 GB

• TREC Web collections, 2-100GB


Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles

<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?

<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158 LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048 LA051689-0139 FT944-9371 LA032390-0172

LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208


<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>


TREC (cont’d)

• http://trec.nist.gov/tracks.html

• http://trec.nist.gov/presentations/presentations.html


Word distribution models


Shakespeare

• Romeo and Juliet:

• And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60;

• …

• A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;

http://www.mta75.org/curriculum/english/Shakes/indexx.html


The BNC (Adam Kilgarriff)

Rank   Frequency   Word   POS
1      6187267     the    det
2      4239632     be     v
3      3093444     of     prep
4      2687863     and    conj
5      2186369     a      det
6      1924315     in     prep
7      1620850     to     infinitive-marker
8      1375636     have   v
9      1090186     it     pron
10     1039323     to     prep
11     887877      for    prep
12     884599      i      pron
13     760399      that   conj
14     695498      you    pron
15     681255      he     pron
16     680739      on     prep
17     675027      with   prep
18     559596      do     v
19     534162      at     prep
20     517171      by     prep

Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2) 1997. Pp 135--155


Stop lists

• 250-300 most common words in English account for 50% or more of a given text.

• Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.

• Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).

• Token/type ratio: 2256/859 = 2.63
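
Counts like these can be obtained with a short script. The sketch below uses a deliberately crude tokenizer (lowercase runs of letters and apostrophes), which is an assumption; a different tokenizer would give somewhat different numbers for Moby Dick. The sample sentence is made up and the text is assumed to be non-empty.

```python
import re
from collections import Counter


def type_token_stats(text, top_k):
    """Return (types, tokens, token/type ratio, share of tokens covered by top_k types)."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_k))
    return len(counts), len(tokens), len(tokens) / len(counts), covered / len(tokens)


# Made-up sample text; replace with the contents of a real chapter.
sample = "the cat and the dog and the bird"
types, tokens, ratio, coverage = type_token_stats(sample, top_k=2)
print(f"{types} types, {tokens} tokens, ratio {ratio:.2f}, top-2 coverage {coverage:.1%}")
```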


Zipf’s law

Rank × Frequency ≈ Constant

Rank   Term   Freq.    Z        Rank   Term   Freq.    Z
1      the    69,971   0.070    6      in     21,341   0.128
2      of     36,411   0.073    7      that   10,595   0.074
3      and    28,852   0.086    8      is     10,099   0.081
4      to     26,149   0.104    9      was    9,816    0.088
5      a      23,237   0.116    10     he     9,543    0.095
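
The Z column appears to be rank times frequency divided by the corpus size. The sketch below recomputes it under the assumption that the corpus holds about 1,000,000 tokens; that figure is inferred from the table and not stated on the slide, so the last digit may differ from the printed values.

```python
def zipf_products(frequencies, corpus_size):
    """Return rank * frequency / corpus_size for a frequency-sorted list."""
    return [rank * freq / corpus_size
            for rank, freq in enumerate(frequencies, start=1)]


terms = ["the", "of", "and", "to", "a", "in", "that", "is", "was", "he"]
freqs = [69971, 36411, 28852, 26149, 23237, 21341, 10595, 10099, 9816, 9543]

# Assumed corpus size (not given on the slide).
N = 1_000_000

for term, z in zip(terms, zipf_products(freqs, N)):
    print(f"{term:5s} {z:.3f}")
```

If Zipf's law held exactly, every value would be the same; the drift from roughly 0.07 to 0.13 shows how approximate the law is at the top ranks.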


Zipf's law is fairly general!

• Frequency of accesses to web pages
– in particular the access counts on the Wikipedia page, with s approximately equal to 0.3
– page access counts on Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5

• Words in the English language
– for instance, in Shakespeare’s play Hamlet with s approximately 0.5

• Sizes of settlements

• Income distributions amongst individuals

• Size of earthquakes

• Notes in musical performances

http://en.wikipedia.org/wiki/Zipf's_law


Zipf’s law (cont’d)

• Limitations:
– Low and high frequencies
– Lack of convergence

• Power law with coefficient c = -1
– $y = k x^{c}$

• Li (1992) – typing words one letter at a time, including spaces


Heaps’ law

• Size of vocabulary: $V(n) = K n^{\beta}$

• In English, K is between 10 and 100, β is between 0.4 and 0.6.

[Figure: vocabulary size V(n) plotted against text size n]

http://en.wikipedia.org/wiki/Heaps%27_law
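
A quick numerical sketch of the vocabulary-growth formula, taking n as the number of tokens seen so far. The parameter values K = 50 and β = 0.5 are hypothetical picks from inside the ranges quoted for English, not fitted to any corpus.

```python
def heaps_vocabulary(n_tokens, k=50.0, beta=0.5):
    """Predicted vocabulary size K * n^beta for a text of n_tokens tokens."""
    return k * n_tokens ** beta


# Hypothetical parameters K = 50, beta = 0.5.
for n in (10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}  V(n) = {heaps_vocabulary(n):,.0f}")
```

With β = 0.5, a hundredfold increase in text size gives only a tenfold increase in vocabulary, which is the sublinear growth the law describes.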


Heaps’ law (cont’d)

• Related to Zipf’s law: generative models

• Zipf’s and Heaps’ law coefficients change with language

Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws’ Coefficients Depend on Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332–335.


Indexing


Methods

• Manual: e.g., Library of Congress subject headings, MeSH

• Automatic


LOC subject headings

http://www.loc.gov/catdir/cpso/lcco/lcco.html

A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)


Medicine

CLASS R - MEDICINE
Subclass R
R5-920          Medicine (General)
R5-130.5        General works
R131-687        History of medicine. Medical expeditions
R690-697        Medicine as a profession. Physicians
R702-703        Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97     Directories
R722-722.32     Missionary medicine. Medical missionaries
R723-726        Medical philosophy. Medical ethics
R726.5-726.8    Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5      Medical personnel and the public. Physician and the public
R728-733        Practice of medicine. Medical practice economics
R735-854        Medical education. Medical schools. Research
R855-855.5      Medical technology
R856-857        Biomedical engineering. Electronics. Instrumentation
R858-859.7      Computer applications to medicine. Medical informatics
R864            Medical records
R895-920        Medical physics. Medical radiology. Nuclear medicine


Finding the most frequent terms in a document

• Typically stop words: the, and, in

• Not content-bearing

• Terms vs. words

• Luhn’s method


Luhn’s method

[Figure: FREQUENCY (y-axis) plotted against WORDS (x-axis)]


Computing term salience

• Term frequency (TF)

• Document frequency (DF)

• Inverse document frequency (IDF)

$IDF(w) = \log \frac{N}{DF(w)}$
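
The three quantities can be sketched in a few lines, taking IDF(w) = log(N / DF(w)) with N the number of documents. Whitespace tokenization, the natural logarithm, and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def term_frequencies(doc):
    """TF: how often each term occurs in one document."""
    return Counter(doc.lower().split())


def document_frequencies(docs):
    """DF: in how many documents each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def inverse_document_frequencies(docs):
    """IDF(w) = log(N / DF(w)), with N the number of documents."""
    n_docs = len(docs)
    return {w: math.log(n_docs / d) for w, d in document_frequencies(docs).items()}


# Toy collection, made up for illustration.
docs = [
    "the shuttle reached orbit",
    "the telescope is in orbit",
    "the party leader spoke",
]
idf = inverse_document_frequencies(docs)
for word in ("the", "orbit", "shuttle"):
    print(f"{word:8s} idf = {idf[word]:.3f}")
```

A word that occurs in every document, such as "the" here, gets an IDF of zero, which is why stop words drop out of TF*IDF rankings.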


Scripts to compute tf and idf

cd /clair4/class/ir-w03/tf-idf
./tf.pl 053.txt | sort -nr +1 | more
./tfs.pl 053.txt | sort -nr +1 | more
./stem.pl reasonableness
./build-df.pl
./idf.pl | sort -n +2 | more


Applications of TFIDF

• Cosine similarity

• Indexing

• Clustering


Variants of TF*IDF

• E.g., Okapi (Robertson)

• TF/(k+TF)

• k is from 1 to 2
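
The effect of the TF/(k+TF) form is that repeated occurrences of a term count for less and less. The sketch below implements only the expression shown on the slide, not the full Okapi weighting (which also includes document-length normalization); the default k = 1.5 is a hypothetical pick from the stated range.

```python
def saturated_tf(tf, k=1.5):
    """Term-frequency component TF / (k + TF); grows with TF but stays below 1."""
    return tf / (k + tf)


# Hypothetical k = 1.5: each extra occurrence adds less than the previous one.
for tf in (1, 2, 5, 10, 100):
    print(f"TF = {tf:3d}  ->  {saturated_tf(tf):.3f}")
```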


Vector-based matching

• The cosine measure

$sim(D,C) = \frac{\sum_{k} (d_k \cdot c_k \cdot idf(k))}{\sqrt{\sum_{k} (d_k)^2 \cdot \sum_{k} (c_k)^2}}$
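
A sketch of the cosine measure with IDF weighting in the numerator. It assumes that D and C are given as term-to-weight dictionaries and that the denominator is the square root of the product of the two sums of squares; terms without an IDF entry default to 1.0, which is a choice made here for illustration. All names and values are hypothetical.

```python
import math


def idf_weighted_cosine(d, c, idf):
    """sim(D, C) = sum_k d_k * c_k * idf(k) / sqrt(sum_k d_k^2 * sum_k c_k^2)."""
    shared = set(d) & set(c)  # only shared terms contribute to the numerator
    # Assumption: a term with no IDF entry gets weight 1.0.
    numerator = sum(d[t] * c[t] * idf.get(t, 1.0) for t in shared)
    denominator = math.sqrt(sum(v * v for v in d.values()) *
                            sum(v * v for v in c.values()))
    return numerator / denominator if denominator else 0.0


# Hypothetical term weights for a document D and a cluster centroid C.
D = {"shuttle": 2.0, "space": 1.0}
C = {"shuttle": 1.0, "nato": 1.0}
print(f"sim = {idf_weighted_cosine(D, C, idf={}):.4f}")
```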


IDF: Inverse document frequency

N: number of documents
d_k: number of documents containing term k
f_ik: absolute frequency of term k in document i
w_ik: weight of term k in document i

idf_k = log2(N/d_k) + 1 = log2 N - log2 d_k + 1

TF * IDF is used for automated indexing and for topic discrimination:


Asian and European news

622.941  deng
306.835  china
196.725  beijing
153.608  chinese
152.113  xiaoping
124.591  jiang
108.777  communist
102.894  body
 85.173  party
 71.898  died
 68.820  leader
 43.402  state
 38.166  people

97.487  nato
92.151  albright
74.652  belgrade
46.657  enlargement
34.778  alliance
34.778  french
33.803  opposition
32.571  russia
14.095  government
 9.389  told
 9.154  would
 8.459  their
 6.059  which


Other topics

120.385  shuttle
 99.487  space
 90.128  telescope
 70.224  hubble
 59.992  rocket
 50.160  astronauts
 49.722  discovery
 47.782  canaveral
 47.782  cape
 40.889  mission
 35.778  florida
 27.063  center

74.652  compuserve
65.321  massey
55.989  salizzoni
29.996  bob
27.994  online
27.198  executive
15.890  interim
15.271  chief
11.647  service
11.174  second
 6.781  world
 6.315  president
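
Ranked term lists of this kind can be produced by weighting each term's frequency with its IDF and sorting. The sketch below assumes the weight w_ik = f_ik * idf_k with idf_k = log2(N/d_k) + 1 as defined earlier; the exact weighting behind the lists above is not stated, and the document frequencies and tokens in the example are made up.

```python
import math
from collections import Counter


def top_terms(doc_tokens, doc_freq, n_docs, k=5):
    """Rank a document's terms by f_ik * (log2(N / d_k) + 1), highest first."""
    tf = Counter(doc_tokens)
    weights = {term: freq * (math.log2(n_docs / doc_freq[term]) + 1)
               for term, freq in tf.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up collection statistics: 8 documents and per-term document frequencies.
doc_freq = {"the": 8, "shuttle": 1, "space": 2, "launch": 4}
tokens = ["the"] * 4 + ["shuttle"] * 2 + ["space"] * 2 + ["launch"]

for term, weight in top_terms(tokens, doc_freq, n_docs=8):
    print(f"{weight:7.3f} {term}")
```

Even though "the" is the most frequent token in this toy document, it ranks below the rarer content words once the IDF factor is applied.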


Software

• KEA: http://www.nzdl.org/Kea/

• Example:

– Paper: “Protocols for secure, atomic transaction execution in electronic commerce”

– Author: anonymity, atomicity, auction, electronic commerce, privacy, real-time, security, transaction

– Kea: atomicity, auction, customer, electronic commerce, intruder, merchant, protocol, security, third party, transaction