
Information Retrieval

Dragomir R. Radev
University of Michigan
radev@umich.edu

September 19, 2005

(C) 2005, The University of Michigan


About the instructor

• Dragomir R. Radev
• Associate Professor, University of Michigan
– School of Information
– Department of Electrical Engineering and Computer Science
– Department of Linguistics
• Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan
• Treasurer, North American Chapter of the ACL
• Ph.D., 1998, Computer Science, Columbia University
• Email: radev@umich.edu
• Home page: http://tangra.si.umich.edu/~radev


Introduction


IR systems

• Google

• Vivísimo

• AskJeeves

• NSIR

• Lemur

• MG

• Nutch


Examples of IR systems

• Conventional (library catalog). Search by keyword, title, author, etc.

• Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.

• Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, …).

• Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language.


Need for IR

• Advent of WWW - more than 8 Billion documents indexed on Google

• How much information? 200TB according to Lyman and Varian 2003. http://www.sims.berkeley.edu/research/projects/how-much-info/

• Search, routing, filtering

• User’s information need


Some definitions of Information Retrieval (IR)

Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”

Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”


Sample queries (from Excite)

In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running


Mappings and abstractions

Reality → Data

Information need → Query

From Korfhage’s book


Typical IR system

• (Crawling)

• Indexing

• Retrieval

• User interface


Key Terms Used in IR

• QUERY: a representation of what the user is looking for - can be a list of words or a phrase.

• DOCUMENT: an information entity that the user wants to retrieve

• COLLECTION: a set of documents

• INDEX: a representation of information that makes querying easier

• TERM: word or concept that appears in a document or a query


Documents


Documents

• Not just printed paper

• collections vs. documents

• data structures: representations

• Bag of words method

• document surrogates: keywords, summaries

• encoding: ASCII, Unicode, etc.


Document preprocessing

• Formatting

• Tokenization (Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc)

• Casing (cat vs. CAT)

• Stemming (computer, computation)

• Soundex


Document representations

• Term-document matrix (m x n)

• term-term matrix (m x m x n)

• document-document matrix (n x n)

• Example: 3,000,000 documents (n) with 50,000 terms (m)

• sparse matrices

• Boolean vs. integer matrices
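
To make the storage implications of the example concrete, here is a rough back-of-the-envelope sketch. The cell size of 4 bytes and the figure of 200 distinct terms per document are illustrative assumptions, not values from the lecture.

```python
# Back-of-the-envelope sizes for the term-document matrix in the example.
n_docs = 3_000_000    # n, from the slide
n_terms = 50_000      # m, from the slide

BYTES_PER_CELL = 4    # assumption: one 32-bit integer count per cell
TERMS_PER_DOC = 200   # assumption: distinct terms in a typical document

# A dense m x n matrix stores every cell, including the zeros.
dense_cells = n_terms * n_docs
dense_bytes = dense_cells * BYTES_PER_CELL

# A sparse layout stores only the nonzero (term, document) pairs.
nonzero_cells = n_docs * TERMS_PER_DOC
density = nonzero_cells / dense_cells

print(f"dense cells:   {dense_cells:,}")
print(f"dense size:    {dense_bytes / 1e9:,.0f} GB")
print(f"nonzero cells: {nonzero_cells:,} ({density:.2%} of the matrix)")
```

Under these assumptions well under 1% of the cells are nonzero, which is why sparse representations are the usual choice.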


Document representations

• Term-document matrix
– Evaluating queries (e.g., (AB)C)
– Storage issues

• Inverted files
– Storage issues
– Evaluating queries
– Advantages and disadvantages
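
As a minimal sketch of how an inverted file supports query evaluation, the code below maps each term to the set of documents containing it and answers Boolean queries with set operations. The function name `build_inverted_index` is hypothetical, the whitespace tokenizer is a simplifying assumption, and the four short documents are borrowed from the exercise later in this lecture.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "computer information retrieval",
    2: "computer retrieval",
    3: "information",
    4: "computer information",
}
index = build_inverted_index(docs)
all_ids = set(docs)

# AND is set intersection, OR is set union, NOT is complement.
both = index["information"] & index["retrieval"]
either = index["information"] | index["retrieval"]
not_computer = all_ids - index["computer"]
info_not_computer = index["information"] & not_computer

print(sorted(both), sorted(either), sorted(info_not_computer))
```

Only the posting sets of the query terms are touched, in contrast to scanning full rows of a term-document matrix.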


IR models


Major IR models

• Boolean

• Vector

• Probabilistic

• Language modeling

• Fuzzy retrieval

• Latent semantic indexing


Major IR tasks

• Ad-hoc

• Filtering and routing

• Question answering

• Spoken document retrieval

• Multimedia retrieval


Venn diagrams

[Figure: Venn diagram of two document sets D1 and D2, with regions labeled x, w, y, z]


Boolean model

[Figure: Venn diagram of two sets A and B]


restaurants AND (Mideastern OR vegetarian) AND inexpensive

Boolean queries

• What types of documents are returned?

• Stemming

• thesaurus expansion

• inclusive vs. exclusive OR

• confusing uses of AND and OR

dinner AND sports AND symphony

4 OF (Pentium, printer, cache, PC, monitor, computer, personal)


Boolean queries

• Weighting (Beethoven AND sonatas)

• precedence

coffee AND croissant OR muffin

raincoat AND umbrella OR sunglasses

• Use of negation: potential problems

• Conjunctive and Disjunctive normal forms

• Full CNF and DNF


Boolean model

• Partition

• Partial relevance?

• Operators: AND, NOT, OR, parentheses


Exercise

• D1 = “computer information retrieval”

• D2 = “computer retrieval”

• D3 = “information”

• D4 = “computer information”

• Q1 = “information retrieval”

• Q2 = “information ¬computer”


Exercise

0

1 Swift

2 Shakespeare

3 Shakespeare Swift

4 Milton

5 Milton Swift

6 Milton Shakespeare

7 Milton Shakespeare Swift

8 Chaucer

9 Chaucer Swift

10 Chaucer Shakespeare

11 Chaucer Shakespeare Swift

12 Chaucer Milton

13 Chaucer Milton Swift

14 Chaucer Milton Shakespeare

15 Chaucer Milton Shakespeare Swift

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
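
One way to check an answer to this exercise is to enumerate all sixteen rows mechanically. The sketch below assumes that each row number is a 4-bit code with Chaucer = 8, Milton = 4, Shakespeare = 2, and Swift = 1, which matches the listing above; the helper names are hypothetical.

```python
AUTHORS = ("chaucer", "milton", "shakespeare", "swift")
BITS = (8, 4, 2, 1)  # assumption: row number encodes the authors as a bitmask

def authors_in(row):
    """Decode a row number 0..15 into the set of authors it contains."""
    return {name for bit, name in zip(BITS, AUTHORS) if row & bit}

def matches(row):
    """Evaluate the query from the slide against one row."""
    present = authors_in(row)
    chaucer = "chaucer" in present
    milton = "milton" in present
    shakespeare = "shakespeare" in present
    swift = "swift" in present
    return ((chaucer or milton) and not swift) or (
        (not chaucer) and (swift or shakespeare)
    )

matching_rows = [row for row in range(16) if matches(row)]
print(matching_rows)
```

The rows that do not appear in `matching_rows` are the ones the query rejects.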


Stop lists

• 250-300 most common words in English account for 50% or more of a given text.

• Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.

• Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).

• Token/type ratio: 2256/859 = 2.63
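
The type and token counts above can be reproduced for any text with a few lines of code. This is a minimal sketch: the regular-expression tokenizer and the sample sentence are assumptions for illustration, and a different tokenizer would give somewhat different counts for Moby Dick.

```python
import re
from collections import Counter

def type_token_stats(text):
    """Return (number of types, number of tokens, per-type counts)."""
    # Assumption: a token is a run of letters or apostrophes, case-folded.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return len(counts), len(tokens), counts

sample = "the cat and the dog and the bird"  # hypothetical sample text
n_types, n_tokens, counts = type_token_stats(sample)

ratio = n_tokens / n_types
# Share of all tokens covered by the 2 most frequent types.
top_share = sum(c for _, c in counts.most_common(2)) / n_tokens

print(n_types, n_tokens, round(ratio, 2), round(top_share, 3))
```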


Vector models

[Figure: Doc 1, Doc 2, and Doc 3 drawn as vectors in a space whose axes are Term 1, Term 2, and Term 3]


Vector queries

• Each document is represented as a vector

• non-efficient representations (bit vectors)

• dimensional compatibility

W1 W2 W3 W4 W5 W6 W7 W8 W9 W10

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10


The matching process

• Document space

• Matching is done between a document and a query (or between two documents)

• distance vs. similarity

• Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.


Miscellaneous similarity measures

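• The Cosine measure

$$\sigma(D,Q) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}} = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2 \cdot \sum_i q_i^2}}$$

• The Jaccard coefficient

$$\sigma(D,Q) = \frac{|X \cap Y|}{|X \cup Y|}$$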


Exercise

• Compute the cosine measures (D1,D2) and (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1>

• Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
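
The following sketch can be used to check answers to this exercise. The cosine, Euclidean, and Manhattan functions follow their usual definitions. The Jaccard coefficient is defined on sets, so for these numeric vectors the code uses the weighted (min/max) generalization; that choice is an assumption, not something the slides specify.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_jaccard(a, b):
    # Assumption: generalized Jaccard for non-negative vectors,
    # sum of element-wise minima over sum of element-wise maxima.
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

D1, D2, D3 = (1, 3), (100, 300), (3, 1)

for name, other in (("D2", D2), ("D3", D3)):
    print(
        f"D1 vs {name}: cosine={cosine(D1, other):.3f} "
        f"euclidean={euclidean(D1, other):.3f} "
        f"manhattan={manhattan(D1, other)} "
        f"jaccard={weighted_jaccard(D1, other):.3f}"
    )
```

D2 points in the same direction as D1 and differs only in length, so the cosine treats the two as identical while the distance measures place them far apart.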


Evaluation


Relevance

• Difficult to change: fuzzy, inconsistent

• Methods: exhaustive, sampling, pooling, search-based


Contingency table

                 retrieved     not retrieved
relevant         w             x                n1 = w + x
not relevant     y             z
                 n2 = w + y                     N


Precision and Recall

Recall:

$$R = \frac{w}{w+x}$$

Precision:

$$P = \frac{w}{w+y}$$


Exercise

Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Melville”, +”JRR Tolkien” +”Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by Google.

Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.

Later, try different queries.


n     Doc. no   Relevant?   Recall   Precision
1     588       x           0.2      1.00
2     589       x           0.4      1.00
3     576                   0.4      0.67
4     590       x           0.6      0.75
5     986                   0.6      0.60
6     592       x           0.8      0.67
7     984                   0.8      0.57
8     988                   0.8      0.50
9     578                   0.8      0.44
10    985                   0.8      0.40
11    103                   0.8      0.36
12    591                   0.8      0.33
13    772       x           1.0      0.38
14    990                   1.0      0.36

[From Salton’s book]
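
The recall and precision columns of this table can be recomputed from the relevance marks alone. The sketch below assumes that the five marked documents are all of the relevant documents in the collection, which is what a recall of 0.2 after the first hit implies. The `interpolated_precision` helper is hypothetical and uses one common convention (the highest precision at any rank whose recall is at least the requested level); the slides do not state which convention is intended.

```python
# 1 = relevant ("x" in the table), 0 = not relevant, in ranked order.
relevant_flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
total_relevant = sum(relevant_flags)  # assumption: no relevant docs outside the list

rows = []  # (rank, recall, precision)
hits = 0
for rank, flag in enumerate(relevant_flags, start=1):
    hits += flag
    rows.append((rank, hits / total_relevant, hits / rank))

def interpolated_precision(rows, recall_level):
    """Highest precision at any rank whose recall is >= recall_level."""
    return max((p for _, r, p in rows if r >= recall_level), default=0.0)

for rank, recall, precision in rows:
    print(f"{rank:2d}  recall={recall:.1f}  precision={precision:.2f}")

print("interpolated P at R=0.5:", interpolated_precision(rows, 0.5))
```

Averaging `interpolated_precision` over the recall levels 0.0, 0.1, ..., 1.0 gives an 11-point interpolated average precision under the same convention.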


P/R graph

[Figure: precision (vertical axis, 0 to 1) plotted against recall (horizontal axis, 0 to 1)]

• Interpolated average precision (e.g., 11pt)
• Interpolation: what is precision at recall=0.5?


Issues

• Why not use accuracy A=(w+z)/N?
• Average precision
• Average P at given “document cutoff values”
• Report when P=R
• F measure: F=(β²+1)PR/(β²P+R)
• F1 measure: F1 = 2/(1/R+1/P): harmonic mean of P and R
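
As a small worked sketch, the F measure can be written as a function of precision, recall, and a weight β (called `beta` below) that controls how much recall counts relative to precision. The function name and the zero-division guard are assumptions added for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted F measure; beta=1 gives the harmonic mean of P and R."""
    if precision == 0 and recall == 0:
        return 0.0  # guard added here to avoid dividing by zero
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

P, R = 0.75, 0.6
f1 = f_measure(P, R)
harmonic_mean = 2 / (1 / R + 1 / P)

print(round(f1, 4), round(harmonic_mean, 4))
print(round(f_measure(P, R, beta=2.0), 4))  # beta > 1 weights recall more heavily
```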


Relevance collections

• TREC ad hoc collections, 2-6 GB

• TREC Web collections, 2-100GB


Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles

<desc> Description: Which are the most crashworthy, and least crashworthy, passenger vehicles?

<narr> Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158 LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048 LA051689-0139 FT944-9371 LA032390-0172

LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208


<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>


TREC (cont’d)

• http://trec.nist.gov/tracks.html
• http://trec.nist.gov/presentations/presentations.html


Word distribution models


Shakespeare

• Romeo and Juliet:

• And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60;

• …

• A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;

http://www.mta75.org/curriculum/english/Shakes/indexx.html


The BNC (Adam Kilgarriff)

Rank   Frequency   Word   Part of speech
1      6187267     the    det
2      4239632     be     v
3      3093444     of     prep
4      2687863     and    conj
5      2186369     a      det
6      1924315     in     prep
7      1620850     to     infinitive-marker
8      1375636     have   v
9      1090186     it     pron
10     1039323     to     prep
11     887877      for    prep
12     884599      i      pron
13     760399      that   conj
14     695498      you    pron
15     681255      he     pron
16     680739      on     prep
17     675027      with   prep
18     559596      do     v
19     534162      at     prep
20     517171      by     prep

Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2), 1997, pp. 135-155.



Zipf’s law

Rank × Frequency ≈ Constant

Rank   Term   Freq.    Z        Rank   Term   Freq.    Z
1      the    69,971   0.070    6      in     21,341   0.128
2      of     36,411   0.073    7      that   10,595   0.074
3      and    28,852   0.086    8      is     10,099   0.081
4      to     26,149   0.104    9      was    9,816    0.088
5      a      23,237   0.116    10     he     9,543    0.095
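
The Z column appears to be rank times frequency divided by the corpus size. The sketch below recomputes it under the assumption of a corpus of about one million tokens; that size is not stated on the slide, but it reproduces the listed values to within rounding.

```python
CORPUS_SIZE = 1_000_000  # assumption: roughly one million tokens in the corpus

# (term, frequency) in rank order, taken from the table.
freqs = [
    ("the", 69971), ("of", 36411), ("and", 28852), ("to", 26149), ("a", 23237),
    ("in", 21341), ("that", 10595), ("is", 10099), ("was", 9816), ("he", 9543),
]

# Zipf's law predicts that rank * frequency stays roughly constant.
z_values = [
    (rank, term, rank * freq / CORPUS_SIZE)
    for rank, (term, freq) in enumerate(freqs, start=1)
]

for rank, term, z in z_values:
    print(f"{rank:2d}  {term:5s} {z:.3f}")
```

The products stay within roughly a factor of two of each other across the top ten ranks, which is the sense in which rank times frequency is "constant" here.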


Zipf's law is fairly general!

• Frequency of accesses to web pages
– in particular the access counts on the Wikipedia page, with s approximately equal to 0.3
– page access counts on Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5

• Words in the English language
– for instance, in Shakespeare’s play Hamlet with s approximately 0.5

• Sizes of settlements
• Income distributions amongst individuals
• Size of earthquakes
• Notes in musical performances

http://en.wikipedia.org/wiki/Zipf's_law


Zipf’s law (cont’d)

• Limitations:
– Low and high frequencies
– Lack of convergence

• Power law with coefficient c = -1
– y = kx^c

• Li (1992) – typing words one letter at a time, including spaces


Indexing


Methods

• Manual: e.g., Library of Congress subject headings, MeSH

• Automatic


LOC subject headings

http://www.loc.gov/catdir/cpso/lcco/lcco.html

A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)


Medicine

CLASS R - MEDICINE
Subclass R
R5-920 Medicine (General)
R5-130.5 General works
R131-687 History of medicine. Medical expeditions
R690-697 Medicine as a profession. Physicians
R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97 Directories
R722-722.32 Missionary medicine. Medical missionaries
R723-726 Medical philosophy. Medical ethics
R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5 Medical personnel and the public. Physician and the public
R728-733 Practice of medicine. Medical practice economics
R735-854 Medical education. Medical schools. Research
R855-855.5 Medical technology
R856-857 Biomedical engineering. Electronics. Instrumentation
R858-859.7 Computer applications to medicine. Medical informatics
R864 Medical records
R895-920 Medical physics. Medical radiology. Nuclear medicine


Finding the most frequent terms in a document

• Typically stop words: the, and, in

• Not content-bearing

• Terms vs. words

• Luhn’s method


Luhn’s method

[Figure: plot with axes labeled WORDS and FREQUENCY]


Computing term salience

• Term frequency (TF)

• Document frequency (DF)

• Inverse document frequency (IDF)

$$IDF(w) = \log\frac{N}{DF(w)}$$


Applications of TFIDF

• Cosine similarity

• Indexing

• Clustering


Vector-based matching

• The cosine measure

$$sim(D,C) = \frac{\sum_k d_k \cdot c_k \cdot idf(k)}{\sqrt{\sum_k d_k^2 \cdot \sum_k c_k^2}}$$


IDF: Inverse document frequency

N: number of documents
d_k: number of documents containing term k
f_ik: absolute frequency of term k in document i
w_ik: weight of term k in document i

$$idf_k = \log_2(N/d_k) + 1 = \log_2 N - \log_2 d_k + 1$$

TF * IDF is used for automated indexing and for topic discrimination:
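
A minimal sketch of TF * IDF weighting follows, using the IDF definition on this slide (base-2 logarithm of N over the document frequency, plus 1). Taking the weight of a term in a document as its raw frequency times its IDF is an assumption, since the slide does not spell out the weight formula; the four toy documents and the function names are hypothetical.

```python
import math
from collections import Counter

def idf(n_docs, doc_freq):
    """IDF as defined on the slide: log2(N / d_k) + 1."""
    return math.log2(n_docs / doc_freq) + 1

def tfidf_weights(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumption: weight = raw term frequency * idf.
    return [
        {term: freq * idf(n_docs, df[term]) for term, freq in Counter(tokens).items()}
        for tokens in tokenized
    ]

docs = [  # hypothetical toy collection
    "deng xiaoping died in beijing",
    "nato enlargement talks in belgrade",
    "shuttle mission in florida",
    "nato talks in beijing",
]
weights = tfidf_weights(docs)

for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{w:.3f} {term}")
```

A word that occurs in every document, such as "in" here, receives the minimum weight, while words confined to one document rise to the top of the list.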


Asian and European news

622.941 deng
306.835 china
196.725 beijing
153.608 chinese
152.113 xiaoping
124.591 jiang
108.777 communist
102.894 body
85.173 party
71.898 died
68.820 leader
43.402 state
38.166 people

97.487 nato
92.151 albright
74.652 belgrade
46.657 enlargement
34.778 alliance
34.778 french
33.803 opposition
32.571 russia
14.095 government
9.389 told
9.154 would
8.459 their
6.059 which


Other topics

120.385 shuttle
99.487 space
90.128 telescope
70.224 hubble
59.992 rocket
50.160 astronauts
49.722 discovery
47.782 canaveral
47.782 cape
40.889 mission
35.778 florida
27.063 center

74.652 compuserve
65.321 massey
55.989 salizzoni
29.996 bob
27.994 online
27.198 executive
15.890 interim
15.271 chief
11.647 service
11.174 second
6.781 world
6.315 president


Compression


Compression

• Methods
– Fixed length codes
– Huffman coding
– Ziv-Lempel codes


Fixed length codes

• Binary representations
– ASCII
– Representational power (2^k symbols where k is the number of bits)


Variable length codes

• Alphabet:

A .-      N -.      0 -----
B -...    O ---     1 .----
C -.-.    P .--.    2 ..---
D -..     Q --.-    3 ...--
E .       R .-.     4 ....-
F ..-.    S ...     5 .....
G --.     T -       6 -....
H ....    U ..-     7 --...
I ..      V ...-    8 ---..
J .---    W .--     9 ----.
K -.-     X -..-
L .-..    Y -.--
M --      Z --..

• Demo:
– http://www.babbage.demon.co.uk/morse.html
– http://www.scphillips.com/morse/


Most frequent letters in English

• Most frequent letters:
– E T A O I N S H R D L U
– http://www.math.cornell.edu/~mec/modules/cryptography/subs/frequencies.html
• Demo:
– http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
• Also: bigrams:
– TH HE IN ER AN RE ND AT ON NT
– http://www.math.cornell.edu/~mec/modules/cryptography/subs/digraphs.html


Useful links about cryptography

• http://world.std.com/~franl/crypto.html

• http://www.faqs.org/faqs/cryptography-faq/

• http://en.wikipedia.org/wiki/Cryptography


Huffman coding

• Developed by David Huffman (1952)
• Average of 5 bits per character (37.5% compression relative to 8-bit characters)
• Based on frequency distributions of symbols
• Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
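The tree-building step can be sketched with a priority queue. This is a minimal illustration, not the course's own implementation: the function name `huffman_codes` and the tie-breaking counter are choices made for this example, and the frequencies are the ones from the worked example in these slides. Because ties can be broken in more than one way, individual codewords may differ from the table in the slides, although the total encoded length comes out the same.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string} for a dict of symbol frequencies."""
    if not freqs:
        return {}
    tie = count()  # tie-breaker so the heap never has to compare subtrees
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate one-symbol alphabet
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent subtree
        f2, _, right = heapq.heappop(heap)   # second least frequent subtree
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: a symbol (assumed not to be a tuple)
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


FREQS = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}

if __name__ == "__main__":
    codes = huffman_codes(FREQS)
    for sym in sorted(codes):
        print(sym, codes[sym])
    print("total bits:", sum(FREQS[s] * len(codes[s]) for s in FREQS))
```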


Symbol Frequency

A 7

B 4

C 10

D 5

E 2

F 11

G 15

H 3

I 7

J 8


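[Figure: Huffman tree built from the frequency table; each internal node has one branch labeled 0 and one labeled 1, and the leaves are the symbols a through j]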


Symbol Code

A 0110

B 0010

C 000

D 0011

E 01110

F 010

G 10

H 01111

I 110

J 111


Exercise

• Consider the bit string: 01101101111000100110001110100111000110101101011101

• Use the Huffman code from the example to decode it.

• Try inserting, deleting, and switching some bits at random locations and try decoding.
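One way to experiment with the exercise is a small decoder over the code table from the example. The sketch below is an illustration only; `decode`, `encode` and the sample message are names and data invented here. It returns any undecodable trailing bits separately, which makes the effect of inserted, deleted or flipped bits easy to see.

```python
CODE = {"A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
        "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111"}


def decode(bits, code=CODE):
    """Decode a bit string; return (symbols, leftover_bits)."""
    inverse = {word: sym for sym, word in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # prefix-free code: the first match is the only match
            out.append(inverse[buf])
            buf = ""
    return "".join(out), buf


def encode(text, code=CODE):
    """Concatenate the codewords of the symbols in text."""
    return "".join(code[ch] for ch in text)


if __name__ == "__main__":
    bits = encode("BADGE")
    print(bits, "->", decode(bits))
    # Flip one bit and observe how far the damage spreads.
    i = 3
    corrupted = bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]
    print(corrupted, "->", decode(corrupted))
```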


Ziv-Lempel coding

• Two types - one is known as LZ77 (used in GZIP)

• Code: set of triples <a,b,c>
• a: how far back in the decoded text to look for the upcoming text segment
• b: how many characters to copy
• c: new character to add to complete segment


• <0,0,p> p
• <0,0,e> pe
• <0,0,t> pet
• <2,1,r> peter
• <0,0,_> peter_
• <6,1,i> peter_pi
• <8,2,r> peter_piper
• <6,3,c> peter_piper_pic
• <0,0,k> peter_piper_pick
• <7,1,d> peter_piper_picked
• <7,1,a> peter_piper_picked_a
• <9,2,e> peter_piper_picked_a_pe
• <9,2,_> peter_piper_picked_a_peck_
• <0,0,o> peter_piper_picked_a_peck_o
• <0,0,f> peter_piper_picked_a_peck_of
• <17,5,l> peter_piper_picked_a_peck_of_pickl
• <12,1,d> peter_piper_picked_a_peck_of_pickled
• <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
• <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
• <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
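The decoding side of the triple scheme is short enough to sketch directly. The function name `lz77_decode` is an assumption of this example, and the triples are the ones from the slide; this covers decoding only, not the search an encoder would have to do.

```python
def lz77_decode(triples):
    """Rebuild text from (back, length, next_char) triples."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        for i in range(length):
            # Copy one character at a time so overlapping copies also work.
            out.append(out[start + i])
        out.append(ch)
    return "".join(out)


TRIPLES = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
           (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
           (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
           (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s")]

if __name__ == "__main__":
    print(lz77_decode(TRIPLES))
```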


Links on text compression

• Data compression:
– http://www.data-compression.info/
• Calgary corpus:
– http://links.uwaterloo.ca/calgary.corpus.html
• Huffman coding:
– http://www.compressconsult.com/huffman/
– http://en.wikipedia.org/wiki/Huffman_coding
• LZ
– http://en.wikipedia.org/wiki/LZ77


Relevance feedback and query expansion


Relevance feedback

• Problem: initial query may not be the most appropriate to satisfy a given information need.

• Idea: modify the original query so that it gets closer to the right documents in the vector space


Relevance feedback

• Automatic

• Manual

• Method: identifying feedback terms

Q’ = a1Q + a2R - a3N (R: relevant documents, N: non-relevant documents)

Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|


Example

• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - non-relevant
• R = ?
• N = ?
• Q’ = ?
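The example can be worked through mechanically. The sketch below assumes raw term counts as vector components and takes the relevant and non-relevant terms to be the summed term vectors of the judged documents; both are assumptions of this illustration, as is the function name `rocchio`.

```python
from collections import Counter


def rocchio(query, relevant, nonrelevant, a1=1.0):
    """Q' = a1*Q + a2*R - a3*N with a2 = 1/|R| and a3 = 1/|N|.

    R and N are taken to be the summed term-count vectors of the relevant
    and non-relevant documents (an assumption of this sketch).
    """
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    new_q = Counter()
    for term, w in Counter(query.split()).items():
        new_q[term] += a1 * w
    for doc in relevant:
        for term, w in Counter(doc.split()).items():
            new_q[term] += a2 * w
    for doc in nonrelevant:
        for term, w in Counter(doc.split()).items():
            new_q[term] -= a3 * w
    return dict(new_q)


if __name__ == "__main__":
    q_new = rocchio(
        "safety minivans",
        relevant=["car safety minivans tests injury statistics",
                  "liability tests safety"],
        nonrelevant=["car passengers injury reviews"],
    )
    for term, weight in sorted(q_new.items(), key=lambda kv: -kv[1]):
        print(f"{term:12s} {weight:+.2f}")
```

Terms with negative weight (here those that appear mainly in the non-relevant document) are often simply dropped from the expanded query in practice.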


Pseudo relevance feedback

• Automatic query expansion
– Thesaurus-based expansion (e.g., using latent semantic indexing – later…)
– Distributional similarity
– Query log mining


String matching


String matching methods

• Index-based

• Full or approximate
– E.g., theater = theatre


Index-based matching

• Inverted files

• Position-based inverted files

• Block-based inverted files

1 6 9 11 17 19 24 28 33 40 46 50 55 60

This is a text. A text has many words. Words are made from letters.

Text: 11, 19

Words: 33, 40

From: 55
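A position-based inverted file for the sample sentence can be sketched in a few lines. The function name `build_positional_index` and the choice to lowercase words are assumptions of this example; positions are 1-based character offsets, as in the slide.

```python
import re
from collections import defaultdict


def build_positional_index(text):
    """Map each lowercased word to the 1-based character offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        index[match.group().lower()].append(match.start() + 1)
    return dict(index)


TEXT = "This is a text. A text has many words. Words are made from letters."

if __name__ == "__main__":
    index = build_positional_index(TEXT)
    for word in ("text", "words", "from"):
        print(word, index[word])
```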


Inverted index (trie)

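[Figure: trie over the indexed words, branching on the letters l, m, t and w; the m branch continues through a to d and n]

l → Letters: 60
m → a → d → Made: 50
m → a → n → Many: 28
t → Text: 11, 19
w → Words: 33, 40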


Sequential searching

• No indexing structure given
• Given: database d and search pattern p.
– Example: find “words” in the earlier example
• Brute force method
– try all possible starting positions

– O(n) positions in the database and O(m) characters in the pattern so the total worst-case runtime is O(mn)

– Typical runtime is actually O(n) given that mismatches are easy to notice
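The brute force method translates almost directly into code. This is a sketch with an invented function name, `brute_force_search`, returning 0-based positions; the inner loop usually stops after one or two comparisons, which is why the typical runtime is close to O(n).

```python
def brute_force_search(d, p):
    """Return every 0-based position where pattern p occurs in database d."""
    n, m = len(d), len(p)
    hits = []
    for start in range(n - m + 1):               # O(n) candidate positions
        k = 0
        while k < m and d[start + k] == p[k]:    # up to O(m) comparisons each
            k += 1
        if k == m:
            hits.append(start)
    return hits


if __name__ == "__main__":
    text = "This is a text. A text has many words. Words are made from letters."
    print(brute_force_search(text.lower(), "words"))
```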


Knuth-Morris-Pratt

• Average runtime similar to BF

• Worst case runtime is linear: O(n) for the search, plus O(m) to preprocess the pattern

• Idea: reuse knowledge

• Need preprocessing of the pattern


Knuth-Morris-Pratt (cont’d)

• Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)

database: ABC ABC ABC ABDAB ABCDABCDABDE

pattern: ABCDABD

index   0  1  2  3  4  5  6  7
char    A  B  C  D  A  B  D  –
pos    -1  0  0  0  0  1  2  0

1234567
ABCDABD
ABCDABD


Knuth-Morris-Pratt (cont’d)

[Figure: five successive alignments of the pattern ABCDABD against the database ABC ABC ABC ABDAB ABCDABCDABDE, with a caret (^) marking the comparison position in each]
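A compact sketch of Knuth-Morris-Pratt follows, using the same table convention as the example (entry 0 is -1, and the table has one extra entry for the position past the end of the pattern). The function names `failure_table` and `kmp_search` are chosen for this illustration.

```python
def failure_table(p):
    """t[i] = length of the longest proper prefix of p[:i] that is also its suffix; t[0] = -1."""
    t = [-1] + [0] * len(p)
    k = -1
    for i in range(len(p)):
        while k >= 0 and p[k] != p[i]:
            k = t[k]          # fall back to a shorter border
        k += 1
        t[i + 1] = k
    return t


def kmp_search(text, p):
    """Return all 0-based positions where p occurs in text."""
    if not p:
        return []
    t = failure_table(p)
    hits, k = [], 0           # k = number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k >= 0 and p[k] != ch:
            k = t[k]          # reuse what is already known about the text
        k += 1
        if k == len(p):
            hits.append(i - k + 1)
            k = t[k]          # keep going to find overlapping matches
    return hits


if __name__ == "__main__":
    print(failure_table("ABCDABD"))
    print(kmp_search("ABC ABC ABC ABDAB ABCDABCDABDE", "ABCDABD"))
```

The text index `i` never moves backwards, which is where the linear worst case comes from.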


Word similarity

• Hamming distance - when words are of the same length

• Levenshtein distance - number of edits (insertions, deletions, replacements)
– color --> colour (1)
– survey --> surgery (2)
– com puter --> computer ?
• Longest common subsequence (LCS)
– lcs (survey, surgery) = surey
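Two of these measures are easy to sketch. The function names `hamming` and `lcs` are choices of this example; `lcs` uses the standard dynamic-programming table and returns one longest common subsequence (there can be several of the same length).

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def lcs(a, b):
    """One longest common subsequence of a and b."""
    # L[i][j] = length of the LCS of a[:i] and b[:j]
    L = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            L[i][j] = L[i - 1][j - 1] + 1 if x == y else max(L[i - 1][j], L[i][j - 1])
    # Walk back from the bottom-right corner to recover the subsequence.
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(hamming("theater", "theatre"))
    print(lcs("survey", "surgery"))
```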


Levenshtein edit distance

• Examples:
– Theatre -> theater
– Ghaddafi -> Qadafi
– Computer -> counter
• Edit distance (inserts, deletes, substitutions)
– Edit transcript

• Done through dynamic programming


Recurrence relation

• Three dependencies
– D(i,0) = i
– D(0,j) = j
– D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)]
• Simple edit distance:
– t(i,j) = 0 if S1(i) = S2(j), and 1 otherwise
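The recurrence fills a table row by row. The sketch below uses D(i-1,j)+1 for a deletion, D(i,j-1)+1 for an insertion and D(i-1,j-1)+t(i,j) for a match or substitution; the function name `edit_distance` is an assumption of this example.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit costs, by dynamic programming."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete all of s1[:i]
    for j in range(m + 1):
        D[0][j] = j                      # insert all of s2[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + t)   # match or substitution
    return D[n][m]


if __name__ == "__main__":
    for a, b in [("color", "colour"), ("survey", "surgery"), ("vintner", "writers")]:
        print(a, b, edit_distance(a, b))
```

Keeping the whole table, rather than just the last row, is what makes a traceback of the edit transcript possible.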


Example

Gusfield 1997

            W  R  I  T  E  R  S
         0  1  2  3  4  5  6  7
    0    0  1  2  3  4  5  6  7
V   1    1
I   2    2
N   3    3
T   4    4
N   5    5
E   6    6
R   7    7


Example (cont’d)

Gusfield 1997

            W  R  I  T  E  R  S
         0  1  2  3  4  5  6  7
    0    0  1  2  3  4  5  6  7
V   1    1  1  2  3  4  5  6  7
I   2    2  2  2  2  3  4  5  6
N   3    3  3  3  3  3  4  5  6
T   4    4  4  4  4  *
N   5    5
E   6    6
R   7    7


Tracebacks

Gusfield 1997

            W  R  I  T  E  R  S
         0  1  2  3  4  5  6  7
    0    0  1  2  3  4  5  6  7
V   1    1  1  2  3  4  5  6  7
I   2    2  2  2  2  3  4  5  6
N   3    3  3  3  3  3  4  5  6
T   4    4  4  4  4  *
N   5    5
E   6    6
R   7    7


Weighted edit distance

• Used to emphasize the relative cost of different edit operations

• Useful in bioinformatics
– Homology information
– BLAST
– Blosum
– http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html


• Web sites:
– http://www.merriampark.com/ld.htm
– http://odur.let.rug.nl/~kleiweg/lev/


Clustering


Clustering

• Exclusive/overlapping clusters

• Hierarchical/flat clusters

• The cluster hypothesis
– Documents in the same cluster are relevant to the same query


Representations for document clustering

• Typically: vector-based
– Words: “cat”, “dog”, etc.
– Features: document length, author name, etc.

• Each document is represented as a vector in an n-dimensional space

• Similar documents appear nearby in the vector space (distance measures are needed)


Hierarchical clustering

Dendrograms

E.g., language similarity:

http://odur.let.rug.nl/~kleiweg/clustering/clustering.html


Another example

• Kingdom = animal
• Phylum = Chordata
• Subphylum = Vertebrata
• Class = Osteichthyes
• Subclass = Actinopterygii
• Order = Salmoniformes
• Family = Salmonidae
• Genus = Oncorhynchus
• Species = Oncorhynchus kisutch (Coho salmon)


Clustering using dendrograms

REPEAT
    Compute pairwise similarities
    Identify closest pair
    Merge pair into single node
UNTIL only one node left

Q: what is the equivalent Venn diagram representation?

Example: cluster the following sentences:

A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A


Methods

• Single-linkage
– One common pair is sufficient
– Disadvantages: long chains
• Complete-linkage
– All pairs have to match
– Disadvantages: too conservative
• Average-linkage
• Centroid-based (online)
– Look at distances to centroids
• Demo:
– /clair4/class/ir-w05/clustering
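The difference between single and complete linkage shows up in the merge distances. The sketch below is a toy illustration on one-dimensional points using distance rather than similarity; the function name `agglomerative` and the sample data are assumptions of this example and have no connection to the demo directory above.

```python
def agglomerative(points, linkage="single"):
    """Merge clusters of 1-D points until one is left; return the merge log."""
    combine = {"single": min, "complete": max}[linkage]
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster distance: closest pair (single) or farthest pair (complete).
                d = combine(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, sorted(clusters[i] + clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


if __name__ == "__main__":
    data = [0, 1, 3, 7, 15]
    for linkage in ("single", "complete"):
        print(linkage, agglomerative(data, linkage))
```

The merge log is exactly the information a dendrogram draws: which clusters were joined, and at what distance.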


k-means

• Needed: small number k of desired clusters

• hard vs. soft decisions

• Example: Weka


k-means

initialize cluster centroids to arbitrary vectors
while further improvement is possible do
    for each document d do
        find the cluster c whose centroid is closest to d
        assign d to cluster c
    end for
    for each cluster c do
        recompute the centroid of cluster c based on its documents
    end for
end while


Example

• Cluster the following vectors into two groups:
– A = <1,6>
– B = <2,2>
– C = <4,0>
– D = <3,3>
– E = <2,5>
– F = <2,1>
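A minimal k-means sketch for this example follows. The initial centroids are taken to be the points A and B, which is an arbitrary choice made here (the slides leave initialization open), and the function name `kmeans` is likewise an assumption. A different initialization can lead to a different clustering.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Hard k-means; returns (cluster index per point, final centroids)."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest centroid (squared Euclidean distance).
        new_assignment = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            new_assignment.append(dists.index(min(dists)))
        if new_assignment == assignment:   # no further improvement
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids


POINTS = {"A": (1, 6), "B": (2, 2), "C": (4, 0), "D": (3, 3), "E": (2, 5), "F": (2, 1)}

if __name__ == "__main__":
    labels, centers = kmeans(list(POINTS.values()), [POINTS["A"], POINTS["B"]])
    for name, label in zip(POINTS, labels):
        print(name, "-> cluster", label)
    print("centroids:", centers)
```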


Complexity

• Complexity = O(kn) because at each step, n documents have to be compared to k centroids.


Human clustering

• Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Macskassy et al. 1998).


Lexical networks


Lexical Networks

• Used to represent relationships between words

• Example: WordNet - created by George Miller’s team at Princeton

• Based on synsets (synonyms, interchangeable words) and lexical matrices


Lexical matrix

Word         Word Forms
Meanings     F1     F2     F3     …     Fn
M1           E1,1   E1,2
M2                  E2,2
…                          …
Mm                                      Em,n


Synsets

• Disambiguation
– {board, plank}
– {board, committee}
• Synonyms
– substitution
– weak substitution
– synonyms must be of the same part of speech


$ ./wn board -hypen

Synonyms/Hypernyms (Ordered by Frequency) of noun board

9 senses of board

Sense 1
board => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping

Sense 2
board => sheet, flat solid => artifact, artefact => object, physical object => entity, something

Sense 3
board, plank => lumber, timber => building material => artifact, artefact => object, physical object => entity, something


Sense 4
display panel, display board, board => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

Sense 5
board, gameboard => surface => artifact, artefact => object, physical object => entity, something

Sense 6
board, table => fare => food, nutrient => substance, matter => object, physical object => entity, something


Sense 7
control panel, instrument panel, control board, board, panel => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

Sense 8
circuit board, circuit card, board, card => printed circuit => computer circuit => circuit, electrical circuit, electric circuit => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

Sense 9
dining table, board => table => furniture, piece of furniture, article of furniture => furnishings => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something


Antonymy

• “x” vs. “not-x”

• “rich” vs. “poor”?

• {rise, ascend} vs. {fall, descend}


Other relations

• Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”.

• Hyponymy: {tree} is a hyponym of {plant}.

• Hierarchical structure based on hyponymy (and hypernymy).


Other features of WordNet

• Index of familiarity

• Polysemy


Familiarity and polysemy

board used as a noun is familiar (polysemy count = 9)

bird used as a noun is common (polysemy count = 5)

cat used as a noun is common (polysemy count = 7)

house used as a noun is familiar (polysemy count = 11)

information used as a noun is common (polysemy count = 5)

retrieval used as a noun is uncommon (polysemy count = 3)

serendipity used as a noun is very rare (polysemy count = 1)


Compound nouns

advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board
blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees


Overview of senses

1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")


Top-level concepts

{act, action, activity}

{animal, fauna}

{artifact}

{attribute, property}

{body, corpus}

{cognition, knowledge}

{communication}

{event, happening}

{feeling, emotion}

{food}

{group, collection}

{location, place}

{motive}

{natural object}

{natural phenomenon}

{person, human being}

{plant, flora}

{possession}

{process}

{quantity, amount}

{relation}

{shape}

{state, condition}

{substance}

{time}


WordNet and DistSim

wn reason -hypen - hypernyms

wn reason -synsn - synsets

wn reason -simsn - synonyms

wn reason -over - overview of senses

wn reason -famln - familiarity/polysemy

wn reason -grepn - compound nouns

/data2/tools/relatedwords/relate reason


System comparison


Comparing two systems

• Comparing A and B

• One query?

• Average performance?

• Need: A to consistently outperform B

[this slide: courtesy James Allan]


The sign test

• Example 1:
– A > B (12 times)
– A = B (25 times)
– A < B (3 times)
– p ≈ 0.035 (significant at the 5% level)
• Example 2:
– A > B (18 times)
– A < B (9 times)
– p ≈ 0.122 (not significant at the 5% level)

[this slide: courtesy James Allan]
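The p-values in both examples are consistent with a two-sided exact sign test in which ties are discarded. The sketch below reproduces them under that assumption; the function name `sign_test` is invented for this illustration, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def sign_test(wins_a, wins_b):
    """Two-sided exact sign test p-value; ties are dropped beforehand."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print("Example 1:", round(sign_test(12, 3), 4))
    print("Example 2:", round(sign_test(18, 9), 4))
```

Note that the 25 ties in Example 1 play no role: only the queries on which the systems differ carry information for this test.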


Other tests

• The t test:
– Takes into account the actual performances, not just which system is better
– http://nimitz.mcs.kent.edu/~blewis/stat/tTest.html
• The sign test:
– http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html


Techniques for dimensionality reduction: SVD and LSI


Techniques for dimensionality reduction

• Based on matrix decomposition (goal: preserve clusters, explain away variance)

• A quick review of matrices
– Vectors
– Matrices
– Matrix multiplication

| 1  2   3 |   | 2 |
| 2  5   8 | * | 1 |
| 4  9  14 |   | 1 |


SVD: Singular Value Decomposition

• A = UΣV^T

• This decomposition exists for all matrices, dense or sparse

• If A has 5 rows and 3 columns, then U will be 5x5, Σ will be 5x3 and V will be 3x3

• In Matlab, use [U,S,V] = svd (A)


Term matrix normalization

[Figure: a binary term-document matrix A over documents D1 to D5, and its normalized version A(n), in which every column is scaled to unit length (nonzero entries 0.71, 0.58 and 0.45)]


Example (Berry and Browne)

• T1: baby
• T2: child
• T3: guide
• T4: health
• T5: home
• T6: infant
• T7: proofing
• T8: safety
• T9: toddler

• D1: infant & toddler first aid
• D2: babies & children’s room (for your home)
• D3: child safety at home
• D4: your baby’s health and safety: from infant to toddler
• D5: baby proofing basics
• D6: your guide to easy rust proofing
• D7: beanie babies collector’s guide


Document term matrix

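A =

      D1  D2  D3  D4  D5  D6  D7
T1     0   1   0   1   1   0   1
T2     0   1   1   0   0   0   0
T3     0   0   0   0   0   1   1
T4     0   0   0   1   0   0   0
T5     0   1   1   0   0   0   0
T6     1   0   0   1   0   0   0
T7     0   0   0   0   1   1   0
T8     0   0   1   1   0   0   0
T9     1   0   0   1   0   0   0

A(n) =

      D1    D2    D3    D4    D5    D6    D7
T1    0     0.58  0     0.45  0.71  0     0.71
T2    0     0.58  0.58  0     0     0     0
T3    0     0     0     0     0     0.71  0.71
T4    0     0     0     0.45  0     0     0
T5    0     0.58  0.58  0     0     0     0
T6    0.71  0     0     0.45  0     0     0
T7    0     0     0     0     0.71  0.71  0
T8    0     0     0.58  0.45  0     0     0
T9    0.71  0     0     0.45  0     0     0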


Decomposition

u =

-0.6976 -0.0945  0.0174 -0.6950  0.0000  0.0153  0.1442 -0.0000  0
-0.2622  0.2946  0.4693  0.1968 -0.0000 -0.2467 -0.1571 -0.6356  0.3098
-0.3519 -0.4495 -0.1026  0.4014  0.7071 -0.0065 -0.0493 -0.0000  0.0000
-0.1127  0.1416 -0.1478 -0.0734  0.0000  0.4842 -0.8400  0.0000 -0.0000
-0.2622  0.2946  0.4693  0.1968  0.0000 -0.2467 -0.1571  0.6356 -0.3098
-0.1883  0.3756 -0.5035  0.1273 -0.0000 -0.2293  0.0339 -0.3098 -0.6356
-0.3519 -0.4495 -0.1026  0.4014 -0.7071 -0.0065 -0.0493  0.0000 -0.0000
-0.2112  0.3334  0.0962  0.2819 -0.0000  0.7338  0.4659 -0.0000  0.0000
-0.1883  0.3756 -0.5035  0.1273 -0.0000 -0.2293  0.0339  0.3098  0.6356

v =

-0.1687  0.4192 -0.5986  0.2261  0      -0.5720  0.2433
-0.4472  0.2255  0.4641 -0.2187  0.0000 -0.4871 -0.4987
-0.2692  0.4206  0.5024  0.4900 -0.0000  0.2450  0.4451
-0.3970  0.4003 -0.3923 -0.1305  0       0.6124 -0.3690
-0.4702 -0.3037 -0.0507 -0.2607 -0.7071  0.0110  0.3407
-0.3153 -0.5018 -0.1220  0.7128 -0.0000 -0.0162 -0.3544
-0.4702 -0.3037 -0.0507 -0.2607  0.7071  0.0110  0.3407


Decomposition

s =

1.5849  0       0       0       0       0       0
0       1.2721  0       0       0       0       0
0       0       1.1946  0       0       0       0
0       0       0       0.7996  0       0       0
0       0       0       0       0.7100  0       0
0       0       0       0       0       0.5692  0
0       0       0       0       0       0       0.1977
0       0       0       0       0       0       0
0       0       0       0       0       0       0

Spread on the v1 axis
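The largest singular value can be checked without any linear algebra library. The sketch below is not the Matlab `svd` call used in these slides; it swaps in plain power iteration on AᵀA, in pure Python, and only recovers the first singular value. The matrix is the normalized term-document matrix of the Berry and Browne example with entries rounded to two decimals, and the function name `top_singular_value` is an assumption of this example.

```python
from math import sqrt

# Normalized term-document matrix (rows T1..T9, columns D1..D7), rounded to two decimals.
A = [
    [0,    0.58, 0,    0.45, 0.71, 0,    0.71],
    [0,    0.58, 0.58, 0,    0,    0,    0],
    [0,    0,    0,    0,    0,    0.71, 0.71],
    [0,    0,    0,    0.45, 0,    0,    0],
    [0,    0.58, 0.58, 0,    0,    0,    0],
    [0.71, 0,    0,    0.45, 0,    0,    0],
    [0,    0,    0,    0,    0.71, 0.71, 0],
    [0,    0,    0.58, 0.45, 0,    0,    0],
    [0.71, 0,    0,    0.45, 0,    0,    0],
]


def top_singular_value(a, iterations=200):
    """Largest singular value of a, by power iteration on a^T a."""
    rows, cols = len(a), len(a[0])
    v = [1.0] * cols
    for _ in range(iterations):
        u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = a v
        w = [sum(a[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = a^T u
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return sqrt(sum(x * x for x in u))  # ||a v|| for the unit vector v


if __name__ == "__main__":
    print(round(top_singular_value(A), 4))
```

Power iteration converges here because the second singular value is clearly smaller than the first; for a full decomposition a library routine such as the Matlab call is the practical choice.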


Rank-4 approximation

s4 =

1.5849  0       0       0       0  0  0
0       1.2721  0       0       0  0  0
0       0       1.1946  0       0  0  0
0       0       0       0.7996  0  0  0
0       0       0       0       0  0  0
0       0       0       0       0  0  0
0       0       0       0       0  0  0
0       0       0       0       0  0  0
0       0       0       0       0  0  0


Rank-4 approximation

u*s4*v'

-0.0019  0.5985 -0.0148  0.4552  0.7002  0.0102  0.7002
-0.0728  0.4961  0.6282  0.0745  0.0121 -0.0133  0.0121
 0.0003 -0.0067  0.0052 -0.0013  0.3584  0.7065  0.3584
 0.1980  0.0514  0.0064  0.2199  0.0535 -0.0544  0.0535
-0.0728  0.4961  0.6282  0.0745  0.0121 -0.0133  0.0121
 0.6337 -0.0602  0.0290  0.5324 -0.0008  0.0003 -0.0008
 0.0003 -0.0067  0.0052 -0.0013  0.3584  0.7065  0.3584
 0.2165  0.2494  0.4367  0.2282 -0.0360  0.0394 -0.0360
 0.6337 -0.0602  0.0290  0.5324 -0.0008  0.0003 -0.0008


Rank-4 approximation

u*s4

-1.1056 -0.1203  0.0207 -0.5558  0  0  0
-0.4155  0.3748  0.5606  0.1573  0  0  0
-0.5576 -0.5719 -0.1226  0.3210  0  0  0
-0.1786  0.1801 -0.1765 -0.0587  0  0  0
-0.4155  0.3748  0.5606  0.1573  0  0  0
-0.2984  0.4778 -0.6015  0.1018  0  0  0
-0.5576 -0.5719 -0.1226  0.3210  0  0  0
-0.3348  0.4241  0.1149  0.2255  0  0  0
-0.2984  0.4778 -0.6015  0.1018  0  0  0


Rank-4 approximation

s4*v'

-0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451
 0.5333  0.2869  0.5351  0.5092 -0.3863 -0.6384 -0.3863
-0.7150  0.5544  0.6001 -0.4686 -0.0605 -0.1457 -0.0605
 0.1808 -0.1749  0.3918 -0.1043 -0.2085  0.5700 -0.2085
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0


Rank-2 approximation

s2 =

1.5849 0      0 0 0 0 0
0      1.2721 0 0 0 0 0
0      0      0 0 0 0 0
0      0      0 0 0 0 0
0      0      0 0 0 0 0
0      0      0 0 0 0 0
0      0      0 0 0 0 0
0      0      0 0 0 0 0
0      0      0 0 0 0 0


Rank-2 approximation

u*s2*v'

 0.1361  0.4673  0.2470  0.3908  0.5563  0.4089  0.5563
 0.2272  0.2703  0.2695  0.3150  0.0815 -0.0571  0.0815
-0.1457  0.1204 -0.0904 -0.0075  0.4358  0.4628  0.4358
 0.1057  0.1205  0.1239  0.1430  0.0293 -0.0341  0.0293
 0.2272  0.2703  0.2695  0.3150  0.0815 -0.0571  0.0815
 0.2507  0.2412  0.2813  0.3097 -0.0048 -0.1457 -0.0048
-0.1457  0.1204 -0.0904 -0.0075  0.4358  0.4628  0.4358
 0.2343  0.2454  0.2685  0.3027  0.0286 -0.1073  0.0286
 0.2507  0.2412  0.2813  0.3097 -0.0048 -0.1457 -0.0048


Rank-2 approximation

u*s2

-1.1056 -0.1203 0 0 0 0 0
-0.4155  0.3748 0 0 0 0 0
-0.5576 -0.5719 0 0 0 0 0
-0.1786  0.1801 0 0 0 0 0
-0.4155  0.3748 0 0 0 0 0
-0.2984  0.4778 0 0 0 0 0
-0.5576 -0.5719 0 0 0 0 0
-0.3348  0.4241 0 0 0 0 0
-0.2984  0.4778 0 0 0 0 0


Rank-2 approximation

s2*v'

-0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451
 0.5333  0.2869  0.5351  0.5092 -0.3863 -0.6384 -0.3863
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0


Documents to concepts and terms to concepts

>> A(:,1)'*u*s

-0.4238 0.6784 -0.8541 0.1446 -0.0000 -0.1853 0.0095

>> A(:,1)'*u*s4

-0.4238 0.6784 -0.8541 0.1446 0 0 0

>> A(:,1)'*u*s2

-0.4238 0.6784 0 0 0 0 0

>> A(:,2)'*u*s2

-1.1233 0.3650 0 0 0 0 0

>> A(:,3)'*u*s2

-0.6762 0.6807 0 0 0 0 0


Documents to concepts and terms to concepts

>> A(:,4)'*u*s2

-0.9972 0.6478 0 0 0 0 0

>> A(:,5)'*u*s2

-1.1809 -0.4914 0 0 0 0 0

>> A(:,6)'*u*s2

-0.7918 -0.8121 0 0 0 0 0

>> A(:,7)'*u*s2

-1.1809 -0.4914 0 0 0 0 0


Cont’d

>> (s2*v'*A(1,:)')'

-1.7523 -0.1530 0 0 0 0 0 0 0

>> (s2*v'*A(2,:)')'

-0.6585 0.4768 0 0 0 0 0 0 0

>> (s2*v'*A(3,:)')'

-0.8838 -0.7275 0 0 0 0 0 0 0

>> (s2*v'*A(4,:)')'

-0.2831 0.2291 0 0 0 0 0 0 0

>> (s2*v'*A(5,:)')'

-0.6585 0.4768 0 0 0 0 0 0 0


Cont’d

>> (s2*v'*A(6,:)')'

-0.4730 0.6078 0 0 0 0 0 0 0

>> (s2*v'*A(7,:)')'

-0.8838 -0.7275 0 0 0 0 0 0 0

>> (s2*v'*A(8,:)')'

-0.5306 0.5395 0 0 0 0 0 0 0

>> (s2*v'*A(9,:)')'

-0.4730 0.6078 0 0 0 0 0 0 0


Properties

A*A'

1.5471 0.3364 0.5041 0.2025 0.3364 0.2025 0.5041 0.2025 0.2025
0.3364 0.6728 0      0      0.6728 0      0      0.3364 0
0.5041 0      1.0082 0      0      0      0.5041 0      0
0.2025 0      0      0.2025 0      0.2025 0      0.2025 0.2025
0.3364 0.6728 0      0      0.6728 0      0      0.3364 0
0.2025 0      0      0.2025 0      0.7066 0      0.2025 0.7066
0.5041 0      0.5041 0      0      0      1.0082 0      0
0.2025 0.3364 0      0.2025 0.3364 0.2025 0      0.5389 0.2025
0.2025 0      0      0.2025 0      0.7066 0      0.2025 0.7066

A'*A

1.0082 0      0      0.6390 0      0      0
0      1.0092 0.6728 0.2610 0.4118 0      0.4118
0      0.6728 1.0092 0.2610 0      0      0
0.6390 0.2610 0.2610 1.0125 0.3195 0      0.3195
0      0.4118 0      0.3195 1.0082 0.5041 0.5041
0      0      0      0      0.5041 1.0082 0.5041
0      0.4118 0      0.3195 0.5041 0.5041 1.0082

A is a term-by-document matrix (here 9 terms as rows and 7 documents as columns). What is A*A', and what is A'*A?
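One way to explore the question above is to form both products for a tiny matrix and look at their shapes. The sketch below is only an illustration: the 3 x 2 matrix is hypothetical (it is not the A from the slides), and `transpose` and `matmul` are helper names introduced here.

```python
# Toy term-by-document matrix: 3 terms (rows) x 2 documents (columns).
# The values are hypothetical and are NOT the matrix A used on the slides.
A = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
]

def transpose(m):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain matrix product of two list-of-rows matrices."""
    yt = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in yt] for row in x]

term_term = matmul(A, transpose(A))  # A*A': one row and column per term
doc_doc = matmul(transpose(A), A)    # A'*A: one row and column per document

print(len(term_term), "x", len(term_term[0]))
print(len(doc_doc), "x", len(doc_doc[0]))
```

Entry (i, j) of A*A' is the dot product of the rows for terms i and j, and entry (i, j) of A'*A is the dot product of the columns for documents i and j, which matches the 9 x 9 and 7 x 7 shapes shown above.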


Latent semantic indexing (LSI)

• Dimensionality reduction = identification of hidden (latent) concepts

• Query matching in latent space


Useful pointers

• http://lsa.colorado.edu

• http://lsi.research.telcordia.com/

• http://www.cs.utk.edu/~lsi/

• http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm

• http://citeseer.nj.nec.com/deerwester90indexing.html

• http://www.pcug.org.au/~jdowling/


Models of the Web


Size

• The Web is the largest repository of data and it grows exponentially.

– 320 Million Web pages [Lawrence & Giles 1998]

– 800 Million Web pages, 15 TB [Lawrence & Giles 1999]

– 8 Billion Web pages indexed [Google 2005]

• Amount of data

– roughly 200 TB [Lyman et al. 2003]


Bow-tie model of the Web

[Figure: bow-tie diagram of the Web with components SCC (56 M pages), IN (44 M), OUT (44 M), TEND (44 M), and DISC (17 M)]

24% of pages reachable from a given page

Broder & al. WWW 2000, Dill & al. VLDB 2001


Power laws

• Web site size (Huberman and Adamic 1999)

• Power-law connectivity (Barabasi and Albert 1999): exponents 2.45 for out-degree and 2.1 for the in-degree

• Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdos, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3.


Small-world networks

• Diameter = average length of the shortest path between all pairs of nodes. Example…

• Milgram experiment (1967)

– Kansas/Omaha --> Boston (42/160 letters)

– diameter = 6

• Albert et al. 1999: the average distance between two vertices is $d = 0.35 + 2.06 \log_{10} n$. For $n = 10^9$, $d = 18.89$.

• Six degrees of separation
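The Albert et al. fit quoted above is easy to check numerically. This is a minimal sketch; the function name `average_distance` is introduced here for illustration only.

```python
import math

def average_distance(n):
    """Average distance predicted by the fit d = 0.35 + 2.06 * log10(n)."""
    return 0.35 + 2.06 * math.log10(n)

d = average_distance(10**9)
print(round(d, 2))
```

For n = 10^9 the logarithm is 9, so d = 0.35 + 2.06 * 9 = 18.89, which agrees with the value on the slide.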


Clustering coefficient

• Cliquishness (C): the fraction of the $k_v (k_v - 1)/2$ possible pairs of neighbors of a vertex v that are themselves connected.

• Examples:

             n        k      d      d_rand   C      C_rand
Actors       225226   61     3.65   2.99     0.79   0.00027
Power grid   4941     2.67   18.7   12.4     0.08   0.005
C. Elegans   282      14     2.65   2.25     0.28   0.05
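As a rough illustration of the cliquishness measure, the sketch below counts, for one vertex, how many of its neighbor pairs are linked and divides by k_v(k_v - 1)/2. The small graph and the function name are hypothetical, and returning 0 for vertices with fewer than two neighbors is an assumed convention.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of the k_v(k_v - 1)/2 neighbor pairs of v that are linked."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention when no neighbor pair exists
    linked = sum(1 for a, b in combinations(sorted(neighbors), 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)

# Hypothetical undirected graph stored as adjacency sets.
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2},
    4: {1},
}

c1 = clustering_coefficient(adj, 1)
average_c = sum(clustering_coefficient(adj, v) for v in adj) / len(adj)
print(c1, average_c)
```

Averaging the per-vertex values over the whole graph gives a single number comparable to the C column in the table.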


Models of the Web

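$\langle k \rangle = Np$

$P(k) = e^{-\langle k \rangle} \frac{\langle k \rangle^{k}}{k!}$

$P(k) \sim k^{-\gamma}$

[Figure: example networks with nodes labeled A, B and a, b]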

• Erdös/Rényi 59, 60

• Barabási/Albert 99

• Watts/Strogatz 98

• Kleinberg 98

• Menczer 02

• Radev 03

• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology


Social network analysis for IR


Social networks

• Induced by a relation

• Symmetric or not

• Examples:

– Friendship networks

– Board membership

– Citations

– Power grid of the US

– WWW


Krebs 2004


Prestige and centrality

• Degree centrality: how many neighbors each node has.

• Closeness centrality: how close a node is to all of the other nodes

• Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes

• Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.

• Prestige = same as centrality but for directed graphs.
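The first two measures in the list can be sketched in a few lines. This is an illustration under stated assumptions: the three-node path graph is hypothetical, and closeness is computed here as (number of reachable nodes) / (sum of shortest-path distances), which is one common normalization among several.

```python
from collections import deque

def degree_centrality(adj):
    """Number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def closeness_centrality(adj, v):
    """Reachable nodes divided by the sum of shortest-path distances from v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:  # breadth-first search gives shortest paths in hops
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical undirected path graph a - b - c.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

print(degree_centrality(adj))
print(closeness_centrality(adj, "b"), closeness_centrality(adj, "a"))
```

The middle node scores highest on both measures, as expected for a path.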


Graph-based representations

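Graph G (V,E)

[Figure: directed graph on nodes 1-8]

Square connectivity (incidence) matrix

[Table: 8 x 8 matrix with rows and columns labeled 1-8; the column positions of the entries were lost in extraction. Number of 1 entries per row: row 1: 2, row 2: 1, row 3: 2, row 4: 1, row 5: 4, row 6: 2, rows 7 and 8: none]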


Markov chains

• A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.

• Path = sequence (x0, x1, …, xn), where $x_i = x_{i-1} E$

• The probability of a path can be computed as a product of probabilities for each step i.

• Random walk = find $x_j$ given $x_0$, E, and j.
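The update rule and the path-probability product can be sketched as follows. The two-state kernel, the start distribution, and the function names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def step(x, E):
    """One step of the chain: x_i = x_{i-1} * E (row vector times matrix)."""
    n = len(E)
    return [sum(x[i] * E[i][j] for i in range(n)) for j in range(n)]

def walk(x0, E, j):
    """Distribution after j steps, starting from the distribution x0."""
    x = list(x0)
    for _ in range(j):
        x = step(x, E)
    return x

def path_probability(path, x0, E):
    """Probability of a state sequence: start probability times each transition."""
    p = x0[path[0]]
    for a, b in zip(path, path[1:]):
        p *= E[a][b]
    return p

# Hypothetical 2-state kernel; each row sums to 1.
E = [[0.5, 0.5],
     [1.0, 0.0]]
x0 = [1.0, 0.0]

x2 = walk(x0, E, 2)
print(x2, path_probability([0, 1, 0], x0, E))
```

Starting in state 0, one step gives [0.5, 0.5] and a second step gives [0.75, 0.25]; the path 0, 1, 0 has probability 1 * 0.5 * 1 = 0.5.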


Stationary solutions

• The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:

– E is stochastic

– E is irreducible

– E is aperiodic

• To make these conditions true:

– All rows of E add up to 1 (and no value is negative)

– Make sure that E is strongly connected

– Make sure that E is not bipartite

• Example: PageRank [Brin and Page 1998]: use “teleportation”
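A minimal sketch of how teleportation can enforce the three conditions: every row of the link matrix is mixed with a uniform jump, so all entries become positive and every row sums to 1. The damping value d = 0.85, the uniform treatment of pages without out-links, the function name, and the toy link matrix are all assumptions of this sketch rather than details from the slides.

```python
def add_teleportation(links, d=0.85):
    """Mix link following (weight d) with a uniform jump (weight 1 - d)."""
    n = len(links)
    E = []
    for row in links:
        out = sum(row)
        if out == 0:
            # Page without out-links: assumed to jump uniformly at random.
            E.append([1.0 / n] * n)
        else:
            E.append([d * a / out + (1.0 - d) / n for a in row])
    return E

# Hypothetical link matrix: 0 -> 1, 1 -> 2, and page 2 has no out-links.
links = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]

E = add_teleportation(links)
for row in E:
    print([round(v, 3) for v in row])
```

Because every entry of the result is strictly positive, the chain is strongly connected and aperiodic, so a stationary distribution exists.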


Example

This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.

[Figure: directed graph on nodes 1-8, with bar charts of PageRank (y-axis, 0 to 1) per node (x-axis, 1 to 8) at t=0 and t=1]


Eigenvectors

• An eigenvector is an implicit “direction” for a matrix: $Mv = \lambda v$, where v is non-zero, though λ can be any complex number in principle.

• The largest eigenvalue of a stochastic matrix E is real: $\lambda_1 = 1$.

• For $\lambda_1$, the left (principal) eigenvector is p; the right eigenvector is $\mathbf{1}$, the all-ones vector.

• In other words, $E^T p = p$.


Computing the stationary distribution

Solution for the stationary distribution:

$E^T p = p$

$(I - E^T)\, p = 0$

function PowerStatDist (E):
begin
  p(0) = u; (or p(0) = [1,0,…,0])
  i = 0;
  repeat
    i = i + 1;
    p(i) = E^T p(i-1);
    L = ||p(i) - p(i-1)||_1;
  until L < ε
  return p(i)
end
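The power method above can be written as a short runnable sketch. The tolerance `eps`, the `max_iter` safety cap, and the two-state kernel are assumptions added for illustration; they are not given on the slide.

```python
def power_stat_dist(E, eps=1e-10, max_iter=10000):
    """Iterate p(i) = E^T p(i-1) until the L1 change drops below eps."""
    n = len(E)
    p = [1.0 / n] * n  # p(0) = u, the uniform distribution
    for _ in range(max_iter):
        # Component j of E^T p is the sum over i of E[i][j] * p[i].
        nxt = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < eps:
            return nxt
        p = nxt
    return p  # not converged within max_iter; returned as the best estimate

# Hypothetical kernel: stochastic, irreducible, and aperiodic.
E = [[0.5, 0.5],
     [1.0, 0.0]]

p = power_stat_dist(E)
print([round(v, 4) for v in p])
```

For this kernel the stationary distribution solves p0 = 0.5 p0 + p1 and p1 = 0.5 p0, which gives p = (2/3, 1/3).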


Example

[Figure: directed graph on nodes 1-8, with bar charts of PageRank (y-axis, 0 to 1) per node (x-axis, 1 to 8) at t=0, t=1, and t=10]


How Google works

• Crawling

• Anchor text

• Fast query processing

• PageRank


More about PageRank

• Named after Larry Page, co-founder of Google (and UM alum)

• Reading: “The anatomy of a large-scale hypertextual web search engine” by Brin and Page.

• Independent of the query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank).


HITS

• Query-dependent model (Kleinberg 97)

• Hubs and authorities (e.g., cars, Honda)

• Algorithm

– obtain root set using input query

– expand the root set by radius one

– run iterations on the hub and authority scores together

– report top-ranking authorities and hubs

$a' = E^T h$

$h' = E a$
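The hub and authority iteration can be sketched as below. Several choices are assumptions of this sketch and not taken from the slide: scores start at 1, each vector is rescaled to unit Euclidean length after every round, the hub update uses the freshly updated authority scores, and the three-page link matrix is hypothetical.

```python
import math

def hits(E, iterations=50):
    """Alternate a = E^T h and h = E a, normalizing after each round."""
    n = len(E)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of hub scores of pages that link to j.
        a = [sum(E[i][j] * h[i] for i in range(n)) for j in range(n)]
        # Hub score of i: sum of authority scores of pages that i links to.
        h = [sum(E[i][j] * a[j] for j in range(n)) for i in range(n)]
        na = math.sqrt(sum(x * x for x in a)) or 1.0
        nh = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return a, h

# Hypothetical expanded root set: pages 0 and 1 both link to page 2.
E = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]

a, h = hits(E)
print([round(x, 4) for x in a], [round(x, 4) for x in h])
```

Page 2 ends up as the only authority, and pages 0 and 1 share the hub weight equally.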


The link-content hypothesis

• Topical locality: a page is similar ($\sigma$) to the page that points to it ($\delta$).

• Davison (TF*IDF, 100K pages)

– 0.31 same domain

– 0.23 linked pages

– 0.19 sibling

– 0.02 random

• Menczer (373K pages, non-linear least squares fit)

$\sigma(\delta) \approx \sigma_\infty + (1 - \sigma_\infty)\, e^{-\alpha_1 \delta^{\alpha_2}}$, with $\alpha_1 = 1.8$, $\alpha_2 = 0.6$, $\sigma_\infty = 0.03$

• Chakrabarti (focused crawling) - prob. of losing the topic

Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001
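To get a feel for the Menczer fit, the sketch below evaluates an exponential decay of similarity with link distance, assuming the fit has the form sigma(delta) = sigma_inf + (1 - sigma_inf) * exp(-alpha1 * delta^alpha2) with the constants 1.8, 0.6, and 0.03 from the slide. The functional form is a reconstruction and the function name is introduced here, so treat both as assumptions.

```python
import math

def similarity_at_distance(delta, alpha1=1.8, alpha2=0.6, sigma_inf=0.03):
    """Assumed fit: similarity decays from 1 toward sigma_inf with link distance."""
    return sigma_inf + (1.0 - sigma_inf) * math.exp(-alpha1 * delta ** alpha2)

for delta in range(0, 5):
    print(delta, round(similarity_at_distance(delta), 3))
```

Under this form a page is fully similar to itself at distance 0, and the similarity falls toward the background level of 0.03 within a few links.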


Measuring the Web


Bharat and Broder 1998

• Based on crawls of HotBot, Altavista, Excite, and InfoSeek

• 10,000 queries in mid and late 1997

• Estimate is 200M pages

• Only 1.4% are indexed by all of them


Example (from Bharat&Broder)

A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).


Question answering


People ask questions

• Excite corpus of 2,477,283 queries (one day’s worth)

• 8.4% of them are questions

– 43.9% factual (what is the country code for Belgium)

– 56.1% procedural (how do I set up TCP/IP) or other

• In other words, 100 K questions per day


People ask questions

In what year did baseball become an offical sport?
Who is the largest man in the world?
Where can i get information on Raphael?
where can i find information on puritan religion?
Where can I find how much my house is worth?
how do i get out of debt?
Where can I found out how to pass a drug test?
When is the Super Bowl?
who is California's District State Senator?
where can I buy extra nibs for a foutain pen?
how do i set up tcp/ip ?
what time is it in west samoa?
Where can I buy a little kitty cat?
what are the symptoms of attention deficit disorder?
Where can I get some information on Michael Jordan?
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
When did the Neanderthal man live?
Which Frenchman declined the Nobel Prize for Literature for ideological reasons?
What is the largest city in Northern Afghanistan?


Question answering

What is the largest city in Northern Afghanistan?


Possible approaches

• Map?

• Knowledge base

Find x: city(x) ∧ located(x, “Northern Afghanistan”) ∧ ¬∃y: city(y) ∧ located(y, “Northern Afghanistan”) ∧ greaterthan(population(y), population(x))

• Database?

• World factbook?

• Search engine?


The TREC Q&A evaluation

• Run by NIST [Voorhees and Tice 2000]

• 2GB of input

• 200 questions

• Essentially fact extraction

– Who was Lincoln’s secretary of state?

– What does the Peugeot company manufacture?

• Questions are based on text

• Answers are assumed to be present

• No inference needed


Question answering

Q: When did Nelson Mandela become president of South Africa?

A: 10 May 1994

Q: How tall is the Matterhorn?

A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches

Q: How tall is the replica of the Matterhorn at Disneyland?

A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years

Q: If Iraq attacks a neighboring country, what should the US do?

A: ??


Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?


NSIR

• Current project at U-M

– http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi

• Reading:

– [Radev et al., 2005a] Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005

– http://tangra.si.umich.edu/~radev/bib2html/radev-bib.html


Google

... Afghanistan, Kabul, 2,450 ... Administrative capital and largest city (1997 est ... Undetermined. Panama, Panama City, 450,668. ... of the Gauteng, Northern Province, Mpumalanga ...
www.infoplease.com/cgi-bin/id/A0855603

... died in Kano, northern Nigeria's largest city, during two days of anti-American riots led by Muslims protesting the US-led bombing of Afghanistan, according to ...
www.washingtonpost.com/wp-dyn/print/world/

... air strikes on the city. ... the Taliban militia in northern Afghanistan in a significant blow ... defection would be the largest since the United States ...
www.afgha.com/index.php - 60k

... Kabul is the capital and largest city of Afghanistan. . ... met. area pop. 2,029,889), is the largest city in Uttar Pradesh, a state in northern India. . ...
school.discovery.com/homeworkhelp/worldbook/atozgeography/ k/k1menu.html

... Gudermes, Chechnya's second largest town. The attack ... location in Afghanistan's outlying regions ... in the city of Mazar-i-Sharif, a Northern Alliance-affiliated ...
english.pravda.ru/hotspots/2001/09/17/

... Get Worse By RICK BRAGG Pakistan's largest city is getting a jump on the ... Region: Education Offers Women in Northern Afghanistan a Ray of Hope. ...
www.nytimes.com/pages/world/asia/

... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan, held since 1998 by the Taliban. There was no immediate comment ...
uk.fc.yahoo.com/photos/a/afghanistan.html


[Figure: question answering pipeline, shown here stage by stage with its output]

Question: What is the largest city in Northern Afghanistan?

Query modulation: (largest OR biggest) city “Northern Afghanistan”

Document retrieval: www.infoplease.com/cgi-bin/id/A0855603, www.washingtonpost.com/wp-dyn/print/world/

Sentence retrieval:
– Gudermes, Chechnya's second largest town … location in Afghanistan's outlying regions
– within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan

Answer extraction: Gudermes, Mazar-e-Sharif

Answer ranking: Mazar-e-Sharif, Gudermes


Research problems

• Source identification:

– semi-structured vs. text sources

• Query modulation:

– best paraphrase of a NL question given the syntax of a search engine?

– Compare two approaches: noisy channel model and rule-based

• Sentence ranking

– n-gram matching, Okapi, co-reference?

• Answer extraction

– question type identification

– phrase chunking

– no general-purpose named entity tagger available

• Answer ranking

– what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency

• Evaluation (MRDR)

– accuracy, reliability, timeliness


Document retrieval

• Use existing search engines: Google, AlltheWeb, NorthernLight

• No modifications to question

• CF: work on QASM (ACM CIKM 2001)


Sentence ranking

• Weighted N-gram matching:

$S = \frac{1}{F} \left( \sum_{i=1}^{N_1} w_1 \cdot tf_i \cdot idf_i + \sum_{j=1}^{N_2} w_2 \cdot tf_j + \sum_{k=1}^{N_3} w_3 \cdot tf_k \right)$

• Weights are determined empirically, e.g., 0.6, 0.3, and 0.1
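A rough sketch of weighted n-gram matching follows. The slide does not spell out the details, so this version makes several assumptions that are flagged in the comments: the three sums run over the question's unigrams, bigrams, and trigrams; tf is the number of times an n-gram occurs in the candidate sentence; only unigrams use idf (defaulting to 1.0 when a word is missing from the table); and the normalization factor F is taken to be the sentence length. All names are introduced here for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_score(question, sentence, idf, weights=(0.6, 0.3, 0.1)):
    """Weighted unigram/bigram/trigram overlap between question and sentence."""
    q = question.lower().split()
    s = sentence.lower().split()
    score = 0.0
    for n, w in zip((1, 2, 3), weights):
        sent_grams = ngrams(s, n)
        for gram in set(ngrams(q, n)):
            tf = sent_grams.count(gram)  # assumed: tf counted in the sentence
            if n == 1:
                score += w * tf * idf.get(gram[0], 1.0)  # assumed default idf
            else:
                score += w * tf
    return score / max(len(s), 1)  # assumed: F is the sentence length

score = sentence_score("largest city", "the largest city", idf={})
print(round(score, 3))
```

In the example, two matching unigrams contribute 0.6 each and one matching bigram contributes 0.3, giving 1.5, which is then divided by the sentence length of 3.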


Probabilistic phrase reranking

• Answer extraction: probabilistic phrase reranking. What is:

p(ph is answer to q | q, ph)

• Evaluation: TRDR

$TRDR = \sum_{i=1}^{n} \frac{1}{r_i}$

– Example: (2,8,10) gives .725

– Document, sentence, or phrase level

• Criterion: presence of answer(s)

• High correlation with manual assessment
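The TRDR example can be reproduced directly, reading the score as the sum of reciprocal ranks of the items that contain an answer (an interpretation that matches the .725 example). The function name `trdr` is introduced here for illustration.

```python
def trdr(ranks):
    """Sum of reciprocal ranks of the items that contain an answer."""
    return sum(1.0 / r for r in ranks)

value = trdr([2, 8, 10])
print(round(value, 3))
```

With answers at ranks 2, 8, and 10 the score is 1/2 + 1/8 + 1/10 = 0.725.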


Phrase types

PERSON PLACE DATE NUMBER DEFINITION
ORGANIZATION DESCRIPTION ABBREVIATION
KNOWNFOR RATE LENGTH MONEY REASON
DURATION PURPOSE NOMINAL OTHER


Question Type Identification

• Wh-type not sufficient:

– Who: PERSON 77, DESCRIPTION 19, ORG 6

– What: NOMINAL 78, PLACE 27, DEF 26, PERSON 18, ORG 16, NUMBER 14, etc.

– How: NUMBER 33, LENGTH 6, RATE 2, etc.

• Ripper:

– 13 features: Question-Words, Wh-Word, Word-Beside-Wh-Word, Is-Noun-Length, Is-Noun-Person, etc.

– Top 2 question types

• Heuristic algorithm:

– About 100 regular expressions based on words and parts of speech


Ripper performance

Training     Test     Train Error Rate   Test Error Rate
TREC9        TREC8    22.4%              24%
TREC8,9      TREC10   17.03%             30%
TREC8,9,10   -        20.69%             -


Regex performance

Training     Test on TREC9   Test on TREC8   Test on TREC10
TREC9        7.8%            15%             18%
TREC8,9      7.4%            6%              18.2%
TREC8,9,10   4.6%            5.5%            7.6%


Phrase ranking

• Phrases are identified by a shallow parser (ltchunk from Edinburgh)

• Four features:

– Proximity

– POS (part-of-speech) signature (qtype)

– Query overlap

– Frequency


Proximity

• Phrasal answers tend to appear near words from the query

• Average distance = 7 words, range = 1 to 50 words

• Use linear rescaling of scores


Part of speech signature

Signature    Phrase Types
VBD          NO (100%)
DT NN        NO (86.7%) PERSON (3.8%) NUMBER (3.8%) ORG (2.5%)
NNP          PERSON (37.4%) PLACE (29.6%) DATE (21.7%) NO (7.6%)
DT JJ NNP    NO (75.6%) NUMBER (11.1%) PLACE (4.4%) ORG (4.4%)
NNP NNP      PLACE (37.3%) PERSON (35.6%) NO (16.9%) ORG (10.2%)
DT NNP       ORG (55.6%) NO (33.3%) PLACE (5.6%) DATE (5.6%)

Example: “Hugo/NNP Young/NNP”
P (PERSON | “NNP NNP”) = .458

Example: “the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN”
P (PERSON | “DT NNP NNP NNP NN”) = 0

Penn Treebank tagset (DT = determiner, JJ = adjective)


Query overlap and frequency

• Query overlap:

– What is the capital of Zimbabwe?

– Possible choices: Mugabe, Zimbabwe, Luanda, Harare

• Frequency:

– Not necessarily accurate but rather useful


Reranking

Rank Probability and phrase

1 0.599862 the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN ._.
2 0.598564 International_NNP Space_NNP Station_NNP Alpha_NNP
3 0.598398 International_NNP Space_NNP Station_NNP
4 0.598125 to_TO become_VB
5 0.594763 a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP
6 0.593933 NASA_NNP Johnson_NNP Space_NNP Center_NNP
7 0.587140 will_MD form_VB
8 0.585410 The_DT purpose_NN
9 0.576797 prime_JJ contracts_NNS
10 0.568013 First_NNP American_NNP
11 0.567361 this_DT bulletin_NN board_NN
12 0.565757 Space_NNP :_:
13 0.562627 'Spirit_NN '_'' of_IN
...
41 0.516368 Alan_NNP Shepard_NNP

Proximity = .5164


Reranking

Rank Probability and phrase

1 0.465012 Space_NNP Administration_NNP ._.
2 0.446466 SPACE_NNP CALENDAR_NNP _.
3 0.413976 First_NNP American_NNP
4 0.399043 International_NNP Space_NNP Station_NNP Alpha_NNP
5 0.396250 her_PRP$ third_JJ space_NN mission_NN
6 0.395956 NASA_NNP Johnson_NNP Space_NNP Center_NNP
7 0.394122 the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP
8 0.390163 the_DT Red_NNP Planet_NNP ._.
9 0.379797 First_NNP American_NNP
10 0.376336 Alan_NNP Shepard_NNP
11 0.375669 February_NNP
12 0.374813 Space_NNP
13 0.373999 International_NNP Space_NNP Station_NNP

Qtype = .7288
Proximity * qtype = .3763
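The combined score shown for "Alan Shepard" is consistent with multiplying the two feature values. As a small worked check, the sketch below takes the proximity-only score listed for that phrase on the earlier reranking slide (0.516368) and the qtype value .7288; treating the combination as a plain product is what the line "Proximity * qtype" suggests, and the variable names are introduced here.

```python
proximity = 0.516368  # proximity-only score listed for Alan_NNP Shepard_NNP
qtype = 0.7288        # question-type (POS signature) score from the slide

combined = proximity * qtype
print(round(combined, 4))
```

The product is about 0.3763, close to the 0.376336 shown for that phrase in the list above; the small difference is likely due to rounding of the qtype value.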


Reranking

Rank Probability and phrase

1 0.478857 Neptune_NNP Beach_NNP ._.
2 0.449232 February_NNP
3 0.447075 Go_NNP
4 0.437895 Space_NNP
5 0.431835 Go_NNP
6 0.424678 Alan_NNP Shepard_NNP
7 0.423855 First_NNP American_NNP
8 0.421133 Space_NNP May_NNP
9 0.411065 First_NNP American_NNP woman_NN
10 0.401994 Life_NNP Sciences_NNP
11 0.385763 Space_NNP Shuttle_NNP Discovery_NNP STS-60_NN
12 0.381865 the_DT Moon_NNP International_NNP Space_NNP Station_NNP
13 0.370030 Space_NNP Research_NNP A_NNP Session_NNP

All four features


Document level performance

TREC 8 corpus (200 questions)

Engine  AlltheWeb  NLight  Google
Avg     0.8355     1.0495  1.3361
#>0     149        163     164


Sentence level performance

Engine  AWU   AWL   AWO   NLU   NLL   NLO   GOU   GOL   GOO
Avg     2.13  0.31  0.26  2.53  0.48  0.44  2.55  0.54  0.49
#>0     148   99    99    159   121   119   159   137   135


Phrase level performance

Engine            AlltheWeb  NorthernLight  Google D+P  Google S+P
Upperbound        2.176      2.652          2.698       1.941
Appearance order  0.026      0.048          0.068       0.0646
Global proximity  0.038      0.054          0.058       0.0646
Combined          0.105      0.117          0.157       0.199

Experiments performed Oct.-Nov. 2001


Text classification


Introduction

• Text classification: assigning documents to predefined categories

• Hierarchical vs. flat

• Many techniques: generative (maxent, knn, Naïve Bayes) vs. discriminative (SVM, regression)

• Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x)

• Discriminative: model p(y|x) directly.
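The generative route can be sketched in a few lines of Python: given a joint table p(x,y), the posterior is p(y|x) = p(x,y) / Σ_y' p(x,y'). The joint probabilities and the helper name `posterior_from_joint` below are made up for illustration.

```python
def posterior_from_joint(joint, x):
    """Return p(y | x) for every class y, given a table joint[(x, y)] = p(x, y)."""
    # Marginal p(x): sum the joint over all classes.
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    # Bayes' rule: p(y | x) = p(x, y) / p(x).
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

# Toy joint distribution over one binary feature and two classes (made-up numbers).
joint = {
    ("has_word", "spam"): 0.30,
    ("has_word", "ham"): 0.10,
    ("no_word", "spam"): 0.10,
    ("no_word", "ham"): 0.50,
}

post = posterior_from_joint(joint, "has_word")
print(post)  # spam gets about 0.75, ham about 0.25
```

A discriminative model would skip the joint table and fit p(y|x) directly from labeled examples.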


Generative models: knn

• K-nearest neighbors

• Very easy to program

• Issues: choosing k, b?

$score(c, d_q) = b_c + \sum_{d \in kNN(d_q)} s(d_q, d)$
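A minimal Python sketch of the kNN score, under two assumptions not stated on the slide: s is cosine similarity over sparse term-weight vectors, and the sum for class c runs only over those nearest neighbors that are labeled c. The names `cosine` and `knn_scores` and the toy documents are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts (assumed choice of s)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_scores(query, training, k=3, bias=None):
    """score(c, d_q) = b_c + sum of s(d_q, d) over the k nearest neighbors labeled c."""
    bias = bias or {}
    # Rank training documents by similarity to the query and keep the top k.
    neighbors = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbors:
        scores[label] += cosine(query, vec)
    # Add the per-class bias b_c (taken as 0 when none is supplied).
    for label in {lab for _, lab in training}:
        scores[label] += bias.get(label, 0.0)
    return dict(scores)

# Toy training set: (term-weight vector, class label).
training = [
    ({"space": 2, "shuttle": 1}, "science"),
    ({"nasa": 1, "space": 1}, "science"),
    ({"wheat": 2, "grain": 1}, "commodities"),
    ({"corn": 1, "grain": 2}, "commodities"),
]
query = {"space": 1, "nasa": 1}
scores = knn_scores(query, training, k=3)
best = max(scores, key=scores.get)
print(best, scores)
```

Both k and the biases b_c are free parameters here, which is the tuning issue the slide raises; in practice they would be chosen on held-out data.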


Feature selection: The χ² test

• For a term t:

C \ It  0    1
0       k00  k01
1       k10  k11

• Testing for independence: P(C=0, It=0) should be equal to P(C=0) P(It=0)
– P(C=0) = (k00+k01)/n
– P(C=1) = 1-P(C=0) = (k10+k11)/n
– P(It=0) = (k00+k10)/n
– P(It=1) = 1-P(It=0) = (k01+k11)/n


Feature selection: The χ² test

• High values of χ² indicate lower belief in independence.

• In practice, compute χ² for all words and pick the top k among them.

$\chi^2 = \frac{n\,(k_{11}k_{00} - k_{10}k_{01})^2}{(k_{11}+k_{10})(k_{01}+k_{00})(k_{11}+k_{01})(k_{10}+k_{00})}$
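A short Python sketch of χ²-based feature selection, assuming k_ij counts the documents with class C=i and term indicator It=j, and n is the sum of the four cells. The function names and the toy counts are hypothetical.

```python
def chi_square(k00, k01, k10, k11):
    """Chi-square statistic for a 2x2 table; k_ij counts docs with C=i and I_t=j."""
    n = k00 + k01 + k10 + k11
    denom = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    if denom == 0:
        return 0.0  # degenerate table: treated as no evidence (assumed convention)
    return n * (k11 * k00 - k10 * k01) ** 2 / denom

def top_k_terms(tables, k):
    """Rank terms by chi-square and return the k highest-scoring ones."""
    ranked = sorted(tables, key=lambda term: chi_square(*tables[term]), reverse=True)
    return ranked[:k]

# Toy counts per term: (k00, k01, k10, k11), made up for illustration.
tables = {
    "wheat": (90, 10, 20, 80),  # strongly associated with class 1
    "the": (50, 50, 50, 50),    # independent of the class
    "grain": (80, 20, 30, 70),
}
top = top_k_terms(tables, 2)
print(top)  # ['wheat', 'grain']
```

The term "the" scores exactly 0 because its cell counts match what independence predicts, so it is the first to be dropped.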


Feature selection: mutual information

• No document length scaling is needed

• Documents are assumed to be generated according to the multinomial model

$MI(X, Y) = \sum_x \sum_y P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)}$
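A minimal Python sketch of the mutual information between the class C and a term indicator It, estimated from the same 2x2 counts used for the χ² test. The base of the logarithm is not given on the slide; base 2 (bits) is assumed here, and cells with zero count are skipped under the usual 0 log 0 = 0 convention.

```python
import math

def mutual_information(k00, k01, k10, k11):
    """MI between class C and term indicator I_t, from 2x2 counts k_ij (C=i, I_t=j)."""
    n = k00 + k01 + k10 + k11
    counts = {(0, 0): k00, (0, 1): k01, (1, 0): k10, (1, 1): k11}
    p_c = {0: (k00 + k01) / n, 1: (k10 + k11) / n}  # class marginals
    p_t = {0: (k00 + k10) / n, 1: (k01 + k11) / n}  # term marginals
    mi = 0.0
    for (c, t), count in counts.items():
        if count == 0:
            continue  # 0 * log 0 is taken as 0
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / (p_c[c] * p_t[t]))
    return mi

mi_independent = mutual_information(50, 50, 50, 50)
mi_dependent = mutual_information(100, 0, 0, 100)
print(mi_independent, mi_dependent)  # 0.0 1.0
```

A term that perfectly predicts a balanced binary class carries one bit of information, while an independent term carries none.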


Naïve Bayesian classifiers

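• Naïve Bayesian classifier

$P(d \in C \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid d \in C)\, P(d \in C)}{P(F_1, F_2, \ldots, F_k)}$

• Assuming statistical independence

$P(d \in C \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid d \in C)\, P(d \in C)}{\prod_{j=1}^{k} P(F_j)}$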
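A minimal Python sketch of a Naïve Bayesian classifier built on the independence assumption. Since the denominator ∏ P(Fj) is the same for every class, the sketch compares only the numerators, in log space. Treating each token as a feature and using add-one (Laplace) smoothing are assumptions, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns priors, per-class token counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, token_counts, vocab

def classify_nb(tokens, priors, token_counts, vocab):
    """Pick the class maximizing log P(C) + sum_j log P(F_j | C)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(token_counts[label].values())
        score = math.log(prior)
        for tok in tokens:
            # Add-one smoothing keeps unseen tokens from zeroing the product (assumption).
            score += math.log((token_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, made up for illustration.
docs = [
    ("funds transfer confidential account".split(), "spam"),
    ("urgent transfer of funds".split(), "spam"),
    ("lecture notes on retrieval".split(), "ham"),
    ("course project on retrieval".split(), "ham"),
]
priors, token_counts, vocab = train_nb(docs)
label_a = classify_nb("confidential funds".split(), priors, token_counts, vocab)
label_b = classify_nb("retrieval course".split(), priors, token_counts, vocab)
print(label_a, label_b)  # spam ham
```

Working with sums of logs rather than products of probabilities avoids numeric underflow on long documents.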


Spam recognition

Return-Path: <[email protected]>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <[email protected]>
Reply-To: [email protected]
To: [email protected]
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday

DEAR SIR

FUNDS FOR INVESTMENTS

THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A


Well-known datasets

• 20 newsgroups
– http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/

• Reuters-21578
– Categories: grain, acquisitions, corn, crude, wheat, trade…

• WebKB
– http://www-2.cs.cmu.edu/~webkb/
– course, student, faculty, staff, project, dept, other
– NB performance (2000)
– P=26,43,18,6,13,2,94
– R=83,75,77,9,73,100,35
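The WebKB precision and recall figures can be folded into one number per category with the balanced F-measure, F1 = 2PR/(P+R). The sketch below assumes the P and R values are percentages listed in the same order as the category names on the slide; that pairing is an assumption.

```python
categories = ["course", "student", "faculty", "staff", "project", "dept", "other"]
precision = [26, 43, 18, 6, 13, 2, 94]   # percent, as listed on the slide
recall = [83, 75, 77, 9, 73, 100, 35]    # percent, as listed on the slide

def f1(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

f1_by_category = {c: f1(p, r) for c, p, r in zip(categories, precision, recall)}
for name, value in f1_by_category.items():
    print(f"{name:8s} {value:5.1f}")
```

Under that pairing, "dept" shows why a single measure is misleading on its own: recall of 100 with precision of 2 yields an F1 below 4.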


Support vector machines

• Introduced by Vapnik in the early 90s.


Semi-supervised learning

• EM

• Co-training

• Graph-based


Additional topics

• Soft margins

• VC dimension

• Kernel methods


• SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g. 86% accuracy on Reuters.

• NB also good in many circumstances


Readings

• Books:
• 1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Modern Information Retrieval, Addison-Wesley/ACM Press, 1999.
• 2. Pierre Baldi, Paolo Frasconi, Padhraic Smyth; Modeling the Internet and the Web: Probabilistic Methods and Algorithms; Wiley, 2003, ISBN: 0-470-84906-1

• Papers:
• Barabasi and Albert "Emergence of scaling in random networks" Science (286) 509-512, 1999
• Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998
• Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998
• Bush "As We May Think" The Atlantic Monthly 1945
• Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999
• Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998
• Davison "Topical locality on the Web" SIGIR 2000
• Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999
• Deerwester, Dumais, Landauer, Furnas, Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990


Readings

• Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004
• Jeong and Barabasi "Diameter of the world wide web" Nature (401) 130-131, 1999
• Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000
• Haveliwala "Topic-sensitive pagerank" WWW 2002
• Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal "The Web as a graph" PODS 2000
• Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999
• Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998
• Menczer "Links tell us about lexical and semantic Web content" arXiv 2001
• Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998
• Radev, Fan, Qi, Wu and Grewal "Probabilistic Question Answering on the Web" JASIST 2005
• Singhal "Modern Information Retrieval: an Overview" IEEE 2001


More readings

• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries


Thank you!

Благодаря!