(C) 2003, The University of Michigan 1 Information Retrieval Handout #3 February 10, 2003


Information Retrieval

Handout #3

February 10, 2003


Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: M&F 11-12

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Mondays, 1-4 PM in 409 West Hall


TF*IDF (cont’d)


Vector-based matching

• The cosine measure

$$\mathrm{sim}(D, C) = \frac{\sum_k \left( d_k \cdot c_k \cdot \mathrm{idf}(k) \right)}{\sqrt{\sum_k (d_k)^2 \cdot \sum_k (c_k)^2}}$$
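A minimal Python sketch of the cosine measure, assuming that d_k and c_k are the weights of term k in the two vectors D and C, that the numerator sums d_k · c_k · idf(k) over terms, and that the denominator is the square root of the product of the two sums of squares. The function name `cosine_sim` and the dictionary representation are illustrative choices, not part of the handout.

```python
import math


def cosine_sim(d, c, idf):
    """IDF-weighted cosine measure between two sparse term vectors.

    d, c: dicts mapping term -> weight (assumed meaning of d_k and c_k).
    idf:  dict mapping term -> idf(k); terms missing from it default to 1.0.
    """
    shared = set(d) & set(c)  # only shared terms contribute to the numerator
    numerator = sum(d[k] * c[k] * idf.get(k, 1.0) for k in shared)
    denominator = math.sqrt(
        sum(v * v for v in d.values()) * sum(v * v for v in c.values())
    )
    return numerator / denominator if denominator else 0.0


doc = {"deng": 2.0, "china": 1.0}
centroid = {"deng": 1.0, "beijing": 1.0}
print(cosine_sim(doc, centroid, {"deng": 3.0}))
```

Because idf(k) appears only in the numerator, the value is not bounded by 1 the way a plain cosine is.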


IDF: Inverse document frequency

$N$: number of documents
$d_k$: number of documents containing term $k$
$f_{ik}$: absolute frequency of term $k$ in document $i$
$w_{ik}$: weight of term $k$ in document $i$

$$\mathrm{idf}_k = \log_2 \frac{N}{d_k} + 1 = \log_2 N - \log_2 d_k + 1$$
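As a rough illustration, the IDF formula can be computed as follows. The weighting w_ik = f_ik · idf_k is an assumption suggested by the name TF*IDF; the handout does not spell it out. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def idf(n_docs, doc_freq):
    """idf_k = log2(N / d_k) + 1, with N documents and d_k containing term k."""
    return math.log2(n_docs / doc_freq) + 1


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists. Assumes w_ik = f_ik * idf_k.
    """
    n_docs = len(docs)
    # d_k: in how many documents each term occurs (count each document once)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: freq * idf(n_docs, doc_freq[term]) for term, freq in Counter(doc).items()}
        for doc in docs
    ]


toy_docs = [["deng", "china", "deng"], ["nato", "china"]]
for weights in tf_idf_weights(toy_docs):
    print(weights)
```

A term that occurs in every document gets idf = 1, so under this assumption its weight is just its raw frequency.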

TF * IDF is used for automated indexing and for topic discrimination:


Asian and European news

622.941 deng
306.835 china
196.725 beijing
153.608 chinese
152.113 xiaoping
124.591 jiang
108.777 communist
102.894 body
 85.173 party
 71.898 died
 68.820 leader
 43.402 state
 38.166 people

 97.487 nato
 92.151 albright
 74.652 belgrade
 46.657 enlargement
 34.778 alliance
 34.778 french
 33.803 opposition
 32.571 russia
 14.095 government
  9.389 told
  9.154 would
  8.459 their
  6.059 which


Other topics

120.385 shuttle
 99.487 space
 90.128 telescope
 70.224 hubble
 59.992 rocket
 50.160 astronauts
 49.722 discovery
 47.782 canaveral
 47.782 cape
 40.889 mission
 35.778 florida
 27.063 center

 74.652 compuserve
 65.321 massey
 55.989 salizzoni
 29.996 bob
 27.994 online
 27.198 executive
 15.890 interim
 15.271 chief
 11.647 service
 11.174 second
  6.781 world
  6.315 president


Semantic networks


Semantic Networks

• Used to represent relationships between words

• Example: WordNet - created by George Miller’s team at Princeton

• Based on synsets (synonyms, interchangeable words) and lexical matrices


Lexical matrix

                 Word Forms
Word Meanings    F1      F2      F3      …      Fn
M1               E1,1    E1,2
M2                       E2,2
…                                        …
Mm                                              Em,n


Synsets

• Disambiguation
  – {board, plank}
  – {board, committee}

• Synonyms
  – substitution
  – weak substitution
  – synonyms must be of the same part of speech


$ ./wn board -hypen

Synonyms/Hypernyms (Ordered by Frequency) of noun board

9 senses of board

Sense 1
board
    => committee, commission
        => administrative unit
            => unit, social unit
                => organization, organisation
                    => social group
                        => group, grouping

Sense 2
board
    => sheet, flat solid
        => artifact, artefact
            => object, physical object
                => entity, something

Sense 3
board, plank
    => lumber, timber
        => building material
            => artifact, artefact
                => object, physical object
                    => entity, something


Sense 4
display panel, display board, board
    => display
        => electronic device
            => device
                => instrumentality, instrumentation
                    => artifact, artefact
                        => object, physical object
                            => entity, something

Sense 5
board, gameboard
    => surface
        => artifact, artefact
            => object, physical object
                => entity, something

Sense 6
board, table
    => fare
        => food, nutrient
            => substance, matter
                => object, physical object
                    => entity, something


Sense 7
control panel, instrument panel, control board, board, panel
    => electrical device
        => device
            => instrumentality, instrumentation
                => artifact, artefact
                    => object, physical object
                        => entity, something

Sense 8
circuit board, circuit card, board, card
    => printed circuit
        => computer circuit
            => circuit, electrical circuit, electric circuit
                => electrical device
                    => device
                        => instrumentality, instrumentation
                            => artifact, artefact
                                => object, physical object
                                    => entity, something

Sense 9
dining table, board
    => table
        => furniture, piece of furniture, article of furniture
            => furnishings
                => instrumentality, instrumentation
                    => artifact, artefact
                        => object, physical object
                            => entity, something


Antonymy

• “x” vs. “not-x”

• “rich” vs. “poor”?

• {rise, ascend} vs. {fall, descend}


Other relations

• Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”.

• Hyponymy: {tree} is a hyponym of {plant}.

• Hierarchical structure based on hyponymy (and hypernymy).
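A toy sketch of such a hierarchy, using only the hypernym chain that the handout's WordNet output lists for sense 1 of "board". The dictionary `HYPERNYM` and both function names are hypothetical; this is not the WordNet API, and real synsets can have more than one hypernym.

```python
# Each synset (written as its comma-separated word list) points to its hypernym.
HYPERNYM = {
    "board": "committee, commission",
    "committee, commission": "administrative unit",
    "administrative unit": "unit, social unit",
    "unit, social unit": "organization, organisation",
    "organization, organisation": "social group",
    "social group": "group, grouping",
}


def hypernym_chain(synset, hypernym=HYPERNYM):
    """Follow hypernym links from a synset up to the top of the hierarchy."""
    chain = [synset]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain


def is_hyponym_of(x, y, hypernym=HYPERNYM):
    """True if y is reachable from x by following hypernym links."""
    return y in hypernym_chain(x, hypernym)[1:]


print(" => ".join(hypernym_chain("board")))
print(is_hyponym_of("board", "social group"))
```

Hyponymy is read here as the transitive closure of the direct links, which is one common convention rather than something the handout states.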


Other features of WordNet

• Index of familiarity

• Polysemy


Familiarity and polysemy

board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)


Compound nouns

advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board

blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees


Overview of senses

1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")


Top-level concepts

{act, action, activity}

{animal, fauna}

{artifact}

{attribute, property}

{body, corpus}

{cognition, knowledge}

{communication}

{event, happening}

{feeling, emotion}

{food}

{group, collection}

{location, place}

{motive}

{natural object}

{natural phenomenon}

{person, human being}

{plant, flora}

{possession}

{process}

{quantity, amount}

{relation}

{shape}

{state, condition}

{substance}

{time}


Properties of words


Word distributions

• Negative binomial distribution

$$F(k) = \binom{\alpha + k - 1}{k} \, p^k \, (1 + p)^{-\alpha - k}$$

• In the Brown corpus
  – the word “said” has p = 9.24 and α = 0.42
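A small numeric sketch of the negative binomial distribution, assuming the standard parameterization F(k) = C(α + k − 1, k) · p^k · (1 + p)^(−α − k), where F(k) is usually read as the fraction of documents in which a word occurs k times. The function name is illustrative. The generalized binomial coefficient is computed with log-gamma so that a non-integer α works.

```python
import math


def neg_binomial(k, p, alpha):
    """F(k) under the assumed negative binomial parameterization."""
    # log C(alpha + k - 1, k) = lgamma(alpha + k) - lgamma(k + 1) - lgamma(alpha)
    log_coeff = math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
    return math.exp(log_coeff + k * math.log(p) - (alpha + k) * math.log(1 + p))


# Parameters quoted in the handout for the word "said" in the Brown corpus.
P_SAID, ALPHA_SAID = 9.24, 0.42
for k in range(4):
    print(k, neg_binomial(k, P_SAID, ALPHA_SAID))
```

Under this parameterization the values F(0), F(1), … sum to 1, which gives a quick sanity check on the formula.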


Vocabulary growth

• Heaps’ Law

• V = vocabulary size

• $V = K n^{\beta}$, where n is the size of the text in words, and K and β depend on the text

• K is typically between 10 and 100, and β is less than 1 (for TREC-2 it’s between 0.4 and 0.6)
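A quick sketch of Heaps’ Law, reading the formula as V = K · n^β with n the text size in words. The values K = 50 and β = 0.5 below are placeholders picked from the ranges quoted above, not measured parameters.

```python
def heaps_vocabulary(n, K, beta):
    """Estimated vocabulary size V = K * n**beta for a text of n words."""
    return K * n ** beta


# Hypothetical parameters within the quoted ranges (10 <= K <= 100, beta < 1).
for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocabulary(n, K=50, beta=0.5)))
```

Because β is less than 1, vocabulary grows sublinearly: with β = 0.5, doubling the text multiplies the estimated vocabulary by about 1.41.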


Word length

• In TREC-2, word length is 5 characters on average.

• If stop words are removed, the average length increases to between 6 and 7 characters.


Word similarity

• Hamming distance - when words are of the same length

• Levenshtein distance - number of edits (insertions, deletions, replacements)
  – color --> colour (1)
  – survey --> surgery (2)
  – com puter --> computer ?

• Longest common subsequence (LCS)
  – lcs (survey, surgery) = surey
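The three measures above can be sketched with textbook dynamic programming. The function names are illustrative, and treating the space in "com puter" as an ordinary character (one deletion) is an assumption, since the handout leaves that case as an open question.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs words of the same length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # replace or match
        prev = cur
    return prev[-1]


def lcs(a, b):
    """One longest common subsequence of a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            table[i + 1][j + 1] = (table[i][j] + 1 if ca == cb
                                   else max(table[i][j + 1], table[i + 1][j]))
    out, i, j = [], len(a), len(b)
    while i and j:  # walk back through the table to recover the subsequence
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(levenshtein("color", "colour"), levenshtein("survey", "surgery"))
print(lcs("survey", "surgery"))
```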


Approximate string matching

• The Soundex algorithm (Odell and Russell)

• Uses:
  – spelling correction
  – hash function
  – non-recoverable


The Soundex algorithm

1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions.

2. Assign the following numbers to the remaining letters after the first:
   b, f, p, v : 1
   c, g, j, k, q, s, x, z : 2
   d, t : 3
   l : 4
   m, n : 5
   r : 6


The Soundex algorithm

3. If two or more letters with the same code were adjacent in the original name, omit all but the first.

4. Convert to the form “LDDD” (one letter followed by three digits) by adding terminal zeros or by dropping rightmost digits.

Examples:

Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300

These are the same codes as for Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively.

Some problems: related names such as Rogers and Rodgers, or Sinclair and StClair, receive different codes.
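The four steps can be sketched as follows. This follows the rules as stated in the handout, where "adjacent" means adjacent in the original name. Other Soundex variants also treat letters separated by h or w as adjacent, which this sketch does not do. The names `CODES` and `soundex` are illustrative.

```python
# Step 2 table: letter -> digit. Letters not listed (a, e, h, i, o, u, w, y) get no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Soundex code in the form LDDD, following the four steps above."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    prev_code = CODES.get(letters[0])  # the first letter's code counts for step 3
    for ch in letters[1:]:
        code = CODES.get(ch)           # None for the letters dropped in step 1
        if code is not None and code != prev_code:
            digits.append(code)        # step 3: skip repeats of an adjacent code
        prev_code = code
    # Step 4: pad with zeros or truncate to one letter plus three digits.
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for name in ("Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Rogers", "Rodgers"):
    print(name, soundex(name))
```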


Compression


Compression

• Huffman coding (prefix property)

• Ziv-Lempel codes (better)


Huffman coding

• Developed by David Huffman (1952)

• Average of 5 bits per character

• Based on frequency distributions of symbols

• Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols


Symbol Frequency

A 7

B 4

C 10

D 5

E 2

F 11

G 15

H 3

I 7

J 8


[Figure: Huffman tree for the symbols a to j. Each branch is labeled 0 or 1 and the symbols sit at the leaves.]


Symbol Code

A 0110

B 0010

C 000

D 0011

E 01110

F 010

G 10

H 01111

I 110

J 111
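A compact sketch of the tree-building algorithm, run on the symbol frequencies from the example. Because several frequencies tie, the codewords it produces may differ from the table above, but any Huffman code for these frequencies has the same total cost. The function names and the heap-based construction are illustrative choices, not necessarily how the handout's tree was built.

```python
import heapq
from itertools import count

FREQS = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}
# Code table copied from the handout.
SLIDE_CODES = {"A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
               "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111"}


def huffman_code(freqs):
    """Build a prefix code by repeatedly merging the two least frequent subtrees."""
    tiebreak = count()  # unique counter so the heap never compares subtrees
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):   # internal node: (left, right)
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                         # leaf: a symbol
            codes[tree] = prefix or "0"

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    return "".join(codes[sym] for sym in text)


def decode(bits, codes):
    """Return (decoded symbols, leftover bits that form no complete codeword)."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:        # the prefix property makes this unambiguous
            out.append(inverse[current])
            current = ""
    return "".join(out), current


def total_bits(codes, freqs=FREQS):
    return sum(freqs[sym] * len(code) for sym, code in codes.items())


print(total_bits(huffman_code(FREQS)), total_bits(SLIDE_CODES))
print(decode(encode("CAB", SLIDE_CODES), SLIDE_CODES))
```

The `decode` helper returns any leftover bits, which is convenient when experimenting with inserted or deleted bits.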


Exercise 1

• Consider the bit string: 01101101111000100110001110100111000110101101011101

• Use the Huffman code from the example to decode it.

• Try inserting, deleting, and switching some bits at random locations and try decoding.


Ziv-Lempel coding

• Two types - one is known as LZ77 (used in GZIP)

• Code: set of triples <a,b,c>
  – a: how far back in the decoded text to look for the upcoming text segment
  – b: how many characters to copy
  – c: new character to add to complete segment


• <0,0,p> p
• <0,0,e> pe
• <0,0,t> pet
• <2,1,r> peter
• <0,0,_> peter_
• <6,1,i> peter_pi
• <8,2,r> peter_piper
• <6,3,c> peter_piper_pic
• <0,0,k> peter_piper_pick
• <7,1,d> peter_piper_picked
• <7,1,a> peter_piper_picked_a
• <9,2,e> peter_piper_picked_a_pe
• <9,2,_> peter_piper_picked_a_peck_
• <0,0,o> peter_piper_picked_a_peck_o
• <0,0,f> peter_piper_picked_a_peck_of
• <17,5,l> peter_piper_picked_a_peck_of_pickl
• <12,1,d> peter_piper_picked_a_peck_of_pickled
• <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
• <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
• <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers


No. of code triples   Average text length   No. of code triples   Average text length
 1                    1.00                  11                    1.82
 2                    1.00                  12                    1.92
 3                    1.00                  13                    2.00
 4                    1.25                  14                    1.93
 5                    1.20                  15                    1.87
 6                    1.33                  16                    2.13
 7                    1.57                  17                    2.12
 8                    1.88                  18                    2.22
 9                    1.78                  19                    2.26
10                    1.80                  20                    2.20
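The LZ77 example can be replayed with a short decoder sketch. It assumes the triple semantics given earlier (a = distance back, b = number of characters to copy, c = literal character) and takes the "average text length" column to be decoded characters divided by the number of triples read so far. The function names are illustrative.

```python
TRIPLES = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]


def lz77_decode(triples):
    """Rebuild the text from <a,b,c> triples."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])  # copy from text that is already decoded
        out.append(char)                # then add the new character
    return "".join(out)


def average_lengths(triples):
    """Decoded characters per triple after each triple has been read."""
    averages, total = [], 0
    for n, (_back, length, _char) in enumerate(triples, 1):
        total += length + 1
        averages.append(total / n)
    return averages


print(lz77_decode(TRIPLES))
print([round(x, 2) for x in average_lengths(TRIPLES)])
```

The average rises as the decoded text grows, because longer matches become available to copy.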


Markup languages


Markup languages

• HTML

• SGML

• XML


HTML

• Focus on presentation, not content


SGML

<!SGML "ISO 8879:1986"
CHARSET
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET   0  9 UNUSED
          9  2 9
         11  2 UNUSED
         13  1 13
         14 18 UNUSED
         32 95 32
        127  1 UNUSED
BASESET "ISO Registration Number 109//CHARSET ECMA-94 Right Part of Latin-1 Alphabet Nr.3//ESC 2/9 4/3"
DESCSET 128 32 UNUSED -- no such characters --
        160  1 UNUSED -- nbs character --
        161 94 161 -- 161 through 254 inclusive --
        255  1 UNUSED

CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
SCOPE DOCUMENT
SYNTAX
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 128 0
FUNCTION RE 13 RS 10 SPACE 32 TAB SEPCHAR 9
NAMING LCNMSTRT "" UCNMSTRT "" LCNMCHAR "_-." UCNMCHAR "_-." NAMECASE GENERAL NO ENTITY NO
DELIM GENERAL SGMLREF SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF ATTCNT 99999999 ATTSPLEN 99999999 DTEMPLEN 24000 ENTLVL 99999999 GRPCNT 99999999 GRPGTCNT 99999999 GRPLVL 99999999 LITLEN 24000 NAMELEN 99999999
         PILEN 24000 TAGLEN 99999999 TAGLVL 99999999
FEATURES
MINIMIZE DATATAG NO OMITTAG YES RANK YES SHORTTAG YES
LINK SIMPLE YES 1000 IMPLICIT YES EXPLICIT YES 1
OTHER CONCUR NO SUBDOC YES 99999999 FORMAL YES
APPINFO NONE>

<!DOCTYPE DOCSET [
<!--
File: asr.dtd
Author: Jon Fiscus, NIST
Desc: This DTD is intended to parse a TDT2 .tkn file.
-->
<!ELEMENT DOCSET - O (X|W)+>
<!ELEMENT X - O EMPTY >
<!ELEMENT W - O CDATA >

<!ATTLIST DOCSET
    type (ASRTEXT|NEWSWIRE|CAPTION|TRANSCRIPT|SYSTRAN|ASR_SYSTRAN) #REQUIRED
    fileid CDATA #REQUIRED
    collect_date CDATA #REQUIRED
    collect_src CDATA #REQUIRED
    src_lang CDATA #REQUIRED
    content_lang CDATA #REQUIRED
    proc_remarks CDATA #IMPLIED >

<!ATTLIST W
    recid CDATA #REQUIRED
    Bsec CDATA #IMPLIED
    Dur CDATA #IMPLIED
    Clust CDATA #IMPLIED
    Conf CDATA #IMPLIED
    tr (Y|N) #IMPLIED >

<!ATTLIST X
    Bsec CDATA #IMPLIED
    Dur CDATA #IMPLIED
    Conf (NA) #IMPLIED >
]>


XML

example.docsent

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE DOCSENT SYSTEM "../../../../../dtd/docsent.dtd" >
<DOCSENT DID='D-20000408_011.e' DOCNO='17706' LANG='ENG' CORR-DOC='D-20000408_017.c'>
<BODY>
<HEADLINE><S PAR="1" RSNT="1" SNO="1"> Beat Drugs Fund Grants $16 million in Support of 29 Anti-Drug Projects </S></HEADLINE>
<TEXT>
<S PAR='2' RSNT='1' SNO='2'>The Governing Committee of the Beat Drugs Fund , chaired by the Secretary for Security , has approved grants of $16 .39 million for 29 anti-drug projects this year .</S>
<S PAR='3' RSNT='1' SNO='3'>The Commissioner for Narcotics , Mrs Clarie Lo , who is also a member of the Governing Committee , said , "The number of drug abusers aged below 21 dropped by 13 .6 per cent from 2829 in 1998 to 2 443 in 1999 .</S>
<S PAR='3' RSNT='2' SNO='4'>Despite the continuing drop in recent years , we recognise that youths-at-risk are a highly vulnerable group and deserve the full attention of all those working in the anti-drug field . "</S>
<S PAR='4' RSNT='1' SNO='5'> "To prevent our younger generation from abusing drugs , education and publicity is an on-going campaign; and any relaxation in efforts might have adverse consequences , " Mrs Lo added .</S>
<S PAR='5' RSNT='1' SNO='6'>In considering this year 's applications for the Fund , the Governing Committee attached importance to those aiming to steer youths-at-risk away from drugs .</S>
<S PAR='6' RSNT='1' SNO='7'>Amongst the 29 projects approved this year , 22 are related to drug prevention education and publicity ($10 .72 million) , five to treatment and rehabilitation ($2 .98 million)and two to research ($2 .69 million) .</S>
<S PAR='7' RSNT='1' SNO='8'>An amount of $2 .08 million was granted to conduct a pioneering longitudinal research on the development and validation of a drug prevention programme in Hong Kong .</S>
<S PAR='8' RSNT='1' SNO='9'>Youths-at-risk aged between 10 to 15 in selected areas including Tuen Mun and Kwun Tong will be invited to take part in the project .</S>
<S PAR='8' RSNT='2' SNO='10'>Participants will be taught on the adverse effect of drug abuse , social and personal skills to help them identify and resist peer influence to use drugs .</S>
</TEXT>
</BODY>
</DOCSENT>

docsent.dtd

<!-- DTD for sentence-segmented text -->

<!ELEMENT DOCSENT (EXTRACTION-INFO?, BODY)>
<!ATTLIST DOCSENT
    DID CDATA #REQUIRED
    DOCNO CDATA #IMPLIED
    LANG (CHIN|ENG) "ENG"
    CORR-DOC CDATA #IMPLIED>
<!-- DID : documentid
     LANG: language -->

<!ELEMENT EXTRACTION-INFO EMPTY>
<!ATTLIST EXTRACTION-INFO
    SYSTEM CDATA #REQUIRED
    RUN CDATA #IMPLIED
    COMPRESSION CDATA #REQUIRED
    QID CDATA #REQUIRED>

<!ELEMENT BODY (HEADLINE?,TEXT)>

<!ELEMENT HEADLINE (S)*>
<!ELEMENT TEXT (S)*>

<!ELEMENT S (#PCDATA)>
<!ATTLIST S
    PAR CDATA #REQUIRED
    RSNT CDATA #REQUIRED
    SNO CDATA #REQUIRED>
<!-- PAR: paragraph no
     RSNT: relative sentence no (within paragraph)
     SNO: absolute sentence no -->
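A brief sketch of how a document in the example.docsent format might be read with Python's standard library. The sample is abridged from the example above (headline plus the two sentences of paragraph 8), the DOCTYPE line is left out so that no external DTD is needed, and the function name `sentences` is hypothetical. Nothing here validates against docsent.dtd.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example document, without the DOCTYPE declaration.
SAMPLE = """<DOCSENT DID='D-20000408_011.e' DOCNO='17706' LANG='ENG'>
<BODY>
<HEADLINE><S PAR="1" RSNT="1" SNO="1"> Beat Drugs Fund Grants $16 million in Support of 29 Anti-Drug Projects </S></HEADLINE>
<TEXT>
<S PAR='8' RSNT='1' SNO='9'>Youths-at-risk aged between 10 to 15 in selected areas including Tuen Mun and Kwun Tong will be invited to take part in the project .</S>
<S PAR='8' RSNT='2' SNO='10'>Participants will be taught on the adverse effect of drug abuse , social and personal skills to help them identify and resist peer influence to use drugs .</S>
</TEXT>
</BODY>
</DOCSENT>"""


def sentences(xml_text):
    """Return (paragraph no, relative sentence no, absolute sentence no, text) per <S>."""
    root = ET.fromstring(xml_text)
    return [
        (int(s.get("PAR")), int(s.get("RSNT")), int(s.get("SNO")), (s.text or "").strip())
        for s in root.iter("S")
    ]


for par, rsnt, sno, text in sentences(SAMPLE):
    print(par, rsnt, sno, text[:40])
```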