CS 430: Information Discovery, Lecture 16: Thesaurus Construction

Upload: tobias-harris

Post on 20-Jan-2016



CS 430: Information Discovery

Lecture 16

Thesaurus Construction


Course Administration

• Midterm examination
  - Grades will be mailed over the weekend
  - Answer books will not be returned
  - Most questions will be discussed in class
  - Question paper will be posted on the course web site


Decisions in creating a thesaurus

1. Which terms should be included in the thesaurus?

2. How should the terms be grouped?


Terms to include

• Only terms that are likely to be of interest for content identification

• Ambiguous terms should be coded for the senses likely to be important in the document collection

• Each thesaurus class should have approximately the same frequency of occurrence

• Terms of negative discrimination should be eliminated

after Salton and McGill


Discriminant value

The discriminant value of a term k is the degree to which the term is able to discriminate between the documents of a collection:

discriminant value of term k = (average document similarity without term k) - (average document similarity with term k)

Good discriminators decrease the average document similarity, so removing them raises it and their discriminant value is positive.

Note that this definition uses the document similarity.


Incidence array

D1: alpha bravo charlie delta echo foxtrot golf

D2: golf golf golf delta alpha

D3: bravo charlie bravo echo foxtrot bravo

D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf  total
D1      1      1       1       1     1       1      1      7
D2      1      0       0       1     0       0      1      3
D3      0      1       1       0     1       1      0      4
D4      1      0       0       1     0       1      1      4


Document similarity matrix

       D1     D2     D3     D4
D1      -    0.65   0.76   0.76
D2    0.65    -     0.00   0.87
D3    0.76   0.00    -     0.25
D4    0.76   0.87   0.25    -

Average similarity = 0.55
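The slide does not say which document similarity measure produced this matrix. As a minimal sketch, assuming the cosine measure applied to the rows of the incidence array (an assumption, though it does reproduce the values above), the matrix and its average can be recomputed as follows. The names `DOCS`, `cosine`, `pairwise_similarities` and `average_similarity` are illustrative only.

```python
from itertools import combinations
from math import sqrt

# Each document reduced to the set of terms it contains (its incidence row).
DOCS = {
    "D1": {"alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"},
    "D2": {"golf", "delta", "alpha"},
    "D3": {"bravo", "charlie", "echo", "foxtrot"},
    "D4": {"foxtrot", "alpha", "golf", "delta"},
}


def cosine(a, b):
    """Cosine similarity of two binary vectors represented as term sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def pairwise_similarities(docs):
    """Similarity for every unordered pair of distinct documents."""
    return {
        (x, y): cosine(docs[x], docs[y])
        for x, y in combinations(sorted(docs), 2)
    }


def average_similarity(docs):
    """Mean similarity over all pairs of distinct documents."""
    sims = pairwise_similarities(docs)
    return sum(sims.values()) / len(sims)


for pair, value in pairwise_similarities(DOCS).items():
    print(pair, round(value, 2))
print("average similarity:", round(average_similarity(DOCS), 2))
```

The average is taken over the six distinct document pairs, so the empty diagonal of the matrix does not contribute.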


Discriminant value

Average similarity = 0.55

term removed   average similarity without term   DV
alpha          0.53                              -0.02
bravo          0.56                              +0.01
charlie        0.56                              +0.01
delta          0.53                              -0.02
echo           0.56                              +0.01
foxtrot        0.52                              -0.03
golf           0.53                              -0.02
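The table can be recomputed by deleting one term at a time and averaging the document similarities again. This is a sketch under the same assumption as before, namely cosine similarity on the incidence array; `average_without` and `discriminant_value` are hypothetical helper names. The slide's DV column appears to be the difference of the two rounded averages, so an unrounded DV can differ from it in the last digit.

```python
from itertools import combinations
from math import sqrt

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

DOCS = {
    "D1": {"alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"},
    "D2": {"golf", "delta", "alpha"},
    "D3": {"bravo", "charlie", "echo", "foxtrot"},
    "D4": {"foxtrot", "alpha", "golf", "delta"},
}


def cosine(a, b):
    """Cosine similarity of two binary vectors represented as term sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def average_similarity(docs):
    """Mean cosine similarity over all pairs of distinct documents."""
    pairs = list(combinations(docs.values(), 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def average_without(docs, term):
    """Average similarity after deleting `term` from every document."""
    reduced = {name: terms - {term} for name, terms in docs.items()}
    return average_similarity(reduced)


def discriminant_value(docs, term):
    """(average similarity without term) - (average similarity with term)."""
    return average_without(docs, term) - average_similarity(docs)


for term in TERMS:
    print(term,
          round(average_without(DOCS, term), 2),
          round(discriminant_value(DOCS, term), 3))
```

With this sign convention a positive value marks a good discriminator: bravo, charlie and echo come out positive, while alpha, delta, foxtrot and golf come out negative.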


Similarities

Automatic thesaurus construction depends on a measure of similarity between terms

One measure of similarity is the number of documents that contain both term j and term k:

$$S(t_j, t_k) = \sum_{i=1}^{n} t_{ij} t_{ik}$$

where $t_{ij} = 1$ if document i contains term j and 0 otherwise, and n is the number of documents.
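As a small illustration of this count, the sketch below applies the sum to the four-document incidence array used earlier in the lecture. The names `INCIDENCE` and `cooccurrence` are illustrative, not from the slides.

```python
TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Rows are documents D1..D4, columns follow the order of TERMS.
INCIDENCE = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3
    [1, 0, 0, 1, 0, 1, 1],  # D4
]


def cooccurrence(incidence, j, k):
    """Sum over documents i of t_ij * t_ik for term columns j and k."""
    return sum(row[j] * row[k] for row in incidence)


alpha, bravo, delta = TERMS.index("alpha"), TERMS.index("bravo"), TERMS.index("delta")
print(cooccurrence(INCIDENCE, alpha, delta))  # documents containing both alpha and delta
print(cooccurrence(INCIDENCE, alpha, bravo))  # documents containing both alpha and bravo
```

Because the entries are 0 or 1, each product is 1 exactly when both terms occur in the document, so the sum is a plain count of shared documents.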


Similarity measures

Improved similarity measures can be generated by:

• Using term frequency matrix instead of incidence matrix

• Weighting terms by frequency:

cosine measure

$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} t_{ik}}{|t_j| \, |t_k|}$$

dice measure

$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} t_{ik}}{\sum_{i=1}^{n} t_{ik} + \sum_{i=1}^{n} t_{ij}}$$
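To illustrate the first bullet, the sketch below computes the cosine measure between two terms twice: once from incidence (0/1) columns and once from term-frequency columns. It reuses the four sample documents from the incidence-array slide; `term_vector` and `cosine` are hypothetical helper names.

```python
from math import sqrt

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}


def term_vector(term, docs, binary):
    """Column of the term-document matrix for `term`, one entry per document."""
    counts = [text.split().count(term) for text in docs.values()]
    return [min(c, 1) for c in counts] if binary else counts


def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


for binary in (True, False):
    u = term_vector("alpha", DOCS, binary)
    v = term_vector("golf", DOCS, binary)
    label = "incidence" if binary else "term frequency"
    print(label, u, v, round(cosine(u, v), 3))
```

With incidence columns alpha and golf look identical, since they occur in the same three documents. With term frequencies their similarity drops below 1, because the two terms are repeated a different number of times in D2.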


Similarities: Incidence array

D1: alpha bravo charlie delta echo foxtrot golf

D2: golf golf golf delta alpha

D3: bravo charlie bravo echo foxtrot bravo

D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf
D1      1      1       1       1     1       1      1
D2      1      0       0       1     0       0      1
D3      0      1       1       0     1       1      0
D4      1      0       0       1     0       1      1
n       3      2       2       3     2       3      3


Term similarity matrix

          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha       -     0.2     0.2     0.5   0.2    0.33    0.5
bravo      0.2     -      0.5     0.2   0.5    0.4     0.2
charlie    0.2    0.5      -      0.2   0.5    0.4     0.2
delta      0.5    0.2     0.2      -    0.2    0.33    0.5
echo       0.2    0.5     0.5     0.2    -     0.4     0.2
foxtrot    0.33   0.4     0.4     0.33  0.4     -      0.33
golf       0.5    0.2     0.2     0.5   0.2    0.33     -

Using incidence matrix and dice weighting
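The entries of this matrix can be reproduced from the incidence array. In this sketch the Dice measure is taken exactly as the values imply: shared documents divided by the sum of the two document counts. The textbook Dice coefficient has an extra factor of 2 in the numerator, which these values do not include. The function name `dice_as_on_slide` is hypothetical.

```python
TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

DOCS = {
    "D1": {"alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"},
    "D2": {"golf", "delta", "alpha"},
    "D3": {"bravo", "charlie", "echo", "foxtrot"},
    "D4": {"foxtrot", "alpha", "golf", "delta"},
}


def dice_as_on_slide(j, k, docs):
    """Documents containing both terms / (documents with j + documents with k)."""
    both = sum(1 for terms in docs.values() if j in terms and k in terms)
    n_j = sum(1 for terms in docs.values() if j in terms)
    n_k = sum(1 for terms in docs.values() if k in terms)
    return both / (n_j + n_k) if (n_j + n_k) else 0.0


for j in TERMS:
    row = ["-" if j == k else format(dice_as_on_slide(j, k, DOCS), ".2f") for k in TERMS]
    print(j.ljust(8), " ".join(cell.rjust(5) for cell in row))
```

For example, alpha and delta share three documents and each occurs in three, giving 3 / (3 + 3) = 0.5.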


Clustering -- nearest neighbor

[Figure: nearest-neighbor clustering of the terms alpha, delta, golf, echo, bravo, charlie and foxtrot, with links numbered 1 to 6]
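The exact links in the figure cannot be recovered from the transcript. As a rough sketch, assuming "nearest neighbor" means single-link clustering, the code below joins any two terms whose similarity reaches a threshold and reports the resulting groups. It uses the term similarities implied by the earlier matrix (shared documents divided by the sum of document counts); `single_link_clusters` and `sim` are illustrative names.

```python
from itertools import combinations

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

DOCS = {
    "D1": {"alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"},
    "D2": {"golf", "delta", "alpha"},
    "D3": {"bravo", "charlie", "echo", "foxtrot"},
    "D4": {"foxtrot", "alpha", "golf", "delta"},
}


def sim(j, k):
    """Term similarity as in the term similarity matrix (no factor of 2)."""
    both = sum(1 for terms in DOCS.values() if j in terms and k in terms)
    n_j = sum(1 for terms in DOCS.values() if j in terms)
    n_k = sum(1 for terms in DOCS.values() if k in terms)
    return both / (n_j + n_k)


def single_link_clusters(terms, similarity, threshold):
    """Group terms connected by a chain of links with similarity >= threshold."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent pointers to the cluster representative.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())


for threshold in (0.5, 0.3):
    print(threshold, single_link_clusters(TERMS, sim, threshold))
```

Lowering the threshold merges clusters: at 0.5 foxtrot stays on its own, while at 0.3 every term ends up in a single group.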


Phrase construction

In a thesaurus, term classes may contain phrases.

Informal definitions:

pair-frequency (i, j) is the frequency with which a pair of words occurs in context (e.g., in succession within a sentence)

phrase is a pair of words, i and j, that occur in context with a higher frequency than would be expected from their overall frequency

$$\text{cohesion}(i, j) = \frac{\text{pair-frequency}(i, j)}{\text{frequency}(i) \cdot \text{frequency}(j)}$$


Phrase construction

Salton and McGill algorithm

1. Compute pair-frequency for all pairs of terms.

2. Reject all pairs whose pair-frequency falls below a certain threshold.

3. Calculate cohesion values.

4. If cohesion is above a threshold value, consider the word pair a phrase.

Automatic phrase construction by statistical methods is rarely used in practice. There is promising research on phrase identification using methods of computational linguistics.
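The four steps above can be sketched as follows. This is an illustration rather than Salton and McGill's own implementation: it assumes "in context" means adjacent words within a sentence, and the sample sentences, the thresholds and the function name `find_phrases` are all made up for the example.

```python
from collections import Counter


def find_phrases(sentences, min_pair_freq, min_cohesion):
    """Return {(word_i, word_j): cohesion} for pairs that pass both thresholds."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        word_freq.update(words)
        # Step 1: count pairs of words that occur in succession within a sentence.
        pair_freq.update(zip(words, words[1:]))

    phrases = {}
    for (i, j), freq in pair_freq.items():
        # Step 2: reject pairs whose pair-frequency is below the threshold.
        if freq < min_pair_freq:
            continue
        # Step 3: cohesion = pair-frequency / (frequency(i) * frequency(j)).
        cohesion = freq / (word_freq[i] * word_freq[j])
        # Step 4: keep the pair as a phrase if cohesion is above the threshold.
        if cohesion > min_cohesion:
            phrases[(i, j)] = cohesion
    return phrases


# Hypothetical sample text, not taken from the lecture.
SENTENCES = [
    "information retrieval systems index documents",
    "the thesaurus supports information retrieval",
    "the system ranks the documents",
    "information retrieval uses the thesaurus",
]

print(find_phrases(SENTENCES, min_pair_freq=2, min_cohesion=0.3))
```

In this toy input "the thesaurus" occurs twice but is rejected on cohesion, because "the" is frequent overall. Dividing by the individual word frequencies is what penalizes such pairs.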


Types of Information Discovery

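[Diagram: types of information discovery. Labels recoverable from the slide:]

• media type: text; image, video, audio, etc.

• searching; browsing

• linking

• statistical; user-in-loop; catalogs, indexes (metadata)

• natural language processing

• course labels: CS 502, CS 474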