TRANSCRIPT
1
CS 430: Information Discovery
Lecture 16
Thesaurus Construction
2
Course Administration
• Midterm examination
  – Grades will be mailed over the weekend
  – Answer books will not be returned
  – Most questions will be discussed in class
  – The question paper will be posted on the course web site
3
Decisions in creating a thesaurus
1. Which terms should be included in the thesaurus?
2. How should the terms be grouped?
4
Terms to include
• Only terms that are likely to be of interest for content identification
• Ambiguous terms should be coded for the senses likely to be important in the document collection
• Each thesaurus class should have approximately the same frequency of occurrence
• Terms of negative discrimination should be eliminated
after Salton and McGill
5
Discriminant value
Discriminant value is the degree to which a term is able to discriminate between the documents of a collection:

DV(k) = (average document similarity without term k)
      − (average document similarity with term k)

Good discriminators decrease the average document similarity.
Note that this definition uses the document similarity, not the term similarity.
6
Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
         alpha  bravo  charlie  delta  echo  foxtrot  golf
D1         1      1      1        1      1     1       1     (7 terms)
D2         1      0      0        1      0     0       1     (3 terms)
D3         0      1      1        0      1     1       0     (4 terms)
D4         1      0      0        1      0     1       1     (4 terms)
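The incidence array above can be built directly from the example documents. A minimal sketch in Python (the dictionary and variable names are illustrative):

```python
# Build the incidence array: incidence[d][j] = 1 if document d
# contains term j, else 0.
docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}
terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

incidence = {
    d: [1 if t in text.split() else 0 for t in terms]
    for d, text in docs.items()
}
print(incidence["D2"])  # [1, 0, 0, 1, 0, 0, 1]
```

Repeated occurrences (e.g., "golf" three times in D2) collapse to a single 1; the incidence array records only presence or absence.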
7
Document similarity matrix
       D1     D2     D3     D4
D1      -    0.65   0.76   0.76
D2    0.65    -     0.00   0.87
D3    0.76   0.00    -     0.25
D4    0.76   0.87   0.25    -
Average similarity = 0.55
8
Discriminant value
Average similarity = 0.55
term       average similarity without term    DV
alpha      0.53                              -0.02
bravo      0.56                              +0.01
charlie    0.56                              +0.01
delta      0.53                              -0.02
echo       0.56                              +0.01
foxtrot    0.52                              -0.03
golf       0.53                              -0.02
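The discriminant values above can be reproduced by recomputing the average pairwise document similarity with each term's column removed. A sketch, assuming the cosine measure on the incidence array (the signs match the slide; small numerical differences come from rounding):

```python
from itertools import combinations
from math import sqrt

# Incidence array from the earlier slide (rows: documents, columns: terms).
terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
docs = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3
    [1, 0, 0, 1, 0, 1, 1],  # D4
]

def cosine(a, b):
    # Cosine similarity between two binary document vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(a)) * sqrt(sum(b))
    return dot / norm if norm else 0.0

def avg_similarity(vectors):
    # Average similarity over all pairs of documents.
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

base = avg_similarity(docs)  # about 0.55, as on the slide
dv = {}
for k, term in enumerate(terms):
    # Drop column k and recompute the average similarity.
    without = [[v for j, v in enumerate(d) if j != k] for d in docs]
    dv[term] = avg_similarity(without) - base
```

Terms with negative DV (alpha, delta, foxtrot, golf) raise the average similarity when present, so they are poor discriminators; the rarer terms (bravo, charlie, echo) have positive DV.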
9
Similarities
Automatic thesaurus construction depends on a measure of similarity between terms.

One measure of similarity is the number of documents that have terms j and k in common:

S(tj, tk) = Σ_{i=1}^{n} tij tik

where tij = 1 if document i contains term j and 0 otherwise, and n is the number of documents.
10
Similarity measures
Improved similarity measures can be generated by:
• Using the term frequency matrix instead of the incidence matrix
• Weighting terms by frequency:

cosine measure:

S(tj, tk) = ( Σ_{i=1}^{n} tij tik ) / ( |tj| |tk| )

dice measure:

S(tj, tk) = ( Σ_{i=1}^{n} tij tik ) / ( Σ_{i=1}^{n} tij + Σ_{i=1}^{n} tik )
11
Similarities: Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
         alpha  bravo  charlie  delta  echo  foxtrot  golf
D1         1      1      1        1      1     1       1
D2         1      0      0        1      0     0       1
D3         0      1      1        0      1     1       0
D4         1      0      0        1      0     1       1
n          3      2      2        3      2     3       3
12
Term similarity matrix
          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha       -     0.2    0.2      0.5    0.2   0.33    0.5
bravo      0.2     -     0.5      0.2    0.5   0.4     0.2
charlie    0.2    0.5     -       0.2    0.5   0.4     0.2
delta      0.5    0.2    0.2       -     0.2   0.33    0.5
echo       0.2    0.5    0.5      0.2     -    0.4     0.2
foxtrot    0.33   0.4    0.4      0.33   0.4    -      0.33
golf       0.5    0.2    0.2      0.5    0.2   0.33     -
Using incidence matrix and dice weighting
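The matrix values match the dice formula applied to the incidence array, but without the conventional factor of 2 in the numerator (e.g., alpha and bravo co-occur in one document, and 1 / (3 + 2) = 0.2). A sketch that recomputes the entries:

```python
# Recompute the term-term similarity matrix with the dice-style weighting
# used on the slide: S(tj, tk) = sum_i tij*tik / (sum_i tij + sum_i tik).
terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
docs = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3
    [1, 0, 0, 1, 0, 1, 1],  # D4
]

def dice(j, k):
    # Number of documents containing both terms, divided by the
    # sum of the documents containing each term.
    common = sum(d[j] * d[k] for d in docs)
    return common / (sum(d[j] for d in docs) + sum(d[k] for d in docs))

i = terms.index
print(round(dice(i("alpha"), i("delta")), 2))  # 0.5
print(round(dice(i("alpha"), i("bravo")), 2))  # 0.2
```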
13
Clustering -- nearest neighbor
[Figure: the seven terms plotted with nearest-neighbour links drawn in order of decreasing similarity, numbered 1–6. alpha, delta, and golf form one group; bravo, charlie, echo, and foxtrot form another.]
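The grouping in the figure can be reproduced with a simple single-link scheme (an assumption for illustration: connect every pair of terms whose similarity clears a threshold, then take connected components):

```python
# Term-similarity values from the matrix on the previous slide (upper triangle).
sim = {
    ("alpha", "bravo"): 0.2, ("alpha", "charlie"): 0.2, ("alpha", "delta"): 0.5,
    ("alpha", "echo"): 0.2, ("alpha", "foxtrot"): 0.33, ("alpha", "golf"): 0.5,
    ("bravo", "charlie"): 0.5, ("bravo", "delta"): 0.2, ("bravo", "echo"): 0.5,
    ("bravo", "foxtrot"): 0.4, ("bravo", "golf"): 0.2,
    ("charlie", "delta"): 0.2, ("charlie", "echo"): 0.5,
    ("charlie", "foxtrot"): 0.4, ("charlie", "golf"): 0.2,
    ("delta", "echo"): 0.2, ("delta", "foxtrot"): 0.33, ("delta", "golf"): 0.5,
    ("echo", "foxtrot"): 0.4, ("echo", "golf"): 0.2,
    ("foxtrot", "golf"): 0.33,
}
terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

def clusters(threshold):
    # Single-link clustering via union-find: merge the groups of any
    # two terms whose similarity is at least the threshold.
    parent = {t: t for t in terms}
    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t
    for (a, b), s in sim.items():
        if s >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for t in terms:
        groups.setdefault(find(t), set()).add(t)
    return sorted(groups.values(), key=min)

# At threshold 0.4 two groups emerge:
# {alpha, delta, golf} and {bravo, charlie, echo, foxtrot}
print(clusters(0.4))
```

Lowering the threshold to 0.33 merges everything into one class; raising it above 0.5 leaves every term on its own, so the threshold controls the granularity of the thesaurus classes.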
14
Phrase construction
In a thesaurus, term classes may contain phrases.
Informal definitions:
pair-frequency(i, j) is the frequency with which a pair of words occurs in context (e.g., in succession within a sentence)

a phrase is a pair of words, i and j, that occur in context with a higher frequency than would be expected from their overall frequencies

cohesion(i, j) = pair-frequency(i, j) / (frequency(i) × frequency(j))
15
Phrase construction
Salton and McGill algorithm
1. Compute pair-frequency for all terms.
2. Reject all pairs that fall below a certain threshold.
3. Calculate cohesion values for the remaining pairs.
4. If the cohesion is above a threshold value, consider the word pair to be a phrase.

Automatic phrase construction by statistical methods is rarely used in practice. There is promising research on phrase identification using methods of computational linguistics.
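The four steps above can be sketched on a toy corpus. The sentences and both thresholds here are invented for illustration, not taken from the lecture:

```python
from collections import Counter

# Toy corpus (illustrative).
sentences = [
    "information retrieval systems rank documents",
    "modern information retrieval uses statistics",
    "documents contain information",
]

# Step 1: count word frequencies and adjacent-pair frequencies.
word_freq = Counter()
pair_freq = Counter()
for s in sentences:
    words = s.split()
    word_freq.update(words)
    pair_freq.update(zip(words, words[1:]))  # pairs of adjacent words

MIN_PAIR_FREQ = 2    # step 2 threshold (illustrative)
MIN_COHESION = 0.1   # step 4 threshold (illustrative)

phrases = []
for (a, b), f in pair_freq.items():
    if f < MIN_PAIR_FREQ:
        continue                                    # step 2: reject rare pairs
    cohesion = f / (word_freq[a] * word_freq[b])    # step 3: cohesion value
    if cohesion > MIN_COHESION:                     # step 4: accept as phrase
        phrases.append((a, b))

print(phrases)  # [('information', 'retrieval')]
```

Here "information retrieval" occurs twice while "information" occurs three times and "retrieval" twice, giving cohesion 2 / (3 × 2) ≈ 0.33, above the threshold; every other adjacent pair occurs only once and is rejected at step 2.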
16
Types of Information Discovery
[Diagram: types of information discovery, organized by media type (text vs. image, video, audio, etc.) and by activity (searching, browsing, linking). Text searching divides into statistical methods, user-in-loop methods with catalogs and indexes (metadata) (CS 502), and natural language processing (CS 474).]