2007.02.01 - SLIDE 1 IS 240 – Spring 2007
Prof. Ray Larson
University of California, Berkeley
School of Information
Tuesday and Thursday 10:30 am - 12:00 pm
Spring 2007
http://courses.ischool.berkeley.edu/i240/s07
Principles of Information Retrieval
Lecture 6: Boolean to Vector
2007.02.01 - SLIDE 2 IS 240 – Spring 2007
Today
• IR Models
• The Boolean Model
• Boolean implementation issues
2007.02.01 - SLIDE 4 IS 240 – Spring 2007
IR Models
• Set Theoretic Models
  – Boolean
  – Fuzzy
  – Extended Boolean
• Vector Models (Algebraic)
• Probabilistic Models (probabilistic)
2007.02.01 - SLIDE 6 IS 240 – Spring 2007
Parse Result (Query Tree)
• Z39.50 queries…
Title XXX and Subject YYY
Oper: AND
  left: Operand: Index = Title, Value = XXX
  right: Operand: Index = Subject, Value = YYY
2007.02.01 - SLIDE 7 IS 240 – Spring 2007
Parse Results
• Subject XXX and (title yyy and author zzz)
Op: AND
  Oper: Index: Subject, Value: XXX
  Op: AND
    Oper: Index: Title, Value: YYY
    Oper: Index: Author, Value: ZZZ
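A parse tree like the one on this slide can be evaluated recursively. The following is a minimal sketch, not the Z39.50 implementation: the class names `Operand` and `Operator`, the `evaluate` function, and the toy postings are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Operand:
    index: str  # field to search, e.g. "Title"
    value: str  # search term, e.g. "XXX"


@dataclass
class Operator:
    op: str  # "AND", "OR", or "NOT" (meaning AND NOT)
    left: "Node"
    right: "Node"


Node = Union[Operand, Operator]


def evaluate(node, postings):
    """Evaluate a query tree against {(index, value): set of doc IDs}."""
    if isinstance(node, Operand):
        return set(postings.get((node.index, node.value), set()))
    left = evaluate(node.left, postings)
    right = evaluate(node.right, postings)
    if node.op == "AND":
        return left & right
    if node.op == "OR":
        return left | right
    if node.op == "NOT":
        return left - right
    raise ValueError(f"unknown operator: {node.op}")


# Subject XXX and (title yyy and author zzz)
tree = Operator(
    "AND",
    Operand("Subject", "XXX"),
    Operator("AND", Operand("Title", "YYY"), Operand("Author", "ZZZ")),
)

# Hypothetical postings, invented for illustration only.
toy_postings = {
    ("Subject", "XXX"): {1, 2, 3},
    ("Title", "YYY"): {2, 3, 4},
    ("Author", "ZZZ"): {3, 4, 5},
}
result = evaluate(tree, toy_postings)
```

With these toy postings only document 3 appears under all three operands, so it is the only ID left after both AND nodes are applied.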
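2007.02.01 - SLIDE 8 IS 240 – Spring 2007
Boolean AND Algorithm
List 1: 2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198
AND
List 2: 2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128, 135, 138, 141, 150, 155, 188, 189, 195
=
Result: 2, 8, 15, 100, 135, 155, 189, 195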
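The AND operation on two sorted postings lists is commonly done as a linear merge. A minimal sketch, assuming both lists are sorted and free of duplicates (the function name `intersect` is an illustrative choice):

```python
def intersect(a, b):
    """AND: walk two sorted postings lists in step, keeping IDs found in both."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b
        else:
            j += 1  # b[j] cannot appear later in a
    return out


# The two postings lists from this slide.
list_1 = [2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198]
list_2 = [2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128,
          135, 138, 141, 150, 155, 188, 189, 195]
and_result = intersect(list_1, list_2)
```

Each list is scanned once, so the cost grows with the sum of the two list lengths rather than their product.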
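2007.02.01 - SLIDE 9 IS 240 – Spring 2007
Boolean OR Algorithm
List 1: 2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198
OR
List 2: 2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128, 135, 138, 141, 150, 155, 188, 189, 195
=
Result: 2, 5, 7, 8, 9, 12, 15, 22, 28, 29, 35, 50, 68, 77, 84, 100, 120, 128, 135, 138, 140, 141, 150, 155, 188, 189, 190, 195, 198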
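OR can use the same kind of merge, except that every ID from either list is kept once. A minimal sketch under the same assumption of sorted, duplicate-free lists (the name `union` is illustrative):

```python
def union(a, b):
    """OR: merge two sorted postings lists, emitting each doc ID once."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # whatever remains in either list is already sorted
    out.extend(b[j:])
    return out


# The two postings lists from this slide.
list_1 = [2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198]
list_2 = [2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128,
          135, 138, 141, 150, 155, 188, 189, 195]
or_result = union(list_1, list_2)
```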
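2007.02.01 - SLIDE 10 IS 240 – Spring 2007
Boolean AND NOT Algorithm
List 1: 2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198
AND NOT
List 2: 2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128, 135, 138, 141, 150, 155, 188, 189, 195
=
Result: 5, 7, 29, 35, 140, 190, 198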
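AND NOT keeps the IDs of the first list that do not occur in the second. A minimal merge-style sketch, again assuming sorted, duplicate-free lists (the name `and_not` is illustrative):

```python
def and_not(a, b):
    """AND NOT: keep IDs from sorted list a that are absent from sorted list b."""
    out = []
    i = j = 0
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])  # nothing left in b can match a[i]
            i += 1
        elif a[i] == b[j]:
            i += 1  # excluded because it occurs in b
            j += 1
        else:
            j += 1
    return out


# The two postings lists from this slide.
list_1 = [2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198]
list_2 = [2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128,
          135, 138, 141, 150, 155, 188, 189, 195]
not_result = and_not(list_1, list_2)
```

Document 15 occurs in both lists, so it is dropped from the result.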
2007.02.01 - SLIDE 11 IS 240 – Spring 2007
Basic Concepts for Extended Boolean
• Instead of binary values, terms in documents and queries have a weight (importance or some other statistical property)
• Instead of binary set membership, sets are “fuzzy” and the weights are used to determine degree of membership.
• Degree of set membership can be used to rank the results of a query
2007.02.01 - SLIDE 12 IS 240 – Spring 2007
Fuzzy Sets
• Introduced by Zadeh in 1965.
• If set {A} has value v(A) and {B} has value v(B), where $0 \le v \le 1$
• $v(A \cap B) = \min(v(A), v(B))$
• $v(A \cup B) = \max(v(A), v(B))$
• $v(\sim A) = 1 - v(A)$
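The three fuzzy operators translate directly into code. A minimal sketch; the membership values 0.7 and 0.4 are hypothetical and chosen only for illustration:

```python
def fuzzy_and(v_a, v_b):
    """Membership in the intersection: the smaller of the two values."""
    return min(v_a, v_b)


def fuzzy_or(v_a, v_b):
    """Membership in the union: the larger of the two values."""
    return max(v_a, v_b)


def fuzzy_not(v_a):
    """Membership in the complement."""
    return 1.0 - v_a


# Hypothetical degrees of membership of one document in sets A and B.
v_a, v_b = 0.7, 0.4
both = fuzzy_and(v_a, v_b)
either = fuzzy_or(v_a, v_b)
not_a = fuzzy_not(v_a)
```

Because the results are graded values rather than 0 or 1, they can be used to rank documents instead of only filtering them.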
2007.02.01 - SLIDE 13 IS 240 – Spring 2007
Rule Evaluation Tree
[Figure: rule evaluation tree. Node values are listed below; the edge weights shown in the figure (0.9, 0.7, 0.5) cannot be matched to edges from the transcript.]
World_Series (0.63)
Event (0.63)
“World Series”
Baseball_championship (0.7)
Baseball (1.0)
Championship (0.7)
“baseball” (1.0)
“championship” (1.0)
“ball” (1.0)
St._Louis_Cardinals (0)
Team (0)
“Cardinals” (0)
Milwaukee_brewers (0)
Cardinals_full_name (0)
“Milwaukee Brewers” (0)
“Brewers” (0)
Saint (0)
“Louis” (0)
“Saint” (0)
“St.” (0)
“Cardinals” (0)
2007.02.01 - SLIDE 14 IS 240 – Spring 2007
Boolean Limitations
• Advantages
  – simple queries are easy to understand
  – relatively easy to implement
• Disadvantages
  – difficult to specify what is wanted, particularly in complex situations (e.g., RUBRIC queries)
  – too much returned, or too little
  – ordering not well determined in traditional Boolean
  – ordering may be problematic in extended Boolean (Robertson’s critique)
  – weighting is based only on the query, or some undefined weighting scheme must be used for the documents
2007.02.01 - SLIDE 16 IS 240 – Spring 2007
Lecture Overview
• Statistical Properties of Text
  – Zipf Distribution
  – Statistical Dependence
• Indexing and Inverted Files
• Vector Representation
• Term Weights
• Vector Matching
Credit for some of the slides in this lecture goes to Marti Hearst
2007.02.01 - SLIDE 17 IS 240 – Spring 2007
A Small Collection (Stems)
Rank Freq Term
1 37 system
2 32 knowledg
3 24 base
4 20 problem
5 18 abstract
6 15 model
7 15 languag
8 15 implem
9 13 reason
10 13 inform
11 11 expert
12 11 analysi
13 10 rule
14 10 program
15 10 oper
16 10 evalu
17 10 comput
18 10 case
19 9 gener
20 9 form
…
150 2 enhanc
151 2 energi
152 2 emphasi
153 2 detect
154 2 desir
155 2 date
156 2 critic
157 2 content
158 2 consider
159 2 concern
160 2 compon
161 2 compar
162 2 commerci
163 2 clause
164 2 aspect
165 2 area
166 2 aim
167 2 affect
2007.02.01 - SLIDE 18 IS 240 – Spring 2007
The Corresponding Zipf Curve
[Figure: frequency plotted against rank for the 20 most frequent stems of the small collection, from rank 1 “system” (freq 37) to rank 20 “form” (freq 9)]
2007.02.01 - SLIDE 19 IS 240 – Spring 2007
Zipf Distribution
• The Important Points:
  – A few elements occur very frequently
  – A medium number of elements have medium frequency
  – Many elements occur very infrequently
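A rank/frequency table like the one for the small collection can be produced by counting tokens and sorting by count. A minimal sketch; the toy sentence is invented for illustration, and ties are broken alphabetically as an arbitrary choice:

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency, term) rows, most frequent term first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, freq, term)
            for rank, (term, freq) in enumerate(ordered, start=1)]


# Toy text, invented for illustration only.
tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
for rank, freq, term in table:
    print(rank, freq, term)
```

Even in this tiny example one term dominates and most terms occur once, which is the shape the three points above describe.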
2007.02.01 - SLIDE 21 IS 240 – Spring 2007
Related Distributions/“Laws”
• Bradford’s Law of Scattering
• Lotka’s Law of Productivity
• De Solla Price’s Urn Model for “Cumulative Advantage Processes”
[Figure: urn model. Pick, then replace +1, and repeat: 1/2 = 50%, 2/3 = 66%, 3/4 = 75%]
2007.02.01 - SLIDE 22 IS 240 – Spring 2007
Frequent Words on the WWW
• 65002930 the
• 62789720 a
• 60857930 to
• 57248022 of
• 54078359 and
• 52928506 in
• 50686940 s
• 49986064 for
• 45999001 on
• 42205245 this
• 41203451 is
• 39779377 by
• 35439894 with
• 35284151 or
• 34446866 at
• 33528897 all
• 31583607 are
• 30998255 from
• 30755410 e
• 30080013 you
• 29669506 be
• 29417504 that
• 28542378 not
• 28162417 an
• 28110383 as
• 28076530 home
• 27650474 it
• 27572533 i
• 24548796 have
• 24420453 if
• 24376758 new
• 24171603 t
• 23951805 your
• 23875218 page
• 22292805 about
• 22265579 com
• 22107392 information
• 21647927 will
• 21368265 can
• 21367950 more
• 21102223 has
• 20621335 no
• 19898015 other
• 19689603 one
• 19613061 c
• 19394862 d
• 19279458 m
• 19199145 was
• 19075253 copyright
• 18636563 us
(see http://elib.cs.berkeley.edu/docfreq/docfreq.html)
2007.02.01 - SLIDE 23 IS 240 – Spring 2007
Word Frequency vs. Resolving Power
The most frequent words are not the most descriptive
(from van Rijsbergen 79)
2007.02.01 - SLIDE 24 IS 240 – Spring 2007
Statistical Independence
• Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together
$P(x)\,P(y) = P(x, y)$
2007.02.01 - SLIDE 25 IS 240 – Spring 2007
Lexical Associations
• Subjects write first word that comes to mind
  – doctor/nurse; black/white (Palermo & Jenkins 64)
• Text Corpora can yield similar associations
• One measure: Mutual Information (Church and Hanks 89)
$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
2007.02.01 - SLIDE 26 IS 240 – Spring 2007
Interesting Associations with “Doctor”
I(x,y)  f(x,y)  f(x)  x         f(y)  y
11.3    12      111   Honorary  621   Doctor
11.3    8       1105  Doctors   44    Dentists
10.7    30      1105  Doctors   241   Nurses
9.4     8       1105  Doctors   154   Treating
9.0     6       275   Examined  621   Doctor
8.9     11      1105  Doctors   317   Treat
8.7     25      621   Doctor    1407  Bills
AP Corpus, N=15 million, Church & Hanks 89
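The I(x,y) column can be approximated from the counts in the table. A minimal sketch, assuming each probability is estimated as a raw count divided by N = 15 million; the original study may have applied further normalization, so small differences from the table are expected:

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """I(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with P estimated as count / n."""
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))


N = 15_000_000  # corpus size given on the slide

# Counts taken from the table rows for Honorary/Doctor and Doctors/Nurses.
mi_honorary_doctor = mutual_information(12, 111, 621, N)
mi_doctors_nurses = mutual_information(30, 1105, 241, N)
print(round(mi_honorary_doctor, 2), round(mi_doctors_nurses, 2))
```

Both values come out close to the 11.3 and 10.7 listed in the table. If the two words were independent, the ratio inside the logarithm would be 1 and I(x,y) would be 0.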
2007.02.01 - SLIDE 27 IS 240 – Spring 2007
Un-Interesting Associations with “Doctor”
I(x,y)  f(x,y)  f(x)    x       f(y)   y
0.96    6       621     doctor  73785  with
0.95    41      284690  a       1105   doctors
0.93    12      84716   is      1105   doctors
These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun
AP Corpus, N=15 million, Church & Hanks 89
2007.02.01 - SLIDE 28 IS 240 – Spring 2007
Content Analysis Summary
• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
  – Word frequencies have a Zipf distribution
  – Word co-occurrences exhibit dependencies
2007.02.01 - SLIDE 30 IS 240 – Spring 2007
Inverted Indexes
• We have seen “Vector files” conceptually
  – An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
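The inversion is a matrix transpose: each term row collects that term's column from the document table. A minimal sketch using the ten documents on this slide (the function name `invert` is illustrative):

```python
# Document-by-term matrix from the slide: one row per document.
terms = ["t1", "t2", "t3"]
doc_term = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1],
}


def invert(doc_term, terms):
    """Transpose so that each term maps to its values across all documents."""
    rows = list(doc_term.values())  # dicts preserve insertion order (D1..D10)
    return {term: [row[k] for row in rows] for k, term in enumerate(terms)}


term_doc = invert(doc_term, terms)
for term, row in term_doc.items():
    print(term, row)
```

A practical inverted file stores only the nonzero entries of each term row, since most entries are 0.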
2007.02.01 - SLIDE 31 IS 240 – Spring 2007
Inverted File Structure

Sorted term occurrences:
Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2

Dictionary (Term, N docs, Tot Freq) with its Postings (Doc #, Freq):
Term      N docs  Tot Freq  Postings
a         1       1         (2, 1)
aid       1       1         (1, 1)
all       1       1         (1, 1)
and       1       1         (2, 1)
come      1       1         (1, 1)
country   2       2         (1, 1), (2, 1)
dark      1       1         (2, 1)
for       1       1         (1, 1)
good      1       1         (1, 1)
in        1       1         (2, 1)
is        1       1         (1, 1)
it        1       1         (2, 1)
manor     1       1         (2, 1)
men       1       1         (1, 1)
midnight  1       1         (2, 1)
night     1       1         (2, 1)
now       1       1         (1, 1)
of        1       1         (1, 1)
past      1       1         (2, 1)
stormy    1       1         (2, 1)
the       2       4         (1, 2), (2, 2)
their     1       1         (1, 1)
time      2       2         (1, 1), (2, 1)
to        1       2         (1, 2)
was       1       2         (2, 2)
2007.02.01 - SLIDE 32 IS 240 – Spring 2007
Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
  – Document ID
  – Frequency of term in doc (optional)
  – Position of term in doc (optional)
• These lists can be used to solve Boolean queries:
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2
• Also used for statistical ranking algorithms
2007.02.01 - SLIDE 33 IS 240 – Spring 2007
How Inverted Files are Used
[Dictionary and Postings tables, repeated from the previous slide]
Query on “time” AND “dark”
• 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
• 1 doc with “dark” in dictionary -> ID 2 from posting file
Therefore, only doc 2 satisfied the query
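The dictionary, the postings, and the “time” AND “dark” lookup can be sketched in a few lines. The two document texts below are an assumption: they do not appear on the slides and were chosen only because their word counts agree with the tables. The function names are illustrative.

```python
from collections import Counter, defaultdict

# Assumed document texts, consistent with the term counts in the tables.
docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}


def build_index(docs):
    """Return (dictionary, postings).

    postings:   term -> list of (doc_id, freq), ordered by doc_id
    dictionary: term -> (number of docs, total frequency)
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term, freq in Counter(docs[doc_id].split()).items():
            postings[term].append((doc_id, freq))
    dictionary = {term: (len(plist), sum(freq for _, freq in plist))
                  for term, plist in postings.items()}
    return dictionary, dict(postings)


def boolean_and(postings, term_a, term_b):
    """Doc IDs that contain both terms."""
    ids_a = {doc_id for doc_id, _ in postings.get(term_a, [])}
    ids_b = {doc_id for doc_id, _ in postings.get(term_b, [])}
    return sorted(ids_a & ids_b)


dictionary, postings = build_index(docs)
hits = boolean_and(postings, "time", "dark")
print(dictionary["time"], postings["time"], hits)
```

The dictionary is consulted first (how many documents hold each term), and only then are the matching postings read and intersected.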
2007.02.01 - SLIDE 34 IS 240 – Spring 2007
Lecture Overview
• Review
  – Boolean Searching
  – Content Analysis
• Statistical Properties of Text
  – Zipf Distribution
  – Statistical Dependence
• Indexing and Inverted Files
• Vector Representation
• Term Weights
• Vector Matching
Credit for some of the slides in this lecture goes to Marti Hearst
2007.02.01 - SLIDE 35 IS 240 – Spring 2007
Document Vectors
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
  – A vector is like an array of floating-point numbers
  – Has direction and magnitude
  – Each vector holds a place for every term in the collection
  – Therefore, most vectors are sparse
2007.02.01 - SLIDE 36 IS 240 – Spring 2007
Vector Space Model
• Documents are represented as vectors in term space
  – Terms are usually stems
  – Documents represented by binary or weighted vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents
2007.02.01 - SLIDE 37 IS 240 – Spring 2007
Vector Representation
• Documents and Queries are represented as vectors
• Position 1 corresponds to term 1, position 2 to term 2, position t to term t
• The weight of the term is stored in each position
$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$
$w = 0$ if a term is absent
2007.02.01 - SLIDE 38 IS 240 – Spring 2007
Document Vectors + Frequency
ID  nova  galaxy  heat  h'wood  film  role  diet  fur
(Blank cells were lost in the transcript, so each row below lists only that text's nonzero counts, left to right.)
A  10 5 3
B  5 10
C  10 8 7
D  9 10 5
E  10 10
F  9 10
G  5 7 9
H  6 10 2 8
I  7 5 1 3
“Nova” occurs 10 times in text A
“Galaxy” occurs 5 times in text A
“Heat” occurs 3 times in text A
(Blank means 0 occurrences.)
2007.02.01 - SLIDE 39 IS 240 – Spring 2007
Document Vectors + Frequency
[Same document-term frequency table as on the previous slide]
“Hollywood” occurs 7 times in text I
“Film” occurs 5 times in text I
“Diet” occurs 1 time in text I
“Fur” occurs 3 times in text I
2007.02.01 - SLIDE 41 IS 240 – Spring 2007
We Can Plot the Vectors
[Figure: documents plotted as vectors on two axes, “Star” and “Diet”. Labeled vectors: doc about astronomy, doc about movie stars, doc about mammal behavior]
2007.02.01 - SLIDE 42 IS 240 – Spring 2007
Documents in 3D Space
Primary assumption of the Vector Space Model: Documents that are “close together” in space are similar in meaning
2007.02.01 - SLIDE 43 IS 240 – Spring 2007
Vector Space Documents and Queries
docs  t1  t2  t3  RSV=Q.Di
D1    1   0   1   4
D2    1   0   0   1
D3    0   1   1   5
D4    1   0   0   1
D5    1   1   1   6
D6    1   1   0   3
D7    0   1   0   2
D8    0   1   0   2
D9    0   0   1   3
D10   0   1   1   5
D11   1   0   1   4
Q     1   2   3
      q1  q2  q3
[Figure: documents D1 to D11 plotted in the space with axes t1, t2, t3]
Boolean term combinations
Q is a query – also represented as a vector
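The RSV column is the dot product of the query vector with each document vector. A minimal sketch using Q = (1, 2, 3) and documents D1 to D10 from this slide (the function name `rsv` is illustrative):

```python
def rsv(query, doc):
    """Retrieval status value: dot product of query and document vectors."""
    return sum(q * d for q, d in zip(query, doc))


Q = [1, 2, 3]
docs = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1],
}

scores = {name: rsv(Q, vec) for name, vec in docs.items()}
# Rank documents by descending score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked[0])
```

D5 contains all three query terms and therefore gets the highest score, 1 + 2 + 3 = 6.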
2007.02.01 - SLIDE 44 IS 240 – Spring 2007
Documents in Vector Space
[Figure: documents D1 to D11 positioned in the three-dimensional space with axes t1, t2, t3]
2007.02.01 - SLIDE 45 IS 240 – Spring 2007
Document Space has High Dimensionality
• What happens beyond 2 or 3 dimensions?
• Similarity still has to do with how many tokens are shared in common.
• More terms -> harder to understand which subsets of words are shared among similar documents.
• We will look in detail at ranking methods
• Approaches to handling high dimensionality: Clustering and LSI (later)
2007.02.01 - SLIDE 47 IS 240 – Spring 2007
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf*idf
  – Recall the Zipf distribution
  – Want to weight terms highly if they are
    • Frequent in relevant documents … BUT
    • Infrequent in the collection as a whole
• Automatically derived thesaurus terms
2007.02.01 - SLIDE 48 IS 240 – Spring 2007
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1
2007.02.01 - SLIDE 49 IS 240 – Spring 2007
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector
docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0   10  0
D9    0   0   1
D10   0   3   5
D11   4   0   1
2007.02.01 - SLIDE 50 IS 240 – Spring 2007
Assigning Weights
• tf*idf measure:
  – Term frequency (tf)
  – Inverse document frequency (idf)
• A way to deal with some of the problems of the Zipf distribution
• Goal: Assign a tf*idf weight to each term in each document
2007.02.01 - SLIDE 51 IS 240 – Spring 2007
Simple tf*idf
$w_{ik} = tf_{ik} \cdot \log(N / n_k)$
$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$
$idf_k = \log(N / n_k)$
2007.02.01 - SLIDE 52 IS 240 – Spring 2007
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10000 documents (N = 10000):
$\log(10000 / 1) = 4$
$\log(10000 / 20) = 2.698$
$\log(10000 / 5000) = 0.301$
$\log(10000 / 10000) = 0$
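The IDF examples and the simple tf*idf weight can be checked numerically. A minimal sketch, assuming the logarithm is base 10, which is what the example values on this slide imply; the term frequency of 3 in the last line is hypothetical:

```python
import math


def idf(n_total, n_with_term):
    """Inverse document frequency: log10(N / n_k)."""
    return math.log10(n_total / n_with_term)


def tf_idf(tf, n_total, n_with_term):
    """Simple tf*idf weight: w_ik = tf_ik * log10(N / n_k)."""
    return tf * idf(n_total, n_with_term)


N = 10000
for n_k in (1, 20, 5000, 10000):
    print(n_k, round(idf(N, n_k), 3))

# Hypothetical case: a term occurring 3 times in a document
# and present in 20 of the 10000 documents.
weight = tf_idf(3, N, 20)
```

A term that occurs in every document gets an IDF of 0, so its weight vanishes no matter how often it occurs in a single document.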
2007.02.01 - SLIDE 53 IS 240 – Spring 2007
Non-Boolean IR
• Need to measure some similarity between the query and the document
• The basic notion is that documents that are somehow similar to a query are likely to be relevant responses for that query
• We will revisit this notion and see how the Language Modelling approach to IR has taken it to a new level
2007.02.01 - SLIDE 54 IS 240 – Spring 2007
Non-Boolean?
• To measure similarity we…
  – Need to consider the characteristics of the document and the query
  – Make the assumption that similarity of language use between the query and the document implies similarity of topic and hence, potential relevance.
2007.02.01 - SLIDE 55 IS 240 – Spring 2007
Similarity Measures (Set-based)
Assuming that Q and D are the sets of terms associated with a Query and Document:
Simple matching (coordination level match): $|Q \cap D|$
Dice’s Coefficient: $2 \frac{|Q \cap D|}{|Q| + |D|}$
Jaccard’s Coefficient: $\frac{|Q \cap D|}{|Q \cup D|}$
Cosine Coefficient: $\frac{|Q \cap D|}{|Q|^{1/2} \, |D|^{1/2}}$
Overlap Coefficient: $\frac{|Q \cap D|}{\min(|Q|, |D|)}$
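The five set-based measures map directly onto Python set operations. A minimal sketch, assuming both sets are non-empty (otherwise the ratios are undefined); the example term sets are hypothetical:

```python
import math


def simple_matching(q, d):
    """Coordination level match: number of shared terms."""
    return len(q & d)


def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))


def jaccard(q, d):
    return len(q & d) / len(q | d)


def cosine(q, d):
    return len(q & d) / math.sqrt(len(q) * len(d))


def overlap(q, d):
    return len(q & d) / min(len(q), len(d))


# Hypothetical query and document term sets.
Q = {"nova", "galaxy"}
D = {"nova", "galaxy", "heat", "film"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(Q, D), 3))
```

Only simple matching is unnormalized; the other four divide the shared-term count by some combination of the set sizes, so longer documents do not win purely by size.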
2007.02.01 - SLIDE 56 IS 240 – Spring 2007
What form should these take?
• Each of the queries and documents might be considered as:
  – A set of terms (Boolean approach)
    • “index terms”
    • “words”, stems, etc.
  – Some other form?
2007.02.01 - SLIDE 57 IS 240 – Spring 2007
Weighting schemes
• We have seen something of
  – Binary
  – Raw term weights
  – TF*IDF
• There are many other possibilities
  – IDF alone
  – Normalized term frequency
  – etc.