C.Watters csci6403 1
Classical IR Models
C.Watters csci6403 2
Goal
• Hit set of relevant documents
• Ranked set
• Best match
• Answer
C.Watters csci6403 3
Models
• Boolean (based on set theory)
– Fuzzy logic
– Extended Boolean
• Vector Space (based on algebra)
– Latent semantic networks
– Neural networks
• Probabilistic
– Inference networks
– Belief networks
• Hypertext
C.Watters csci6403 4
Retrieval
• Ad hoc
• Repeated
– Filter
– Selective dissemination of information (SDI)
– Profile
• Browsing
C.Watters csci6403 5
Index terms K={k1,…,kn}
Migration to Australia
This page introduces information about migrating to Australia (as a migrant or refugee), which means travelling to Australia with a visa that gives you the right to live permanently in Australia.
Please note: if you plan to visit Australia (that is, not stay permanently), and you want to work, please read the information about temporary entry.
C.Watters csci6403 6
Index Term Weights
• For each index term ki in document dj, a weight wi,j in the range 0..1 is assigned
• Terms are generally assumed to be independent
• What does this tell us?
– (0,1)
– (0..1)
C.Watters csci6403 7
Document as Set of Terms
• Document is represented by set of terms
• dj = {w1,j, w2,j, w3,j, …, wn,j}
• Where w1,j is the weight of term 1 in doc j
• So what if?
– w1,j = 0
– w1,j = 1
– w1,j = .2
C.Watters csci6403 8
Inverted File
• Term -> { occurrences}
• Organized for fast access by term
• Plus any extra information you need for your retrieval algorithm
• Size??
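A minimal sketch of the term -> {occurrences} mapping, assuming a naive whitespace tokenizer and a made-up three-document collection (neither is from the slides). Each term maps to a sorted list of document ids, so lookup by term is a single dictionary access.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical three-document collection
docs = {
    1: "Migration to Australia",
    2: "Australia visa information",
    3: "Geese migration",
}
index = build_inverted_index(docs)
print(index["australia"])  # [1, 2]
print(index["migration"])  # [1, 3]
```

The index grows with the number of distinct terms plus the total number of postings, which is one way to approach the size question.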
C.Watters csci6403 9
Boolean
• Based on set theory using index terms
– Term weights: wi,j = {0,1}
– Document vector: dj = (0,1,0,…)
– Boolean query: AND OR NOT
– Q = t1 AND t2 OR t3
• Australia AND work AND papers
• Australia AND visa
• Australia OR visa
C.Watters csci6403 10
Boolean Representation
• Sim(dj,q) = {0,1}
• Sample (t1 = Australia, t2 = visa, t3 = outback)
– d1 = (0,1,0)
– d2 = (1,1,0)
– d3 = (0,1,1)
• Australia and visa: sim(d1,q) =
• Australia or visa: sim(d2,q) =
• Australia not visa: sim(d3,q) =
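One way to work the sample through, as a sketch: each query is evaluated directly on the binary document vector. The function names are illustrative, and "Australia not visa" is read here as "Australia AND NOT visa", which is an assumption about the intended meaning.

```python
# Term positions in each binary document vector (t1, t2, t3)
AUSTRALIA, VISA, OUTBACK = 0, 1, 2

docs = {"d1": (0, 1, 0), "d2": (1, 1, 0), "d3": (0, 1, 1)}

def australia_and_visa(d):
    return int(bool(d[AUSTRALIA]) and bool(d[VISA]))

def australia_or_visa(d):
    return int(bool(d[AUSTRALIA]) or bool(d[VISA]))

def australia_not_visa(d):
    # Assumption: "Australia NOT visa" means "Australia AND NOT visa"
    return int(bool(d[AUSTRALIA]) and not d[VISA])

print(australia_and_visa(docs["d1"]))  # 0: d1 does not contain Australia
print(australia_or_visa(docs["d2"]))   # 1: d2 contains both terms
print(australia_not_visa(docs["d3"]))  # 0: d3 lacks Australia and contains visa
```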
C.Watters csci6403 11
Index Structure
• Australia:1,4,77
• Migrant: 1,5,87,97,123
• Visa:4,19, 55, 97
• Algorithm???
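One possible answer to the algorithm question, sketched under the assumption that each postings list is kept sorted by document id: AND becomes a linear merge of two lists, and OR becomes a union. The function names are illustrative.

```python
def intersect(p1, p2):
    """AND: walk two sorted postings lists in step, keeping ids found in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def union(p1, p2):
    """OR: every id that appears in either list, in sorted order."""
    return sorted(set(p1) | set(p2))

australia = [1, 4, 77]
migrant = [1, 5, 87, 97, 123]
visa = [4, 19, 55, 97]

print(intersect(australia, visa))  # [4]
print(intersect(migrant, visa))    # [97]
print(union(australia, visa))      # [1, 4, 19, 55, 77, 97]
```

The merge touches each posting at most once, so its cost is proportional to the combined length of the two lists.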
C.Watters csci6403 12
Complex queries
• (red or blue) and (sedan or (suv and ford))
• Efficiency?
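A sketch of how the nested query could be evaluated with set operations. The postings below are invented for illustration; only the query itself comes from the slide.

```python
# Hypothetical postings: term -> set of document ids
postings = {
    "red": {1, 2, 5},
    "blue": {3, 6},
    "sedan": {2, 3},
    "suv": {1, 4, 5},
    "ford": {4, 5, 6},
}

# (red or blue) and (sedan or (suv and ford))
colour = postings["red"] | postings["blue"]       # {1, 2, 3, 5, 6}
suv_ford = postings["suv"] & postings["ford"]     # {4, 5}
body = postings["sedan"] | suv_ford               # {2, 3, 4, 5}
hits = colour & body

print(sorted(hits))  # [2, 3, 5]
```

On efficiency: AND can only shrink a result while OR can only grow it, so evaluating the innermost AND first keeps the intermediate sets smaller.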
C.Watters csci6403 13
Problems
• Misinterpretation of query by users
• Mouse device
• Binary weights used for index terms
• Red BMW Convertible
• Elimination of partial results
• Binary results
– Document either fits or doesn't
– Too few or too many results
C.Watters csci6403 14
Dominance of this model
• Simple to implement
• Simple to use
• Examples?
C.Watters csci6403 15
Vector Space Model
• Relax binary weight restriction
• Allow partial matches
• Provide ranking of results
• Goal: determine the degree of similarity between each document and the query
C.Watters csci6403 16
• Given n possible index terms
• For each document
• ith term in jth document
• Has term weight in jth doc wi,j = [0..1]
• Giving dj = (w1,j, w2,j, …, wn,j)
• For each query term
• kth term has a query weight
• wk,q = [0..1]
• Giving q = (w1,q, w2,q, …, wn,q)
C.Watters csci6403 18
• Calculate similarities
• Rank
• Use threshold
• Q=Heat (.8) Film(.3) H’wood(.5)
• Result / Order
• Boolean result?
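The three steps (calculate, rank, threshold) can be sketched as follows. The query weights are the ones on the slide; the three documents, the threshold of 0.5, and the use of a weighted sum as the similarity are assumptions made for illustration.

```python
query = {"heat": 0.8, "film": 0.3, "hollywood": 0.5}

# Hypothetical documents with term weights in 0..1
docs = {
    "d1": {"heat": 0.9, "film": 0.0, "hollywood": 0.0},
    "d2": {"heat": 0.2, "film": 0.9, "hollywood": 0.8},
    "d3": {"heat": 0.0, "film": 0.5, "hollywood": 0.0},
}

def similarity(doc, q):
    # Assumed measure: sum of query weight times document weight
    return sum(w_q * doc.get(term, 0.0) for term, w_q in q.items())

def rank(docs, q, threshold):
    scored = [(name, similarity(doc, q)) for name, doc in docs.items()]
    kept = [(name, s) for name, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

ranked = rank(docs, query, threshold=0.5)
print([name for name, _ in ranked])  # ['d2', 'd1']
```

A Boolean query "heat AND film AND hollywood" would return only d2 from this made-up set, with no ordering information.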
C.Watters csci6403 19
Index Term Weights
• Given a set of documents
• Goals
– Find features that describe document X
– Find features that differentiate doc X from Y
• IR treats documents as clusters (bags) of terms
– Intra-cluster similarity
– Inter-cluster dissimilarity
C.Watters csci6403 20
Intra document term similarity
• Raw frequency of terms within the doc
• tf or term frequency factor
• Problems
– Common words
– Size of document
• Normalized tf: f_{i,j} = freq_{i,j} / max_k(freq_{k,j})
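A short sketch of the normalized term frequency: each raw count is divided by the count of the most frequent term in the same document. The token list is a made-up example.

```python
from collections import Counter

def normalized_tf(tokens):
    """f_ij = freq_ij / max_k(freq_kj) for a single document j."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: count / max_freq for term, count in counts.items()}

# Hypothetical document: australia x3, visa x2, migrate x1
tokens = "australia visa australia migrate australia visa".split()
tf = normalized_tf(tokens)
print(tf["australia"])  # 1.0
print(tf["migrate"] < tf["visa"] < tf["australia"])  # True
```

Dividing by the maximum keeps every value in 0..1 regardless of document length, which addresses the document-size problem listed above.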
C.Watters csci6403 21
Inter Document Dissimilarity
• Measure frequency of terms across doc set
• idf or inverse document frequency
• idf_i = log(N / n_i)
• N is number of documents
• ni is number of documents with term ki
• Dampens the effect of increases in set size
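A small sketch of the idf calculation. Base 10 is an assumption, since the slide only says "log"; the collection size of 1000 is also illustrative.

```python
import math

def idf(n_docs, n_docs_with_term):
    """idf_i = log(N / n_i); base 10 is an assumption."""
    return math.log10(n_docs / n_docs_with_term)

N = 1000  # hypothetical collection size
for n_i in (1, 10, 100, 1000):
    print(n_i, round(idf(N, n_i), 3))
```

Each tenfold change in the number of documents containing a term moves idf by only 1, which is the dampening effect; a term that appears in every document gets idf 0.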
C.Watters csci6403 23
So
• Term frequency -> more is better
• Document frequency -> less is better
• Together accentuate difference
• Migrate 3 times (10 docs out of 500)
• Australia 5 times (400 docs out of 500)
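The two examples can be worked through by multiplying the raw term frequency by idf. The product tf x idf and the base-10 logarithm are assumptions here; the slide only says the two factors work together.

```python
import math

N = 500  # documents in the collection

def tf_idf(tf, n_docs_with_term, n_docs=N):
    # Assumed combination: raw term frequency times log10(N / n_i)
    return tf * math.log10(n_docs / n_docs_with_term)

w_migrate = tf_idf(3, 10)     # 3 occurrences, term found in 10 of 500 docs
w_australia = tf_idf(5, 400)  # 5 occurrences, term found in 400 of 500 docs

print(f"migrate:   {w_migrate:.3f}")    # 5.097
print(f"australia: {w_australia:.3f}")  # 0.485
```

Although Australia occurs more often in the document, Migrate ends up with roughly ten times the weight because it is rare across the collection.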
C.Watters csci6403 24
OK
• Use term weights to calculate
• Document to document similarity
• (more high weight terms in common)
• And
• Query to Document similarity
• (query terms are high weight terms in doc)
C.Watters csci6403 25
Document-Document Similarity
C.Watters csci6403 26
Example
• Document 1: Australia sample document
– Australia weight .05
– Migrate weight .56
• Document 2: Geese Migration
– Geese weight .45
– Migrate weight .55
C.Watters csci6403 27
Vector Structure
• Doc1: .1, 0,0,.4, 0, 0, 0,.8,.7, 0,.7,.7
• Doc2: .1,.1,0,.1, 0,.8,.7,.9,.7,.1,.2,.3
• Doc3: .4,.1,0, 0,.9,.5,.5, 0, 0, 0,.9,.7
• Algorithm???
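One candidate algorithm, sketched under the assumption that document-to-document similarity is the sum of products of matching term weights (a dot product). Other measures would rank the pairs differently.

```python
from itertools import combinations

docs = {
    "Doc1": (.1, 0, 0, .4, 0, 0, 0, .8, .7, 0, .7, .7),
    "Doc2": (.1, .1, 0, .1, 0, .8, .7, .9, .7, .1, .2, .3),
    "Doc3": (.4, .1, 0, 0, .9, .5, .5, 0, 0, 0, .9, .7),
}

def dot(u, v):
    """Sum of products of weights in matching term positions."""
    return sum(a * b for a, b in zip(u, v))

for (name_a, a), (name_b, b) in combinations(docs.items(), 2):
    print(name_a, name_b, f"{dot(a, b):.2f}")
# Doc1 Doc2 1.61
# Doc1 Doc3 1.16
# Doc2 Doc3 1.19
```

Under this measure Doc1 and Doc2 are the closest pair, driven mostly by the two terms where both carry high weights.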
C.Watters csci6403 28
Query Document Similarity
• Sim(D,Q)=SUM(wi,q* wi,d)
• So query = Australia (.5) Geese (.8)
• Sim(doc1,Q)=
• Sim(doc2,Q)=
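The two blanks can be filled in by applying the sum directly, using the document weights from the earlier example (Australia .05 and Migrate .56 for doc 1; Geese .45 and Migrate .55 for doc 2). Terms missing from a document are treated as weight 0, which is an assumption.

```python
def sim(doc, query):
    """Sim(D,Q) = SUM(w_iq * w_id); absent terms count as weight 0."""
    return sum(w_q * doc.get(term, 0.0) for term, w_q in query.items())

query = {"australia": 0.5, "geese": 0.8}
doc1 = {"australia": 0.05, "migrate": 0.56}
doc2 = {"geese": 0.45, "migrate": 0.55}

print(round(sim(doc1, query), 3))  # 0.025
print(round(sim(doc2, query), 3))  # 0.36
```

Doc 2 ranks well above doc 1 because its matching term has a high weight in both the query and the document.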
C.Watters csci6403 29
Doing Better!
• Augmented schemes
• Vector space similarity measures
C.Watters csci6403 30
Query Term Weights
• Natural language query
• I am doing a paper on shipping for my class at Dalhousie. Are there any reports from this university on deep sea shipping.
• Frequency
• Part of speech
C.Watters csci6403 31
Using Similarity: Partial Matches
• Given wi,j and wi,q, sim(q,dj) = [0…1]
• Every document has a similarity value to every query
• E.g., Dalhousie shipping
• What does OR mean
• What does AND mean
• How to manage this
C.Watters csci6403 32
Using Similarity: Ranking
• Order results by similarity value
• Dalhousie Shipping ??
• Query and document term weights
C.Watters csci6403 33
Similarity of Q to Docs (Normalize)
sim(dj,q) = cosine of the angle between the vectors dj and q
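A minimal sketch of the normalized similarity: the dot product of the two weight vectors divided by the product of their lengths. Returning 0 for an all-zero vector is a choice made here, not something the slides specify.

```python
import math

def cosine(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm_d * norm_q)

print(round(cosine((1, 0), (1, 1)), 4))  # 0.7071
print(round(cosine((1, 2), (2, 4)), 4))  # 1.0
```

The second call shows the effect of normalizing: a vector and a scaled copy of it point the same way, so vector length (and with it document length) does not change the score.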
C.Watters csci6403 36
So why do we need a vector???
C.Watters csci6403 39
Other similarity measurements
C.Watters csci6403 40
Cosine Similarity
C = number of terms in common, A = number of terms in i, B = number of terms in j
C.Watters csci6403 41
Dice similarity Measure
C = number of terms in common, A = number of terms in i, B = number of terms in j
C.Watters csci6403 42
Jaccard Similarity Measure
C = number of terms in common, A = number of terms in i, B = number of terms in j
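The formulas for these three slides did not survive in the transcript. The sketch below uses the standard set-based forms of each measure, which is an assumption about what the slides showed: cosine = C / sqrt(A*B), Dice = 2C / (A + B), Jaccard = C / (A + B - C). The two term sets are made up.

```python
import math

def counts(terms_i, terms_j):
    """Return (C, A, B): terms in common, terms in i, terms in j."""
    return len(terms_i & terms_j), len(terms_i), len(terms_j)

def cosine_sets(terms_i, terms_j):
    c, a, b = counts(terms_i, terms_j)
    return c / math.sqrt(a * b)

def dice(terms_i, terms_j):
    c, a, b = counts(terms_i, terms_j)
    return 2 * c / (a + b)

def jaccard(terms_i, terms_j):
    c, a, b = counts(terms_i, terms_j)
    return c / (a + b - c)

# Hypothetical term sets: C = 1, A = 3, B = 2
doc_i = {"australia", "visa", "migrate"}
doc_j = {"geese", "migrate"}

print(round(cosine_sets(doc_i, doc_j), 3))  # 0.408
print(dice(doc_i, doc_j))                   # 0.4
print(jaccard(doc_i, doc_j))                # 0.25
```

All three run from 0 (no shared terms) to 1 (identical sets) and differ only in how they penalize the non-shared terms. The functions assume both sets are non-empty.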
C.Watters csci6403 43
Vector Space Model
• Advantages
– Allows partial matches
– Allows ranking
• Disadvantages
– Need whole doc set to determine weights
– Extra computation
– Terms are assumed to be independent
C.Watters csci6403 44
NeoClassical Models
• Probabilistic model
• Boolean variations
– Fuzzy set model
– Extended Boolean
• Vector space variations
– Generalized vector space (term dependency)
– Latent semantic indexing
– Neural net models