
C. Watters, CSCI 6403

Classical IR Models


Goal

• Hit set of relevant documents

• Ranked set

• Best match

• Answer


Models

• Boolean (based on set theory)

– Fuzzy logic

– Extended Boolean

• Vector Space (based on algebra)

– Latent semantic networks

– Neural networks

• Probabilistic

– Inference networks

– Belief networks

• Hypertext


Retrieval

• Ad hoc

• Repeated

– Filter

– Selective dissemination of information (SDI)

– Profile

• Browsing


Index terms K={k1,…,kn}

Migration to Australia

This page introduces information about migrating to Australia (as a migrant or refugee), which means travelling to Australia with a visa that gives you the right to live permanently in Australia.

Please note: if you plan to visit Australia (that is, not stay permanently), and you want to work, please read the information about temporary entry.


Index Term Weights

• Each index term ki in document dj is assigned a weight wi,j in the range (0..1)

• Term weights are generally assumed to be independent of one another

• What does this tell us?

– (0,1): binary weights

– (0..1): graded weights


Document as Set of Terms

• Document is represented by set of terms

• dj = {w1,j, w2,j, w3,j, …, wn,j}

• Where w1,j is the weight of term 1 in document j

• So what does it mean if

– w1,j = 0

– w1,j = 1

– w1,j = .2


Inverted File

• Term -> { occurrences}

• Organized for fast access by term

• Plus any extra information you need for your retrieval algorithm

• Size??
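
A minimal sketch of the term -> {occurrences} mapping, assuming whitespace tokenization and document-level postings only (no positions or weights). The `docs` collection and the function name `build_inverted_index` are hypothetical, chosen for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sorted lists make later merge-style intersection straightforward.
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical mini collection, not taken from the slides.
docs = {
    1: "australia migrant visa",
    2: "visa work",
    3: "australia outback",
}
index = build_inverted_index(docs)
print(index["australia"])
print(index["visa"])
```

On the size question: this structure stores one entry per distinct (term, document) pair, so it grows with the number of such pairs rather than with the raw text length.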


Boolean

• Based on set theory using index terms

– Term weights: wi,j ∈ {0,1}

– Document vector: dj = (0,1,0,…)

– Boolean query: AND, OR, NOT

– Q = t1 AND t2 OR t3

• Australia AND work AND papers

• Australia AND visa

• Australia OR visa


Boolean Representation

• Sim(dj,q) ∈ {0,1}

• Sample (t1 = Australia, t2 = visa, t3 = outback)

– d1 = (0,1,0)

– d2 = (1,1,0)

– d3 = (0,1,1)

• Australia AND visa: sim(d1,q) =

• Australia OR visa: sim(d2,q) =

• Australia NOT visa: sim(d3,q) =
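
One way to check the three cases is to evaluate each query directly on the binary vectors. This is a sketch under one assumption: "Australia NOT visa" is read as "Australia AND NOT visa". The function names are hypothetical.

```python
# Term order follows the slide: t1 = Australia, t2 = visa, t3 = outback.
d1 = (0, 1, 0)
d2 = (1, 1, 0)
d3 = (0, 1, 1)

def sim_and(d):
    """Australia AND visa: both terms must be present."""
    return int(d[0] == 1 and d[1] == 1)

def sim_or(d):
    """Australia OR visa: at least one term must be present."""
    return int(d[0] == 1 or d[1] == 1)

def sim_not(d):
    """Australia AND NOT visa (assumed reading of 'Australia NOT visa')."""
    return int(d[0] == 1 and d[1] == 0)

print(sim_and(d1), sim_or(d2), sim_not(d3))
```

Each result is 0 or 1, which is the point of the slide: the Boolean model has no notion of a partial match.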


Index Structure

• Australia: 1, 4, 77

• Migrant: 1, 5, 87, 97, 123

• Visa: 4, 19, 55, 97

• Algorithm???
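
One common answer to the algorithm question is a linear merge of two sorted posting lists. The sketch below uses the posting lists from this slide; the function name `intersect` is hypothetical, and the lists are assumed to be sorted by document id.

```python
def intersect(p1, p2):
    """AND of two sorted posting lists via a single merge pass."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller id
        else:
            j += 1
    return result

postings = {
    "australia": [1, 4, 77],
    "migrant": [1, 5, 87, 97, 123],
    "visa": [4, 19, 55, 97],
}
australia_and_visa = intersect(postings["australia"], postings["visa"])
migrant_and_visa = intersect(postings["migrant"], postings["visa"])
print(australia_and_visa, migrant_and_visa)
```

The merge touches each posting at most once, so the cost is proportional to the combined length of the two lists.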


Complex queries

• (red or blue) and (sedan or (suv and ford))

• Efficiency?
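
A sketch of how the nested query could be evaluated with set operations, innermost parentheses first. The posting sets are hypothetical, invented only to make the example runnable.

```python
# Hypothetical posting sets (document ids) for each term.
red = {1, 2, 5}
blue = {3, 7}
sedan = {1, 3, 4}
suv = {2, 5, 7}
ford = {5, 7, 9}

# (red or blue) and (sedan or (suv and ford))
colour = red | blue              # OR is set union
suv_ford = suv & ford            # AND is set intersection
body = sedan | suv_ford
result = colour & body
print(sorted(result))
```

On efficiency, one commonly used heuristic is to intersect the smallest sets first, since an AND can never return more documents than its smallest operand.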


Problems

• Misinterpretation of query by users

• Mouse device

• Binary weights used for index terms

• Red BMW Convertible

• Elimination of partial results

• Binary results

– Document either fits or doesn’t

– Too few or too many results


Dominance of this model

• Simple to implement

• Simple to use

• Examples?


Vector Space Model

• Relax binary weight restriction

• Allow partial matches

• Provide ranking of results

• Goal: determine the degree of similarity between each document and the query


• Given n possible index terms

• For each document

– ith term in jth document has term weight wi,j ∈ [0..1]

– Giving dj = (w1,j, w2,j, …, wn,j)

• For each query term

– kth term has a query weight wk,q ∈ [0..1]

– Giving q = (w1,q, w2,q, …, wn,q)


• Calculate similarities

• Rank

• Use threshold

• Q = Heat (.8), Film (.3), H’wood (.5)

• Result / Order

• Boolean result?
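
The calculate, rank, threshold steps can be sketched as follows. The query weights are the ones on this slide; the document weights, the threshold of 0.2, and the use of a weighted sum as the similarity are assumptions made for illustration.

```python
# Query weights from the slide; "hwood" stands for H'wood.
query = {"heat": 0.8, "film": 0.3, "hwood": 0.5}

# Hypothetical document term weights, invented for illustration only.
docs = {
    "docA": {"heat": 0.9, "film": 0.2},
    "docB": {"film": 0.7, "hwood": 0.6},
    "docC": {"heat": 0.1},
}

def similarity(query, doc):
    """Sum of query weight times document weight over the query terms."""
    return sum(w * doc.get(term, 0.0) for term, w in query.items())

def rank(query, docs, threshold):
    """Score every document, drop those below the threshold, sort descending."""
    scored = [(similarity(query, d), name) for name, d in docs.items()]
    kept = [(s, name) for s, name in scored if s >= threshold]
    return [name for s, name in sorted(kept, reverse=True)]

ranked = rank(query, docs, threshold=0.2)
print(ranked)
```

For comparison, a Boolean OR over the same three query terms would return all three hypothetical documents with no ordering among them.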


Index Term Weights

• Given a set of documents

• Goals

– Find features that describe document X

– Find features that differentiate doc X from Y

• IR treats documents as clusters (bags) of terms

– Intra-cluster similarity

– Inter-cluster dissimilarity


Intra document term similarity

• Raw frequency of terms within the doc

• tf or term frequency factor

• Problems– Common words– Size of document

• Normalized tf: f_{i,j} = freq_{i,j} / max_k(freq_{k,j})
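
A minimal sketch of the normalized term frequency, where each raw count is divided by the count of the most frequent term in the same document. The function name and the sample token list are hypothetical, and the document is assumed to be non-empty.

```python
from collections import Counter

def normalized_tf(tokens):
    """Raw count of each term divided by the largest count in the document."""
    counts = Counter(tokens)
    max_freq = max(counts.values())  # assumes at least one token
    return {term: count / max_freq for term, count in counts.items()}

# Hypothetical document, already tokenized.
tf = normalized_tf("australia visa australia migrate australia visa".split())
print(tf)
```

Dividing by the maximum count addresses the document-size problem listed above: a long document no longer gets larger weights simply because every term occurs more often.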


Inter Document Dissimilarity

• Measure frequency of terms across doc set

• idf or inverse document frequency

• idf_i = log(N / n_i)

• N is number of documents

• ni is number of documents with term ki

• Dampens the effect of increases in set size
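
A short sketch of the idf factor and of the dampening effect. The slides do not state the base of the logarithm; the natural log is assumed here, and the collection sizes are hypothetical.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency: log of (collection size / document frequency)."""
    return math.log(n_docs / doc_freq)

# A term found in every document carries no discriminating power.
everywhere = idf(500, 500)

# Growing the collection tenfold adds only log(10) to the idf of a
# term that still occurs in 10 documents.
small = idf(1_000, 10)
large = idf(10_000, 10)
print(everywhere, small, large)
```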


So

• Term frequency -> more is better

• Document frequency -> less is better

• Together accentuate difference

• Migrate 3 times (10 docs out of 500)

• Australia 5 times (400 docs out of 500)
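
The two terms above can be worked through numerically. This sketch makes several assumptions that the slides do not state: both counts come from the same document, "Australia" is that document's most frequent term (so the maximum frequency is 5), the two factors are combined by multiplication as in the common tf-idf scheme, and the logarithm is natural.

```python
import math

N = 500         # documents in the collection
MAX_FREQ = 5    # assumed count of the most frequent term in the document

def tf_idf(freq, doc_freq):
    """Normalized term frequency times inverse document frequency."""
    tf = freq / MAX_FREQ
    idf = math.log(N / doc_freq)
    return tf * idf

w_migrate = tf_idf(freq=3, doc_freq=10)      # rarer across the collection
w_australia = tf_idf(freq=5, doc_freq=400)   # frequent but widespread
print(round(w_migrate, 3), round(w_australia, 3))
```

Under these assumptions "Migrate" ends up with the larger weight even though "Australia" occurs more often in the document, because "Australia" appears in most of the collection.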


OK

• Use term weights to calculate

• Document to document similarity

• (more high weight terms in common)

• And

• Query to Document similarity

• (query terms are high weight terms in doc)


Document-Document Similarity


Example

• Document 1: Australia sample document

– Australia weight .05

– Migrate weight .56

• Document 2: Geese Migration

– Geese weight .45

– Migrate weight .55


Vector Structure

• Doc1: .1, 0,0,.4, 0, 0, 0,.8,.7, 0,.7,.7

• Doc2: .1,.1,0,.1, 0,.8,.7,.9,.7,.1,.2,.3

• Doc3: .4,.1,0, 0,.9,.5,.5, 0, 0, 0,.9,.7

• Algorithm???
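
One possible algorithm is to compare every pair of document vectors term by term. The sketch below uses the three vectors from this slide and an unnormalized dot product as the similarity, which is an assumption; a normalized measure could be substituted.

```python
from itertools import combinations

vectors = {
    "Doc1": [.1, 0, 0, .4, 0, 0, 0, .8, .7, 0, .7, .7],
    "Doc2": [.1, .1, 0, .1, 0, .8, .7, .9, .7, .1, .2, .3],
    "Doc3": [.4, .1, 0, 0, .9, .5, .5, 0, 0, 0, .9, .7],
}

def dot(u, v):
    """Sum of products of matching term weights."""
    return sum(a * b for a, b in zip(u, v))

# Similarity for every unordered pair of documents.
pair_sims = {
    (a, b): dot(vectors[a], vectors[b])
    for a, b in combinations(vectors, 2)
}
most_similar = max(pair_sims, key=pair_sims.get)
print(pair_sims, most_similar)
```

Only terms with a non-zero weight in both documents contribute, which matches the "more high weight terms in common" idea.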


Query Document Similarity

• Sim(D,Q) = SUM(wi,q * wi,d), summed over all index terms i

• So query = Australia (.5), Geese (.8)

• Sim(doc1,Q)=

• Sim(doc2,Q)=
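
The two similarities can be computed directly from the sum-of-products formula. This sketch uses the document weights from the earlier Example slide and the query weights given here; terms missing from a document are assumed to have weight 0.

```python
doc1 = {"australia": 0.05, "migrate": 0.56}
doc2 = {"geese": 0.45, "migrate": 0.55}
query = {"australia": 0.5, "geese": 0.8}

def sim(doc, query):
    """Sum over query terms of query weight times document weight."""
    return sum(w_q * doc.get(term, 0.0) for term, w_q in query.items())

sim1 = sim(doc1, query)   # only "australia" matches: 0.5 * 0.05
sim2 = sim(doc2, query)   # only "geese" matches: 0.8 * 0.45
print(round(sim1, 3), round(sim2, 3))
```

"Migrate" has a high weight in both documents but contributes nothing, because it is not a query term.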


Doing Better!

• Augmented schemes

• Vector space similarity measures


Query Term Weights

• Natural language query

• I am doing a paper on shipping for my class at Dalhousie. Are there any reports from this university on deep sea shipping.

• Frequency

• Part of speech


Using Similarity: Partial Matches

• Given weights wi,j and wi,q, sim(q,dj) ∈ [0…1]

• Every document has a similarity value to every query

• E.g., Dalhousie shipping

• What does OR mean?

• What does AND mean?

• How to manage this?


Using Similarity: Ranking

• Order results by similarity value

• Dalhousie Shipping ??

• Query and document term weights


Similarity of Q to Docs (Normalize)

[Figure: document vector dj and query vector q]

• sim(dj,q) = cosine of the angle between dj and q
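
A minimal sketch of cosine similarity, computed as the dot product divided by the product of the two vector lengths. The weight vectors are hypothetical; the second document is the first scaled by two, to show what the normalization buys.

```python
import math

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical weights over three index terms.
q = [0.5, 0.8, 0.0]
d_short = [0.2, 0.4, 0.1]
d_long = [2 * w for w in d_short]  # same term mix, twice the magnitude

print(cosine(q, d_short), cosine(q, d_long))
```

Both documents get the same score, because the cosine depends only on the direction of the vectors and not on their length.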


So why do we need a vector???


Other similarity measurements


Cosine Similarity

C = number of terms in common, A = number of terms in i, and B = number of terms in j


Dice similarity Measure



Jaccard Similarity Measure
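
The formulas for these three measures did not survive on the slides. The sketch below uses the standard set-based definitions in terms of the counts C, A, and B: cosine = C / sqrt(A × B), Dice = 2C / (A + B), and Jaccard = C / (A + B − C). Treat these as the textbook forms rather than as a transcription of the original slides; the term sets are hypothetical and assumed to be non-empty.

```python
import math

def counts(terms_i, terms_j):
    """Return (C, A, B): shared terms, terms in i, terms in j."""
    return len(terms_i & terms_j), len(terms_i), len(terms_j)

def cosine_sets(terms_i, terms_j):
    c, a, b = counts(terms_i, terms_j)
    return c / math.sqrt(a * b)

def dice(terms_i, terms_j):
    c, a, b = counts(terms_i, terms_j)
    return 2 * c / (a + b)

def jaccard(terms_i, terms_j):
    c, a, b = counts(terms_i, terms_j)
    return c / (a + b - c)

# Hypothetical term sets: C = 2, A = 4, B = 3.
doc_i = {"australia", "visa", "migrate", "work"}
doc_j = {"geese", "migrate", "visa"}
print(cosine_sets(doc_i, doc_j), dice(doc_i, doc_j), jaccard(doc_i, doc_j))
```

All three measures are 1 for identical sets and 0 for disjoint sets; they differ in how strongly they penalize terms that are not shared.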



Vector Space Model

• Advantages

– Allows partial matches

– Allows ranking

• Disadvantages

– Need whole doc set to determine weights

– Extra computation

– Terms are assumed to be independent


NeoClassical Models

• Probabilistic model

• Boolean variations

– Fuzzy set model

– Extended Boolean

• Vector space variations

– Generalized vector space (term dependency)

– Latent Semantic indexing

– Neural net models
