Classic IR Models (Dalhousie University)
Source: web.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf
TRANSCRIPT
Classic IR Models
5/6/2012 1
Classic IR Models
• Idea
– Each document is represented by index terms.
– An index term is basically a word whose semantics help capture the meaning of the document.
– Not all index terms are equally useful for describing the document content.
– The effect of index terms on the document is captured by assigning a weight to each term in the document.
Definition
• Let
– t: the number of index terms in the corpus (or system).
– ki: a generic index term.
– K = {k1, k2, …, kt}: the set of index terms.
– wi,j > 0: a weight associated with index term ki in document dj.
– wi,j = 0 if ki does not appear in dj.
– With each document dj we associate an index term vector dj = (w1,j, w2,j, …, wt,j).
– gi is a ranking function that returns the weight associated with index term ki in dj: gi(dj) = wi,j.
IR Models
• User task: Retrieval (ad hoc, filtering), Browsing
• Classic models: Boolean, Vector, Probabilistic
• Set-theoretic: Fuzzy, Extended Boolean
• Probabilistic: Inference Network, Belief Network
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Structured models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
Basic Idea
• Document: set of terms
• Query: Boolean expression over terms
– Satisfying:
• Document evaluates to "true" on single-term query if it contains that term
• Evaluate document on expression query as you would any Boolean expression
• Document satisfies query if evaluates to true on query
Credit: Princeton
Satisfying a Query in the Boolean Model
• What determines whether a document satisfies a query?
– That depends on …
• the document model
• the query model
• Start simple:
– better understanding
– use components of the simple model later
Boolean Model Example
• Doc 1: “Computers have brought the world to our fingertips. We will try to understand at a basic level the science -- old and new -- underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies… Ultimately, this study makes us look anew at ourselves -- our genome; language; music; "knowledge"; and, above all, the mystery of our intelligence.”
• Doc 2: “An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science …”
• Query:
– (principles AND knowledge) OR (science AND engineering)
Boolean Model Example (cont'd)
• Evaluating Doc 1 against (principles AND knowledge) OR (science AND engineering):
– principles = 0, knowledge = 1, science = 1, engineering = 0
– (0 AND 1) OR (1 AND 0) = FALSE
• Doc 1: FALSE
• Evaluating Doc 2 against the same query:
– principles = 1, knowledge = 0, science = 1, engineering = 1
– (1 AND 0) OR (1 AND 1) = TRUE
• Doc 2: TRUE
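The two evaluations above can be sketched directly, assuming each document has been reduced to its set of (lowercased) terms; the sets below are shortened stand-ins for the full course descriptions:

```python
# Boolean model sketch: a document is a set of terms, and a query is
# evaluated like any Boolean expression over term membership.
doc1 = {"computers", "science", "scientific", "knowledge", "language", "music"}
doc2 = {"computer", "science", "scientific", "engineering", "principles", "applications"}

def matches(terms):
    # (principles AND knowledge) OR (science AND engineering)
    return ("principles" in terms and "knowledge" in terms) or \
           ("science" in terms and "engineering" in terms)

print(matches(doc1))  # Doc 1: False
print(matches(doc2))  # Doc 2: True
```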
Exercise
• Use Doc 1 and Doc 2
• (principles OR knowledge) AND (science AND NOT(engineering))
Implementation Example (Boolean Model)
• Suppose we have a data set of three documents as follows:
– D1 = Programming in Java
– D2 = OO Programming
– D3 = Databases and SQL Programming
• "in" and "and" are dropped (stop words).
Implementation Example (Boolean Model)
• Primary Index (term-document incidence):

       Database  Java  OO  Programming  SQL
  D1      0       1    0       1         0
  D2      0       0    1       1         0
  D3      1       0    0       1         1

• Inverted Index:

  Term         Freq.  Postings List
  Database       1    D3
  Java           1    D1
  OO             1    D2
  Programming    3    D1, D2, D3
  SQL            1    D3
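The index above can be built in a few lines; a minimal sketch (note the slide's table lists the stemmed form "Database", while this sketch keeps raw tokens such as "Databases"):

```python
# Building an inverted index (term -> list of docIDs) for the
# three-document example, with "in" and "and" removed as stop words.
from collections import defaultdict

docs = {
    "D1": "Programming in Java",
    "D2": "OO Programming",
    "D3": "Databases and SQL Programming",
}
stop_words = {"in", "and"}

inverted = defaultdict(list)
for doc_id in sorted(docs):
    for term in docs[doc_id].split():
        if term.lower() in stop_words:
            continue
        if doc_id not in inverted[term]:      # one posting per document
            inverted[term].append(doc_id)

for term in sorted(inverted):
    # the document frequency is simply the postings-list length
    print(term, len(inverted[term]), inverted[term])
```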
Term-Document Incidence Boolean Model
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0
• An entry is 1 if the play contains the word, 0 otherwise.
• Query: Brutus AND Caesar BUT NOT Calpurnia
Incidence Vectors
• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) bitwise AND.
• 110100 AND 110111 AND 101111 = 100100.
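The same computation in code, treating each incidence row as a bit pattern (one bit per play, most significant bit = Antony and Cleopatra):

```python
# Incidence rows combined with bitwise operators, mirroring
# 110100 AND 110111 AND NOT(010000) = 100100.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111              # six plays

result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))      # -> 100100
```

The two set bits correspond to Antony and Cleopatra and Hamlet, matching the answers on the next slide.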
Answers to Query
• Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Exercise 1
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• Q1 = “information retrieval”
• Q2 = “information ¬computer”
Exercise 2
0 (no authors)
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
Retrieval Evaluation
• User Evaluation
– Relevant
– Not relevant
• System Evaluation
– Retrieved
– Not Retrieved
           Rel.   Not Rel.
Ret.        a        b
Not Ret.    c        d

Recall R = a / (a + c)
Precision P = a / (a + b)
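The two measures follow directly from the contingency table; a minimal sketch with hypothetical counts:

```python
# a = relevant & retrieved, b = non-relevant & retrieved,
# c = relevant & not retrieved, d = non-relevant & not retrieved.
def recall(a, c):
    return a / (a + c)

def precision(a, b):
    return a / (a + b)

# Hypothetical counts for illustration only:
a, b, c = 30, 10, 20
print(recall(a, c))     # 30 / 50 = 0.6
print(precision(a, b))  # 30 / 40 = 0.75
```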
Drawing of Recall-Precision
(figure omitted; credits: ralphlosey.files.wordpress.com, ilab.cs.ucsb.edu)
Advantage of Boolean IR Modeling
• The Boolean Model
– Fast to implement
– Fast to process a query
– Simple
Boolean Modeling Pitfalls
• Retrieval is based on a binary decision criterion with no notion of partial matching.
• No ranking of the documents is provided (absence of a grading scale).
• The information need has to be translated into a Boolean expression, which most users find awkward.
• The Boolean queries formulated by users are most often too simplistic.
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.
Always Remember!
• We care about modeling.
• Implementation can be done in different ways.
• Which way should you select?
– It depends.
• You can go with a hash table / hash tree. When?
• You can use a B-tree. When?
• More about this in assignment 2.
• The Boolean model has extended forms.
• The Boolean model does not take care of ranking (ordering results by estimated relevance).
The Inverted Index
Boolean Model Continued
Example from last class
Implementation Example (Boolean Model)
• Suppose we have a data set of three documents as follows:
– D1 = Programming in Java
– D2 = OO Programming
– D3 = Databases and SQL Programming
• "in" and "and" are dropped (stop words).
Implementation Example (Boolean Model)
• Primary Index (term-document incidence):

       Database  Java  OO  Programming  SQL
  D1      0       1    0       1         0
  D2      0       0    1       1         0
  D3      1       0    0       1         1

• Inverted Index:

  Term         Freq.  Postings List
  Database       1    D3
  Java           1    D1
  OO             1    D2
  Programming    3    D1, D2, D3
  SQL            1    D3
Also,
Look at Google's paper:
"The Anatomy of a Large-Scale Hypertextual Web Search Engine"
More Detailed Inverted Index
Introduction to Information Retrieval
Inverted index
For each term t, we must store a list of all documents that contain t.
Identify each by a docID, a document serial number.
Can we use fixed-size arrays for this?

Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

What happens if the word Caesar is added to document 14?
Sec. 1.2
Inverted index
We need variable-size postings lists.
On disk, a continuous run of postings is normal and best.
In memory, can use linked lists or variable-length arrays.
Some tradeoffs in size / ease of insertion.
Dictionary → Postings (each entry in a postings list is a posting).
Sorted by docID (more later on why).

Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Sec. 1.2
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
↓ Tokenizer
Token stream: Friends Romans Countrymen
↓ Linguistic modules
Modified tokens: friend roman countryman
↓ Indexer
Inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
(More on these later.)
Sec. 1.2
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Doc 2
Sec. 1.2
Indexer steps: Sort
Sort by terms, and then by docID.
Core indexing step, why?
Sec. 1.2
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings
Doc. frequency information is added.
Why frequency? Will discuss later.
Sec. 1.2
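The three indexer steps can be sketched end-to-end on the two Caesar snippets (tokenization here is a plain lowercase split, a simplification of the linguistic modules discussed earlier):

```python
# 1) emit (token, docID) pairs, 2) sort by term then docID,
# 3) merge duplicates into a dictionary with doc. frequencies + postings.
docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

pairs = [(tok, doc_id) for doc_id, text in docs.items() for tok in text.split()]
pairs.sort()                                  # by term, then by docID

postings, doc_freq = {}, {}
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:      # multiple entries in one doc merge
        plist.append(doc_id)
        doc_freq[term] = doc_freq.get(term, 0) + 1

print(postings["caesar"], doc_freq["caesar"])   # [1, 2] 2
print(postings["capitol"], doc_freq["capitol"]) # [1] 1
```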
Where do we pay in storage?
• Dictionary: terms and counts
• Postings: lists of docIDs (and pointers to them)
Later in the course:
• How do we index efficiently?
• How much storage do we need?
Sec. 1.2
The index we just built
How do we process a query?
Later - what kinds of queries can we process?
Sec. 1.3
Query processing: AND
Consider processing the query:
Brutus AND Caesar
Locate Brutus in the Dictionary; Retrieve its postings.
Locate Caesar in the Dictionary; Retrieve its postings.
“Merge” the two postings:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Sec. 1.3
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8

If the list lengths are x and y, the merge takes O(x + y) operations.
Crucial: postings sorted by docID.
Sec. 1.3
Intersecting two postings lists (a “merge” algorithm)
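The merge itself is the standard two-pointer walk; a sketch using the Brutus and Caesar postings from the previous slide:

```python
# Intersection of two sorted postings lists in O(x + y) time.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:   # advance the pointer with the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # -> [2, 8]
```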
Boolean queries: Exact match
The Boolean retrieval model is able to answer any query that is a Boolean expression:
• Boolean queries use AND, OR and NOT to join query terms.
• Views each document as a set of words.
• Is precise: a document either matches the condition or it does not.
Perhaps the simplest model to build an IR system on.
Primary commercial retrieval tool for 3 decades.
Many search systems you still use are Boolean: email, library catalogs, Mac OS X Spotlight.
Sec. 1.3
Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992).
Tens of terabytes of data; 700,000 users.
Majority of users still use Boolean queries.
Example query:
What is the statute of limitations in cases involving the federal tort claims act?
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
/3 = within 3 words, /S = in same sentence
Sec. 1.4
Example: WestLaw http://www.westlaw.com/
Another example query:
Requirements for disabled people to be able to access a workplace
disabl! /p access! /s work-site work-place (employment /3 place)
Note that SPACE is disjunction, not conjunction!
Long, precise queries; proximity operators; incrementally developed; not like web search
Many professional searchers still like Boolean search
You know exactly what you are getting
But that doesn’t mean it actually works better….
Sec. 1.4
Boolean queries: More general merges
Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)?
What can we achieve?
Sec. 1.3
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in “linear” time?
Linear in what?
Can we do better?
Sec. 1.3
Query optimization
What is the best order for query processing?
Consider a query that is an AND of n terms.
For each of the n terms, get its postings, then AND them together.
Brutus → 1 2 3 5 8 16 21 34
Caesar → 2 4 8 16 32 64 128
Calpurnia → 13 16

Query: Brutus AND Calpurnia AND Caesar
Sec. 1.3
Query optimization example
Process in order of increasing freq.: start with the smallest set, then keep cutting further.
This is why we kept document freq. in the dictionary.
Execute the query as (Calpurnia AND Brutus) AND Caesar.
Sec. 1.3

Brutus → 1 2 3 5 8 16 21 34
Caesar → 2 4 8 16 32 64 128
Calpurnia → 13 16
More general optimization
e.g., (madding OR crowd) AND (ignoble OR strife)
The question is: in what order should the ANDs be processed?
Get doc. freq.'s for all terms.
Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
Process in increasing order of OR sizes.
Sec. 1.3
Exercise
Recommend a query processing order for
Term          Freq
eyes          213312
kaleidoscope   87009
marmalade     107913
skies         271658
tangerine      46653
trees         316812

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
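A sketch of one way to answer the exercise: estimate each OR clause by the sum of its terms' document frequencies (the conservative estimate from the previous slide) and sort the clauses by that estimate:

```python
# Ordering an AND of OR-clauses by estimated postings size.
freq = {
    "eyes": 213312, "kaleidoscope": 87009, "marmalade": 107913,
    "skies": 271658, "tangerine": 46653, "trees": 316812,
}
clauses = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

ordered = sorted(clauses, key=lambda c: sum(freq[t] for t in c))
for clause in ordered:
    print(clause, sum(freq[t] for t in clause))
```

This recommends processing (kaleidoscope OR eyes) first (estimate 300,321), then (tangerine OR trees) (363,465), then (marmalade OR skies) (379,571).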
Classic IR Models
The Vector Space Model
The Vector Space Model
• Document: bag of terms
• Query: list of terms
• Satisfying:
– Each document is scored by the degree to which it satisfies the query (a non-negative real number).
– A document satisfies the query if its score is > 0.
– Documents are returned in a list sorted by decreasing score:
• Include only non-zero scores.
• Include only the highest n documents, for some n (hints for implementation).
How to compute score? Basic Assumptions
• There is a dictionary (aka lexicon) of all terms, numbering t in all
– Number the terms 1, …, t
• Change the model of a document (temporarily):
– A document is a t-dimensional vector
– The ith entry of the vector is the weight (importance) of term i in the document.
The Vector Space
How to compute the score, continued
• Calculate a function of the document vector and the query vector to get the score of the document with respect to the query.
• Choices:
– Measure the distance between the vectors:
• Dist(d, q) = √( Σ_{i=1}^{t} (d_i − q_i)² )
• Is a dissimilarity measure.
• Not normalized: Dist ranges over [0, ∞).
• Fix: use e^(−Dist), with range in (0, 1].
• Is it the right sense of difference?
How to compute the score, cont'd
• Measure the angle between the vectors:
– Dot product: d • q = Σ_{i=1}^{t} (d_i × q_i)
– Is a similarity measure.
– Not normalized: the dot product ranges over (−∞, ∞).
– Fix: use the normalized dot product, range [−1, 1]:
• sim(d, q) = (d • q) / (‖d‖ × ‖q‖)
aka cosine similarity.
• In practice vector components are non-negative, so the range is [0, 1].
• This is the most commonly used function for scoring.
Cosine Similarity
• Cosine similarity is a measure of similarity between two vectors of n dimensions by finding the cosine of the angle between them.
• Given two vectors of attributes, A and B, the cosine similarity cos(θ) is computed from the dot product and the magnitudes:
cos(θ) = (A • B) / (‖A‖ × ‖B‖)
Credit: http://www10.org/cdrom/papers/519/node12.html
(figure: vectors A and B with the angle θ between them)
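The definition above, as a minimal pure-Python function (no external libraries):

```python
import math

# Cosine similarity: dot product divided by the product of magnitudes.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / mag

print(cosine([1.0, 0.0], [1.0, 0.0]))  # parallel vectors   -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
```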
How to Compute Weights of Documents
• The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval .
• It is used to evaluate how important a word is to a document in a collection.
• Two factors:
– How frequent the term is in the document (the more frequent, the more important).
– How frequent the term is in the collection of documents (the less frequent, the more important it is to the current document).
tf (Term Frequency)
• The term count in the given document is simply the number of times a given term appears in that document.
• Usually normalized (why?) to prevent a bias towards longer documents.
– e.g., divide by the number of all terms in the document.
• tf is computed as follows:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
• n_{i,j} is the number of occurrences of the considered term (t_i) in document d_j.
• The denominator is the sum of the numbers of occurrences of all terms in document d_j.
idf (Inverse Document Frequency)
• idf is a measure of the general importance of the term.
• Can be computed as follows:
• Document Frequency: df_i = |{d : t_i ∈ d}| / |D| = (# docs where t_i appears) / (# of all docs)
• idf_i = log( |D| / (|{d : t_i ∈ d}| + 1) )
• Where:
– |D| is the total number of documents in the corpus.
– |{d : t_i ∈ d}| is the number of documents where the term t_i appears.
– "1" is usually added to the denominator to prevent division by zero.
tf (Term Frequency), another way to normalize
• Let:
– N be the total number of documents in the dataset.
– n_i be the total number of documents that contain term i.
– freq_{i,j} be the total number of occurrences of term i in doc j.
• IDF_i = log₂(N / n_i) = log₂ N − log₂ n_i
• Now,
• W_{i,j} = tf_{i,j} × idf_i = freq_{i,j} × (log N − log n_i)
• We have to normalize, why?
• W_{i,j} = tf_{i,j} × idf_i = ( freq_{i,j} / max(freq L_j) ) × (log N − log n_i)
• max(freq L_j) = the frequency of the most frequent term L in document j.
• We may also need to add (+1) to: log N − log n_i. Why?
tf-idf
• The weight of a term in a document is:
Wi,j = tfi,j * idfi
• What does it do?
• It usually filters out common terms.
• What about ranking?
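A minimal tf-idf sketch over a toy corpus (the corpus is hypothetical, and the "+1" smoothing discussed earlier is omitted for brevity):

```python
import math

# W[i][j] = tf(i, j) * idf(i): length-normalized term count times
# log of inverse document frequency.
docs = [
    ["to", "be", "or", "not", "to", "be"],
    ["to", "do", "is", "to", "be"],
    ["do", "be", "do"],
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("be", docs[0], docs))      # "be" is in every doc -> idf 0 -> weight 0.0
print(tf_idf("or", docs[0], docs) > 0)  # rarer term -> positive weight -> True
```

This shows the filtering effect mentioned above: a term that appears in every document gets weight zero.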
Vector Space Model Example
• Doc 1: “Computers have brought the world to our fingertips. We will try to understand at a basic level the science -- old and new -- underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific media and related technologies… Ultimately, this study makes us look anew at ourselves -- our genome; language; music; "knowledge"; and, above all, the mystery of our intelligence.”
• Frequencies: science 1; knowledge 1; principles 0; engineering 0
• Doc 2: “An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science …”
• Frequencies: science 2; knowledge 0; principles 1; engineering 1
Example, cont’d
• Consider having 5 documents in the collection.
• The Idf for terms in the previous example are:
– science ln(5/2) = 0.51
– engineering, principles, knowledge: ln(5/1) = 1.6
Ranking
• Term-by-document table: freq_{i,j} × log(N / n_i).
• Using the un-normalized dot product for the query: science, engineering, knowledge, principles.
• Also, using a 0/1 query vector, we get:
– Cosine(Doc1, Q) = 0.589
– Cosine(Doc2, Q) = 0.807

              Doc 1   Doc 2   Query
science       0.51    1.02    0.51
engineering   0       1.6     1.6
principles    0       1.6     1.6
knowledge     1.6     0       1.6
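The cosine scores can be re-derived from the table's weight columns; the small differences from the slide's reported 0.589 and 0.807 are likely due to rounding in the tabulated weights:

```python
import math

# Weight vectors in the order (science, engineering, principles, knowledge),
# taken from the table above.
doc1  = [0.51, 0.0, 0.0, 1.6]
doc2  = [1.02, 1.6, 1.6, 0.0]
query = [0.51, 1.6, 1.6, 1.6]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(round(cosine(doc1, query), 2))  # ~0.60 (slide reports 0.589)
print(round(cosine(doc2, query), 2))  # ~0.81 (slide reports 0.807)
```

Either way, Doc 2 ranks above Doc 1, which is the point of the example.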
Vector Space Model (Summary)
• Advantages
– The concept of Ranking.
– Not difficult to implement
– Shown to be effective
• Disadvantages
– What threshold to choose?
– Term Independence
– Term Weights