introduction to digital libraries searching. technical view: retrieval as matching documents to...
TRANSCRIPT
![Page 1: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/1.jpg)
Introduction to Digital Libraries
Searching
![Page 2: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/2.jpg)
Technical View: Retrieval as Matching Documents to Queries
[Figure labels: Document Space, Sample, Surrogates (Terms, Vectors, etc.); Query Space, Sample, Surrogates (Query Form A, Query Form B, etc.); Match Algorithm]
Retrieval is algorithmic. Evaluation is typically a binary decision for each pairwise match and one or more aggregate values for a set of matches (e.g., recall and precision).
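The aggregate measures named on this slide can be sketched as follows, assuming the usual definitions (precision = relevant retrieved / retrieved; recall = relevant retrieved / relevant). The function name and the document IDs are hypothetical and only serve the illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments (not from the slides).
p, r = precision_recall(retrieved={"d1", "d2", "d3", "d4"},
                        relevant={"d2", "d4", "d7"})
print(p, r)
```

Each pairwise match contributes a binary hit or miss; the two ratios summarize the whole result set.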
![Page 3: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/3.jpg)
Human View: Information-Seeking Process
[Figure labels: Problem, Perceived Needs, Queries, Physical Interface, Indexes, Data, Results, Actions]
Information seeking is an active, iterative process controlled by a human who changes throughout the process. Evaluation is relative to human needs.
![Page 4: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/4.jpg)
IR Models
• User Task
– Retrieval: Adhoc, Filtering
– Browsing
• Classic Models
– Boolean, vector, probabilistic
• Set Theoretic
– Fuzzy, Extended Boolean
• Algebraic
– Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic
– Inference Network, Belief Network
• Structured Models
– Non-Overlapping Lists, Proximal Nodes
• Browsing
– Flat, Structure Guided, Hypertext
![Page 5: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/5.jpg)
“Classic” Retrieval Models
• Boolean
– Documents and queries are sets of index terms
• Vector
– Documents and queries are vectors in N-dimensional space
• Probabilistic
– Based on probability theory
![Page 6: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/6.jpg)
Boolean Searching
• Exactly what you would expect
– and, or, not operations defined
• requires an exact match
• based on inverted file
• (computer and science) and (not(animals)) would prevent a document with “use of computers in animal science research” from being retrieved
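A minimal sketch of Boolean retrieval over an inverted file, using Python sets for AND, OR and NOT. The three document texts are hypothetical, and the sketch assumes terms are already normalized, since exact matching would otherwise treat "animal" and "animals" as different terms.

```python
from collections import defaultdict

# Hypothetical mini-collection; the texts are made up for illustration.
docs = {
    1: "computer science curriculum",
    2: "computer science for animals research",
    3: "science of animals",
}

# Inverted file: term -> set of IDs of the documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

all_ids = set(docs)

# (computer AND science) AND (NOT animals)
result = (index["computer"] & index["science"]) & (all_ids - index["animals"])
print(sorted(result))
```

AND maps to set intersection, OR to union, and NOT to the complement with respect to the full set of document IDs, so document 2 is excluded even though it matches both positive terms.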
![Page 7: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/7.jpg)
Boolean ‘AND’
• Information AND Retrieval
[Venn diagram: the overlap of the “Information” and “Retrieval” circles]
![Page 8: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/8.jpg)
Example
• Draw a Venn diagram for: Care and feeding and (cats or dogs)
• What is the meaning of: Information and retrieval and performance or evaluation
![Page 9: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/9.jpg)
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• Q1 = “information retrieval”
• Q2 = “information ¬computer”
![Page 10: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/10.jpg)
Boolean-based Matching
• Exact match systems; separate the documents containing a given term from those that do not.
Documents (rows) by Terms (columns, left to right): adventure, agriculture, bridge, cathedrals, disasters, flags, horticulture, leprosy, Mediterranean, recipes, scholarships, tennis, Venus

0 0 1 1 0 0 0 0 1 1 0 0 0
0 1 1 0 0 0 0 0 0 0 1 1 0
1 0 1 0 1 0 0 1 0 0 0 0 1
1 1 0 0 0 1 1 0 0 0 0 1 0
Queries
(bridge OR flags) AND tennis
flags AND tennis
leprosy AND tennis
Venus OR (tennis AND flags)
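The four queries can be evaluated mechanically against the incidence matrix. This sketch assumes each row of 0s and 1s is one document and that the columns follow the term order listed on the slide; the labels D1 to D4 are introduced here only for illustration.

```python
# Term order and 0/1 rows as listed on the slide; labels D1..D4 are assumed.
terms = ["adventure", "agriculture", "bridge", "cathedrals", "disasters",
         "flags", "horticulture", "leprosy", "Mediterranean", "recipes",
         "scholarships", "tennis", "Venus"]
rows = {
    "D1": "0011000011000",
    "D2": "0110000000110",
    "D3": "1010100100001",
    "D4": "1100011000010",
}

# Postings: term -> set of documents whose row has a 1 in that term's column.
p = {
    term: {doc for doc, bits in rows.items() if bits[col] == "1"}
    for col, term in enumerate(terms)
}

answers = {
    "(bridge OR flags) AND tennis": (p["bridge"] | p["flags"]) & p["tennis"],
    "flags AND tennis": p["flags"] & p["tennis"],
    "leprosy AND tennis": p["leprosy"] & p["tennis"],
    "Venus OR (tennis AND flags)": p["Venus"] | (p["tennis"] & p["flags"]),
}
for query, matched in answers.items():
    print(query, "->", sorted(matched))
```

Under that reading of the matrix, a query with no co-occurring terms (leprosy AND tennis) returns the empty set, which is the exact-match behaviour the slide describes.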
![Page 11: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/11.jpg)
Exercise
0
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
![Page 12: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/12.jpg)
Boolean features
• Order dependency of operators
– ( ), NOT, AND, OR (DIALOG)
– May differ on different systems
• Nesting of search terms
– Nutrition and (fast or junk) and food
![Page 13: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/13.jpg)
Boolean Limitations
• Searches can become complex for the average user
– too much ANDing can clobber recall
– tricky syntax:
“research AND NOT computer science”
“research AND NOT (computer science)” (implicit OR)
“research AND NOT (computer AND science)”
all different -- (frequently seen in NTRS logs)
![Page 14: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/14.jpg)
Vector Model
• Calculate degree of similarity between document and query
• Ranked output by sorting similarity values
• Also called ‘vector space model’
• Imagine your documents as N-dimensional vectors (where N=number of words)
• The “closeness” of 2 documents can be expressed as the cosine of the angle between the two vectors
![Page 15: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/15.jpg)
Vector Space Model
• Documents and queries are points in N-dimensional space (where N is number of unique index terms in the data collection)
[Figure: a document vector D and a query vector Q]
![Page 16: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/16.jpg)
Vector Space Model with Term Weights
• assume document terms have different values for retrieval
• therefore assign weights to each term in each document
– example:
• proportional to frequency of term in document
• inversely proportional to frequency of term in collection
![Page 17: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/17.jpg)
Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q plotted in the three-dimensional space with axes T1, T2 and T3]
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
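The three candidate measures can be tried directly on the example vectors. This is a sketch, not the slide's own answer: the helper names are hypothetical, and "projection" is taken here to mean the length of the document's projection onto the query direction.

```python
import math

D1 = (2, 3, 5)
D2 = (3, 7, 1)
Q = (0, 0, 2)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def distance(a, b):
    """Euclidean distance between the end points of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def projection(d, q):
    """Length of the projection of d onto the direction of q."""
    return dot(d, q) / math.sqrt(dot(q, q))


for name, d in (("D1", D1), ("D2", D2)):
    print(name, round(cosine(d, Q), 3), round(distance(d, Q), 3), projection(d, Q))
```

For these particular vectors all three measures agree that D1 is closer to Q than D2, but in general they can rank documents differently.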
![Page 18: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/18.jpg)
Document and Query Vectors
• Documents and Queries are vectors of terms
• Vectors can use binary keyword weights or assume 0-1 weights (term frequencies)
• Example terms: “dog”, “cat”, “house”, “sink”, “road”, “car”
• Binary: (1,1,0,0,0,0), (0,0,1,1,0,0)
• Weighted: (0.01, 0.01, 0.002, 0.0, 0.0, 0.0)
![Page 19: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/19.jpg)
Document Collection Representation
• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn
![Page 20: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/20.jpg)
Inner Product: Example 1
      k1   k2   k3   q • dj
d1    1    0    1    2
d2    1    0    0    1
d3    0    1    1    2
d4    1    0    0    1
d5    1    1    1    3
d6    1    1    0    2
d7    0    1    0    1

q     1    1    1

[Venn diagram: documents d1 to d7 placed in the regions of the overlapping sets k1, k2 and k3]
![Page 21: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/21.jpg)
Vector Space Example
indexed words: factors information help human operation retrieval systems

Query: human factors in information retrieval systems
Vector: (1 1 0 1 0 1 1)
Record 1 contains: human, factors, information, retrieval
Vector: (1 1 0 1 0 1 0)
Record 2 contains: human, factors, help, systems
Vector: (1 0 1 1 0 0 1)
Record 3 contains: factors, operation, systems
Vector: (1 0 0 0 1 0 1)

Simple Match
Query (1 1 0 1 0 1 1)
Rec1  (1 1 0 1 0 1 0)
      (1 1 0 1 0 1 0) = 4

Query (1 1 0 1 0 1 1)
Rec2  (1 0 1 1 0 0 1)
      (1 0 0 1 0 0 1) = 3

Query (1 1 0 1 0 1 1)
Rec3  (1 0 0 0 1 0 1)
      (1 0 0 0 0 0 1) = 2

Weighted Match
Query (1 1 0 1 0 1 1)
Rec1  (2 3 0 5 0 3 0)
      (2 3 0 5 0 3 0) = 13

Query (1 1 0 1 0 1 1)
Rec2  (2 0 4 5 0 0 1)
      (2 0 0 5 0 0 1) = 8

Query (1 1 0 1 0 1 1)
Rec3  (2 0 0 0 2 0 1)
      (2 0 0 0 0 0 1) = 3
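Both matches are inner products of the query vector with a record vector: the simple match uses binary record vectors, the weighted match uses the weighted ones. A short sketch that recomputes the scores from the vectors on the slide (the function name `match` is hypothetical):

```python
query = (1, 1, 0, 1, 0, 1, 1)

binary = {
    "Rec1": (1, 1, 0, 1, 0, 1, 0),
    "Rec2": (1, 0, 1, 1, 0, 0, 1),
    "Rec3": (1, 0, 0, 0, 1, 0, 1),
}
weighted = {
    "Rec1": (2, 3, 0, 5, 0, 3, 0),
    "Rec2": (2, 0, 4, 5, 0, 0, 1),
    "Rec3": (2, 0, 0, 0, 2, 0, 1),
}


def match(q, d):
    """Inner product: multiply position by position, then sum."""
    return sum(qi * di for qi, di in zip(q, d))


simple_scores = {name: match(query, vec) for name, vec in binary.items()}
weighted_scores = {name: match(query, vec) for name, vec in weighted.items()}
print(simple_scores)
print(weighted_scores)
```

The ranking (Rec1, Rec2, Rec3) is the same in both cases here, but the weighted scores separate the records more strongly.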
![Page 22: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/22.jpg)
Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic. fij = frequency of term i in document j
• May want to normalize term frequency (tf) across the entire corpus: tfij = fij / max{fij}
![Page 23: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/23.jpg)
Some formulas for Sim
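With $a_i$ and $b_i$ the weights of term $i$ in D and Q:

Dot product: $Sim(D,Q) = \sum_i (a_i * b_i)$

Cosine: $Sim(D,Q) = \frac{\sum_i (a_i * b_i)}{\sqrt{\sum_i a_i^2 * \sum_i b_i^2}}$

Dice: $Sim(D,Q) = \frac{2 \sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2}$

Jaccard: $Sim(D,Q) = \frac{\sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i (a_i * b_i)}$

[Figure: vectors D and Q plotted against term axes t1 and t2]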
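A sketch of the four similarity measures in their common vector forms, written so that each function mirrors one formula. The example vectors D and Q are hypothetical binary vectors, not taken from the slides.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def dice(a, b):
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))


def jaccard(a, b):
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))


# Hypothetical binary term vectors.
D = (1, 1, 0, 1)
Q = (1, 0, 0, 1)

for measure in (dot, cosine, dice, jaccard):
    print(measure.__name__, round(measure(D, Q), 4))
```

Unlike the raw dot product, the other three are normalized by vector length, so a long document does not score higher merely because it contains more terms.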
![Page 24: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/24.jpg)
Example
• Documents: Austen's Sense and Sensibility, Pride and Prejudice; Bronte's Wuthering Heights
• cos(SAS, PAP) = .996 x .993 + .087 x .120 + .017 x 0.0 = 0.999
• cos(SAS, WH) = .996 x .847 + .087 x .466 + .017 x .254 = 0.888

Term counts:
            SaS     PaP     WH
affection   115     58      20
jealous     10      7       11
gossip      2       0       6

Length-normalized weights:
            SaS     PaP     WH
affection   0.996   0.993   0.847
jealous     0.087   0.120   0.466
gossip      0.017   0.000   0.254
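The weights in the second table are the raw counts divided by the length of each document's count vector, and the cosine is then the dot product of two such unit vectors. A sketch that recomputes both cosines from the raw counts (helper names are hypothetical); working from unrounded values it gives roughly 0.999 for SaS/PaP and 0.889 for SaS/WH.

```python
import math

# Raw counts in the order (affection, jealous, gossip), from the first table.
counts = {
    "SaS": (115, 10, 2),
    "PaP": (58, 7, 0),
    "WH": (20, 11, 6),
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def cosine(a, b):
    """Cosine similarity as the dot product of the length-normalized vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


print([round(w, 3) for w in unit(counts["SaS"])])
print(round(cosine(counts["SaS"], counts["PaP"]), 3))
print(round(cosine(counts["SaS"], counts["WH"]), 3))
```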
![Page 25: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/25.jpg)
Extended Boolean Model
• Boolean model is simple and elegant.
• But, no provision for a ranking
• As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
• Extend the Boolean model with the notions of partial matching and term weighting
• Combine characteristics of the Vector model with properties of Boolean algebra
![Page 26: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/26.jpg)
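The Idea
• $q_{or} = k_x \lor k_y$; $w_{xj} = x$ and $w_{yj} = y$
• $sim(q_{or}, d_j) = \sqrt{\frac{x^2 + y^2}{2}}$
[Figure (OR): documents $d_j$ and $d_{j+1}$ plotted in the unit square from (0,0) to (1,1), with axis $k_x$ ($x = w_{xj}$) and axis $k_y$ ($y = w_{yj}$)]
We want a document to be as far as possible from (0,0)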
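A small sketch of the OR similarity: the document's distance from (0,0), divided by the square root of 2 so that the point (1,1) scores exactly 1. The function name `sim_or` is hypothetical, and the weights are assumed to lie between 0 and 1.

```python
import math


def sim_or(x, y):
    """Similarity of a document with term weights (x, y) to the query kx OR ky."""
    return math.sqrt((x ** 2 + y ** 2) / 2)


print(sim_or(0.0, 0.0))  # document containing neither term
print(sim_or(1.0, 0.0))  # document fully weighted on one term only
print(sim_or(1.0, 1.0))  # document fully weighted on both terms
```

Unlike a strict Boolean OR, a document with only one of the two terms gets a partial score rather than the same score as a document with both.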
![Page 27: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/27.jpg)
Fuzzy Set Model
• Queries and docs are represented by sets of index terms: matching is approximate from the start
• This vagueness can be modeled using a fuzzy framework, as follows:
– a fuzzy set is associated with each term
– each doc has a degree of membership in this fuzzy set
• This interpretation provides the foundation for many models for IR based on fuzzy theory
![Page 28: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/28.jpg)
Probabilistic Model
• Views retrieval as an attempt to answer a basic question: “What is the probability that this document is relevant to this query?”
• expressed as: P(REL|D)
i.e., the probability of relevance given a particular document D
![Page 29: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/29.jpg)
Probabilistic Model
• An initial set of documents is retrieved somehow
• User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected)
• The system uses this information to refine the description of the ideal answer set
• By repeating this process, it is expected that the description of the ideal answer set will improve
• Always keep in mind the need to guess the description of the ideal answer set at the very beginning
• Description of the ideal answer set is modeled in probabilistic terms
![Page 30: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/30.jpg)
Recombination after dimensionality reduction
![Page 31: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/31.jpg)
Classic IR Models
• Vector vs. probabilistic
“Numerous experiments demonstrate that probabilistic retrieval procedures yield good results. However, the results have not been sufficiently better than those obtained using Boolean or vector techniques to convince system developers to move heavily in this direction.”
![Page 32: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/32.jpg)
Example
• Build the inverted file for the following documents
• F1={Written Quiz for Algorithms and Techniques of Information Retrieval}
• F2={Program Quiz for Algorithms and Techniques of Web Search}
• F3={Search on the Web for Information on Algorithms}
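One way to work the exercise is sketched below. It assumes terms are lowercased and that the function words "for", "and", "of", "on" and "the" are treated as stopwords and not indexed; the slides do not say which terms to drop, so that list is an assumption.

```python
from collections import defaultdict

files = {
    "F1": "Written Quiz for Algorithms and Techniques of Information Retrieval",
    "F2": "Program Quiz for Algorithms and Techniques of Web Search",
    "F3": "Search on the Web for Information on Algorithms",
}

# Assumption: these function words are not indexed.
STOPWORDS = {"for", "and", "of", "on", "the"}

# Inverted file: term -> list of the files that contain it.
inverted = defaultdict(list)
for name, text in files.items():
    for term in set(text.lower().split()) - STOPWORDS:
        inverted[term].append(name)

for term in sorted(inverted):
    print(term, inverted[term])
```

Each entry lists the files in which the term occurs; a fuller inverted file could also store term positions or frequencies in each posting.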
![Page 33: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/33.jpg)
Example
• You have the collection of documents that contain the following index terms:
• D1: alpha bravo charlie delta echo foxtrot golf
• D2: golf golf golf delta alpha
• D3: bravo charlie bravo echo foxtrot bravo
• D4: foxtrot alpha alpha golf golf delta
• Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.
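A possible way to set up this exercise is sketched below. It reads the weighting literally as w = tf / df and uses the cosine as the similarity measure; both choices are assumptions, since the exercise does not fix them (tf * log(N / df) is a common alternative weighting).

```python
import math
from collections import Counter

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# Frequency matrix: term counts per document.
tf = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({term for counts in tf.values() for term in counts})

# Document frequency: number of documents that contain each term.
df = {term: sum(1 for counts in tf.values() if term in counts) for term in vocab}

# Assumed weighting: proportional to tf, inversely proportional to df.
weights = {name: [counts[t] / df[t] for t in vocab] for name, counts in tf.items()}


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


names = list(docs)
sim = {(a, b): cosine(weights[a], weights[b]) for a in names for b in names}
for a in names:
    print(a, [round(sim[a, b], 2) for b in names])
```

The matrix is symmetric with 1.0 on the diagonal, and documents with no terms in common (D2 and D3) get similarity 0.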
![Page 34: Introduction to Digital Libraries Searching. Technical View: Retrieval as Matching Documents to Queries Document Space Sample Query Space Surrogates Terms](https://reader030.vdocument.in/reader030/viewer/2022033101/56649eca5503460f94bd8100/html5/thumbnails/34.jpg)
Terms       Documents
            c1  c2  c3  c4  c5  m1  m2  m3  m4
human       1   0   0   1   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0
user        0   1   1   0   1   0   0   0   0
system      0   1   1   2   0   0   0   0   0
response    0   1   0   0   1   0   0   0   0
time        0   1   0   0   1   0   0   0   0
EPS         0   0   1   1   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1
trees       0   0   0   0   0   1   1   1   0
graph       0   0   0   0   0   0   1   1   1
minors      0   0   0   0   0   0   0   1   1
Give the scores of the 9 documents for the query trees, minors using Boolean search
Give the scores of the 9 documents for the query trees, minors using the vector model.
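A sketch of how the two scorings could be computed. The exercise leaves two things open, so both are assumptions here: the Boolean query is shown under both an AND and an OR reading, and the vector score is the unnormalized inner product with a query vector that has weight 1 for "trees" and "minors".

```python
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# The two query-term rows of the term-document matrix.
trees = [0, 0, 0, 0, 0, 1, 1, 1, 0]
minors = [0, 0, 0, 0, 0, 0, 0, 1, 1]

# Boolean scores are 0 or 1 (assumed readings: trees AND minors, trees OR minors).
boolean_and = {d: int(t > 0 and m > 0) for d, t, m in zip(docs, trees, minors)}
boolean_or = {d: int(t > 0 or m > 0) for d, t, m in zip(docs, trees, minors)}

# Vector score: inner product with the query; only the two query terms contribute.
vector = {d: t + m for d, t, m in zip(docs, trees, minors)}

print(boolean_and)
print(boolean_or)
print(vector)
```

The Boolean scores only split the documents into matching and non-matching, while the vector scores also rank the matching ones: m3, which contains both terms, comes first.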