Web- and Multimedia-based Information Systems
Lecture 2
TRANSCRIPT
Vector Model
Non-binary weights
Degree of similarity
Result ranking possible
Fast & good results
Vector Model
Document: vector with a weight for every index term
Query: vector with a weight for every index term
Vectors have the dimension of the total number of index terms in the collection
Documents in Vector Space
[Figure: documents D1–D11 plotted as points in a vector space spanned by the term axes t1, t2, t3]
Vector Model
Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t
The weight of the term is stored in each position; a weight of 0 means the term is absent

$$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$$
$$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$$
Vector Model
Cosine of the angle between the vectors is taken as the similarity measure
Sorting/ranking of results
Threshold for results
More precise answers: the most relevant docs appear at the top
Similarity Function

$$\mathrm{sim}(D_i, D_j) = \cos(\vec{d}_i, \vec{d}_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{|\vec{d}_i| \, |\vec{d}_j|}$$

$$\mathrm{sim}(D_i, D_j) = \frac{\sum_{k=1}^{t} w_{ik} \, w_{jk}}{\sqrt{\sum_{k=1}^{t} w_{ik}^2} \, \sqrt{\sum_{k=1}^{t} w_{jk}^2}}$$
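The cosine similarity measure above can be sketched in Java (the assignment language). The array layout — one weight per index term, as in the vector definitions — is assumed:

```java
// Cosine similarity between two term-weight vectors:
// sim = (di . dj) / (|di| * |dj|)
public class CosineSimilarity {
    static double sim(double[] di, double[] dj) {
        double dot = 0, normI = 0, normJ = 0;
        for (int k = 0; k < di.length; k++) {
            dot += di[k] * dj[k];       // sum of w_ik * w_jk
            normI += di[k] * di[k];
            normJ += dj[k] * dj[k];
        }
        if (normI == 0 || normJ == 0) return 0; // all-zero vector: define sim as 0
        return dot / (Math.sqrt(normI) * Math.sqrt(normJ));
    }

    public static void main(String[] args) {
        double[] d1 = {2, 0, 3};  // raw term weights of D1 from the table below
        double[] d5 = {1, 6, 3};  // raw term weights of D5
        System.out.println(sim(d1, d1)); // identical vectors -> 1.0
        System.out.println(sim(d1, d5));
    }
}
```

Identical vectors yield 1.0, orthogonal vectors (no shared terms) yield 0.0, so results can be ranked by decreasing similarity.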
Vector Model: Index Term Weighting
Binary weights
Raw term weights
Term frequency x inverse document frequency
Binary Weights
Only the presence (1) or absence (0) of a term is included in the vector
docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1
Raw Term Weights
The frequency of occurrence for the term in each document is included in the vector
docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
Term frequency x Inverse document frequency

$$w_{ik} = tf_{ik} \cdot \log(N / n_k)$$

– $T_k$: term $k$ in document $D_i$
– $tf_{ik}$: frequency of term $T_k$ in document $D_i$
– $idf_k$: inverse document frequency of term $T_k$ in collection $C$, with $idf_k = \log(N / n_k)$
– $N$: total number of documents in the collection $C$
– $n_k$: number of documents in $C$ that contain $T_k$

Term frequency normalized by the most frequent term in the document:

$$tf_{ik} = \frac{freq_{ik}}{\max_{l} freq_{il}}$$
IDF Example
IDF provides high values for rare words and low values for common words
With N = 10000 documents in the collection:

$$idf = \log\frac{10000}{1} = 4$$
$$idf = \log\frac{10000}{20} = 2.698$$
$$idf = \log\frac{10000}{5000} = 0.301$$
$$idf = \log\frac{10000}{10000} = 0$$
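The weighting scheme above can be sketched in Java; the base-10 logarithm is assumed, since it reproduces the example values:

```java
// tf-idf weighting: w_ik = tf_ik * log10(N / n_k)
public class TfIdf {
    // idf_k = log10(N / n_k); base-10 log matches the IDF example above
    static double idf(int N, int nk) {
        return Math.log10((double) N / nk);
    }

    // w_ik = tf_ik * idf_k, with tf normalized by the most frequent
    // term of the document (maxFreq)
    static double weight(int freq, int maxFreq, int N, int nk) {
        double tf = (double) freq / maxFreq;
        return tf * idf(N, nk);
    }

    public static void main(String[] args) {
        int N = 10000;
        System.out.println(idf(N, 1));     // 4.0  (rare term)
        System.out.println(idf(N, 20));    // ~2.699
        System.out.println(idf(N, 5000));  // ~0.301
        System.out.println(idf(N, N));     // 0.0  (term in every document)
    }
}
```

As the example shows, a term occurring in every document gets weight 0 and thus cannot discriminate between documents.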
Probabilistic Model
Based on probability theory
For every document, a probability is calculated for:
– the document being relevant to the query
– the document being irrelevant to the query
Documents that are more likely relevant than irrelevant are ranked in decreasing order of relevance
Text Operations in Detail
Goal: automated generation of index terms
Trade-off: all terms conveying meaning vs. space requirements
Rules for extraction from documents:
– Rules for division of terms: punctuation, dashes
– List of stop words: articles, prepositions, conjunctions
Word-oriented Reduction Schemes
Lemmatisation: smaller term lists, generalization of terms
Methods:
– Reduction to the infinitive
– Reduction to a stem
Algorithmic methods work for English; for German:
– Biggest problems: prefixes & compounds
– Only possible with dictionaries: explicit listing of all forms, or rules to derive forms
Stemming
Different methods; most efficient: affix removal
– Porter algorithm
– To be implemented later
– Series of rules to strip suffixes, e.g. s -> nil, sses -> ss
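The suffix rules above can be sketched in Java. The slide shows two of the four rules of Porter's step 1a; the other two (ies -> i, ss -> ss) are added here so that words like "caress" are not over-stripped — a sketch, not the full Porter algorithm:

```java
// Suffix stripping in the spirit of the Porter algorithm (step 1a only).
public class Stemmer {
    static String step1a(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2); // sses -> ss
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2); // ies  -> i
        if (w.endsWith("ss"))   return w;                              // ss   -> ss
        if (w.endsWith("s"))    return w.substring(0, w.length() - 1); // s    -> nil
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step1a("caresses")); // caress
        System.out.println(step1a("ponies"));   // poni
        System.out.println(step1a("caress"));   // caress
        System.out.println(step1a("cats"));     // cat
    }
}
```

Rule order matters: the longest matching suffix must be tried first, otherwise "caresses" would lose only its final "s".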
Word Type Index Term Selection
Nouns usually convey the most meaning
Elimination of other word types
Clustering of compounds (e.g. "computer science"):
– Noun groups
– Maximum distance between terms
Thesauri
„Treasury of words“
For every entry:
– Definition
– Synonyms
Useful within a specific knowledge domain where a controlled vocabulary can easily be obtained
Difficult with a large and dynamic document collection such as the web
Creation of Inverted List
1. Create vocabulary
2. Note document and position in document for each term
3. Sort the list (first by terms, then by positions)
4. Split terms & positions
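The inverted-list construction sketched above can be written compactly in Java. Sorted maps stand in for the explicit sort step, and whitespace tokenization is a simplifying assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Inverted list: term -> (docId -> positions of the term in that doc).
// TreeMap keeps terms (and doc ids) sorted, replacing the explicit
// "sort by terms, then by positions" step; positions are appended in
// scan order and are therefore already sorted.
public class InvertedIndex {
    static Map<String, TreeMap<Integer, List<Integer>>> build(List<String> docs) {
        Map<String, TreeMap<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).toLowerCase().split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                index.computeIfAbsent(terms[pos], t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        var index = build(List.of("web information systems",
                                  "multimedia information"));
        System.out.println(index.get("information")); // {0=[1], 1=[1]}
    }
}
```

Keeping positions (not just document ids) is what later enables phrase and proximity queries.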
Basic Query
1. Isolate the terms of the query
2. Get the pointer to the positions for every term
3. Conduct set operations
4. Get the result documents and present them
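A minimal Java sketch of the set-operations step, assuming the posting lists have already been reduced to document-ID sets; an AND query is then a set intersection:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// AND query: intersect the document sets of all query terms.
public class BasicQuery {
    static Set<Integer> andQuery(Map<String, Set<Integer>> postings,
                                 String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(term, Set.of());
            if (result == null) result = new TreeSet<>(docs); // first term
            else result.retainAll(docs);                      // set intersection
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = Map.of(
            "web",         Set.of(1, 3, 5),
            "information", Set.of(1, 2, 5),
            "systems",     Set.of(2, 5));
        System.out.println(andQuery(postings, "web", "information")); // [1, 5]
    }
}
```

OR and NOT queries follow the same pattern with union and set difference in place of the intersection.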
Advanced Query Functionality
Comparison operators for metadata
Strings of multiple terms
More general: take distance and order of terms into account
Truncation (wildcards)
Information Retrieval System Evaluation
Functionality analysis
Performance:
– Time
– Space
Retrieval performance:
– Batch vs. interactive mode
Retrieval Performance Measures
Recall:
– The fraction of the relevant documents that have been retrieved
Precision:
– The fraction of the retrieved documents that are relevant
Precision vs. Recall
The user does not usually inspect all results
Example: relevant documents R = {d2, d5}
Result ranking returned by the system: 1. d1, 2. d5, 3. d2
After the second result, recall is 50% (1 of 2 relevant documents retrieved) and precision is also 50% (1 of 2 retrieved documents relevant)
After the third result, recall is 100% and precision is 67%
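Recall and precision at a cutoff k follow directly from the definitions; a Java sketch reproducing the example numbers:

```java
import java.util.List;
import java.util.Set;

// Recall/precision after inspecting the top k results of a ranking.
public class Evaluation {
    static long hits(List<String> ranking, Set<String> relevant, int k) {
        return ranking.subList(0, k).stream().filter(relevant::contains).count();
    }

    // fraction of the relevant documents retrieved within the top k
    static double recallAt(List<String> ranking, Set<String> relevant, int k) {
        return (double) hits(ranking, relevant, k) / relevant.size();
    }

    // fraction of the top-k retrieved documents that are relevant
    static double precisionAt(List<String> ranking, Set<String> relevant, int k) {
        return (double) hits(ranking, relevant, k) / k;
    }

    public static void main(String[] args) {
        List<String> ranking = List.of("d1", "d5", "d2"); // system output
        Set<String> relevant = Set.of("d2", "d5");        // ground truth R
        System.out.println(recallAt(ranking, relevant, 2));    // 0.5
        System.out.println(precisionAt(ranking, relevant, 2)); // 0.5
        System.out.println(recallAt(ranking, relevant, 3));    // 1.0
        System.out.println(precisionAt(ranking, relevant, 3)); // ~0.667
    }
}
```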
Programming Assignment
Different part each week
Goal: a web search engine
WWW Search Engine
[Architecture diagram: a WWW client sends a query to the search engine and receives a result list; the search engine answers queries from the index; the robot sends requests to WWW servers and receives files; the retrieved documents go to the indexer, which stores the index in a DB]
Assignment Part 1
Program a web robot:
– Starts at a user-defined URL
– Navigates the web via hypertext links
– Speaks HTTP (see RFC 1945)
– Stores the path it took (URLs), preferably in a tree-like data structure
– Stores the result code & important header fields for every request to disk, in a format suitable for further processing
Assignment Part 1 (cont.)
– Implementation in Java
– Pure TCP socket communication
– No need to save documents in this assignment
– The robot shall identify itself via the HTTP User-Agent header
– Extensibility required for future assignments
Example HTTP session
telnet www 80                    <- open TCP connection
GET / HTTP/1.0                   <- HTTP request, terminated by an empty line (<CRLF>)

HTTP/1.0 200 Document follows    <- response status line and headers
Date: Tue, 10 Sep 1996 14:34:06 GMT
Server: NCSA/1.4.2
Content-type: image/gif
Last-modified: Tue, 10 Sep 1996 13:25:26 GMT
Content-length: 9755
                                 <- empty line (<CRLF>), then start of content
<HTML>
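The session above can be sketched as a Java robot using a raw TCP socket, as the assignment requires. The User-Agent value is a placeholder, and a real robot would go on to parse status line, headers, and body:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Minimal HTTP/1.0 exchange over a plain TCP socket (no java.net.URL helpers).
public class MiniRobot {
    // Build an HTTP/1.0 GET request; headers end with an empty CRLF line.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "User-Agent: lecture2-robot/0.1\r\n" // robot identifies itself
             + "\r\n";                              // blank line ends the headers
    }

    public static void main(String[] args) throws IOException {
        System.out.print(buildRequest("www", "/"));
        if (args.length > 0) {                      // pass a host name to do a live fetch
            String host = args[0];
            try (Socket socket = new Socket(host, 80)) {
                socket.getOutputStream().write(
                    buildRequest(host, "/").getBytes(StandardCharsets.US_ASCII));
                BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.US_ASCII));
                System.out.println(in.readLine()); // status line, e.g. "HTTP/1.0 200 ..."
            }
        }
    }
}
```

The status line and header fields read from the socket are exactly what Part 1 asks the robot to log to disk for every request.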