Web- and Multimedia-based Information Systems
Lecture 2
TRANSCRIPT
Vector Model
Non-binary weights
Degree of similarity
Result ranking possible
Fast & good results
Vector Model
Document: vector with a weight for every index term
Query: vector with a weight for every index term
Vectors have the dimension of the total number of index terms in the collection
Documents in Vector Space
[Figure: documents D1–D11 plotted as points in a vector space spanned by the term axes t1, t2, t3]
Vector Model
Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t
The weight of the term is stored in each position; a weight of 0 means the term is absent

$$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$$
$$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$$
Vector Model
Cosine of the angle between the vectors is taken as the similarity measure
Sorting/ranking of results
Threshold for results
More precise answers: the most relevant docs appear at the top
Similarity Function

$$\mathrm{sim}(D_i, D_j) = \cos(\vec{d}_i, \vec{d}_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{|\vec{d}_i| \, |\vec{d}_j|}$$

$$\mathrm{sim}(D_i, D_j) = \frac{\sum_{k=1}^{t} w_{ik} \, w_{jk}}{\sqrt{\sum_{k=1}^{t} w_{ik}^2} \, \sqrt{\sum_{k=1}^{t} w_{jk}^2}}$$
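The cosine similarity measure above can be sketched in Java (the assignment language). The array layout — one weight per index term, as in the vector definitions — is assumed:

```java
// Cosine similarity between two term-weight vectors:
// sim = (di . dj) / (|di| * |dj|)
public class CosineSimilarity {
    static double sim(double[] di, double[] dj) {
        double dot = 0, normI = 0, normJ = 0;
        for (int k = 0; k < di.length; k++) {
            dot += di[k] * dj[k];       // sum of w_ik * w_jk
            normI += di[k] * di[k];
            normJ += dj[k] * dj[k];
        }
        if (normI == 0 || normJ == 0) return 0; // all-zero vector: define sim as 0
        return dot / (Math.sqrt(normI) * Math.sqrt(normJ));
    }

    public static void main(String[] args) {
        double[] d1 = {2, 0, 3};  // raw term weights of D1 from the table below
        double[] d5 = {1, 6, 3};  // raw term weights of D5
        System.out.println(sim(d1, d1)); // identical vectors -> 1.0
        System.out.println(sim(d1, d5));
    }
}
```

Identical vectors yield 1.0, orthogonal vectors (no shared terms) yield 0.0, so results can be ranked by decreasing similarity.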
Vector Model: Index Term Weighting
Binary weights
Raw term weights
Term frequency x inverse document frequency
Binary Weights
Only the presence (1) or absence (0) of a term is included in the vector
docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1
Raw Term Weights
The frequency of occurrence for the term in each document is included in the vector
docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
Term frequency x Inverse document frequency

$$w_{ik} = tf_{ik} \cdot \log(N / n_k)$$

– $T_k$: term $k$ in document $D_i$
– $tf_{ik}$: frequency of term $T_k$ in document $D_i$
– $idf_k$: inverse document frequency of term $T_k$ in collection $C$, with $idf_k = \log(N / n_k)$
– $N$: total number of documents in the collection $C$
– $n_k$: number of documents in $C$ that contain $T_k$

Term frequency normalized by the most frequent term in the document:

$$tf_{ik} = \frac{freq_{ik}}{\max_{l} freq_{il}}$$
IDF Example
IDF provides high values for rare words and low values for common words
With N = 10000 documents in the collection:

$$idf = \log\frac{10000}{1} = 4$$
$$idf = \log\frac{10000}{20} = 2.698$$
$$idf = \log\frac{10000}{5000} = 0.301$$
$$idf = \log\frac{10000}{10000} = 0$$
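The weighting scheme above can be sketched in Java; the base-10 logarithm is assumed, since it reproduces the example values:

```java
// tf-idf weighting: w_ik = tf_ik * log10(N / n_k)
public class TfIdf {
    // idf_k = log10(N / n_k); base-10 log matches the IDF example above
    static double idf(int N, int nk) {
        return Math.log10((double) N / nk);
    }

    // w_ik = tf_ik * idf_k, with tf normalized by the most frequent
    // term of the document (maxFreq)
    static double weight(int freq, int maxFreq, int N, int nk) {
        double tf = (double) freq / maxFreq;
        return tf * idf(N, nk);
    }

    public static void main(String[] args) {
        int N = 10000;
        System.out.println(idf(N, 1));     // 4.0  (rare term)
        System.out.println(idf(N, 20));    // ~2.699
        System.out.println(idf(N, 5000));  // ~0.301
        System.out.println(idf(N, N));     // 0.0  (term in every document)
    }
}
```

As the example shows, a term occurring in every document gets weight 0 and thus cannot discriminate between documents.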
Probabilistic Model
Based on probability theory
For every document, a probability is calculated for:
– the document being relevant to the query
– the document being irrelevant to the query
Documents that are more likely relevant than irrelevant are ranked in decreasing order of relevance
Text Operations in Detail
Goal: automated generation of index terms
Trade-off: all terms conveying meaning vs. space requirements
Rules for extraction from documents:
– Rules for division of terms: punctuation, dashes
– List of stop words: articles, prepositions, conjunctions
Word-oriented Reduction Schemes
Lemmatisation: smaller term lists, generalization of terms
Methods:
– Reduction to the infinitive
– Reduction to a stem
Algorithmic methods work for English; for German:
– Biggest problems: prefixes & compounds
– Only possible with dictionaries: explicit listing of all forms, or rules to derive forms
Stemming
Different methods; most efficient: affix removal
– Porter algorithm
– To be implemented later
– Series of rules to strip suffixes, e.g. s -> nil, sses -> ss
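The suffix rules above can be sketched in Java. The slide shows two of the four rules of Porter's step 1a; the other two (ies -> i, ss -> ss) are added here so that words like "caress" are not over-stripped — a sketch, not the full Porter algorithm:

```java
// Suffix stripping in the spirit of the Porter algorithm (step 1a only).
public class Stemmer {
    static String step1a(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2); // sses -> ss
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2); // ies  -> i
        if (w.endsWith("ss"))   return w;                              // ss   -> ss
        if (w.endsWith("s"))    return w.substring(0, w.length() - 1); // s    -> nil
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step1a("caresses")); // caress
        System.out.println(step1a("ponies"));   // poni
        System.out.println(step1a("caress"));   // caress
        System.out.println(step1a("cats"));     // cat
    }
}
```

Rule order matters: the longest matching suffix must be tried first, otherwise "caresses" would lose only its final "s".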
Word Type Index Term Selection
Nouns usually convey the most meaning
Elimination of other word types
Clustering of compounds (e.g. "computer science"):
– Noun groups
– Maximum distance between terms
Thesauri
„Treasury of words“
For every entry:
– Definition
– Synonyms
Useful within a specific knowledge domain where a controlled vocabulary can easily be obtained
Difficult with a large and dynamic document collection such as the web
Creation of Inverted List
1. Create vocabulary
2. Note document and position in document for each term
3. Sort the list (first by terms, then by positions)
4. Split terms & positions
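The inverted-list construction sketched above can be written compactly in Java. Sorted maps stand in for the explicit sort step, and whitespace tokenization is a simplifying assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Inverted list: term -> (docId -> positions of the term in that doc).
// TreeMap keeps terms (and doc ids) sorted, replacing the explicit
// "sort by terms, then by positions" step; positions are appended in
// scan order and are therefore already sorted.
public class InvertedIndex {
    static Map<String, TreeMap<Integer, List<Integer>>> build(List<String> docs) {
        Map<String, TreeMap<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).toLowerCase().split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                index.computeIfAbsent(terms[pos], t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        var index = build(List.of("web information systems",
                                  "multimedia information"));
        System.out.println(index.get("information")); // {0=[1], 1=[1]}
    }
}
```

Keeping positions (not just document ids) is what later enables phrase and proximity queries.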
Basic Query
1. Isolate the terms of the query
2. Get the pointer to the positions for every term
3. Conduct set operations
4. Get the result documents and present them
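A minimal Java sketch of the set-operations step, assuming the posting lists have already been reduced to document-ID sets; an AND query is then a set intersection:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// AND query: intersect the document sets of all query terms.
public class BasicQuery {
    static Set<Integer> andQuery(Map<String, Set<Integer>> postings,
                                 String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(term, Set.of());
            if (result == null) result = new TreeSet<>(docs); // first term
            else result.retainAll(docs);                      // set intersection
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = Map.of(
            "web",         Set.of(1, 3, 5),
            "information", Set.of(1, 2, 5),
            "systems",     Set.of(2, 5));
        System.out.println(andQuery(postings, "web", "information")); // [1, 5]
    }
}
```

OR and NOT queries follow the same pattern with union and set difference in place of the intersection.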
Advanced Query Functionality
Comparison operators for metadata
Strings of multiple terms
More general: take distance and order of terms into account
Truncation (wildcards)
Information Retrieval System Evaluation
Functionality analysis
Performance:
– Time
– Space
Retrieval performance:
– Batch vs. interactive mode
Retrieval Performance Measures
Recall:
– The fraction of the relevant documents that have been retrieved
Precision:
– The fraction of the retrieved documents that are relevant
Precision vs. Recall
The user does not usually inspect all results
Example: relevant documents R = {d2, d5}
Result ranking returned by the system: 1. d1, 2. d5, 3. d2
After the second result, recall is 50% (1 of 2 relevant documents retrieved) and precision is also 50% (1 of 2 retrieved documents relevant)
After the third result, recall is 100% and precision is 67%
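Recall and precision at a cutoff k follow directly from the definitions; a Java sketch reproducing the example numbers:

```java
import java.util.List;
import java.util.Set;

// Recall/precision after inspecting the top k results of a ranking.
public class Evaluation {
    static long hits(List<String> ranking, Set<String> relevant, int k) {
        return ranking.subList(0, k).stream().filter(relevant::contains).count();
    }

    // fraction of the relevant documents retrieved within the top k
    static double recallAt(List<String> ranking, Set<String> relevant, int k) {
        return (double) hits(ranking, relevant, k) / relevant.size();
    }

    // fraction of the top-k retrieved documents that are relevant
    static double precisionAt(List<String> ranking, Set<String> relevant, int k) {
        return (double) hits(ranking, relevant, k) / k;
    }

    public static void main(String[] args) {
        List<String> ranking = List.of("d1", "d5", "d2"); // system output
        Set<String> relevant = Set.of("d2", "d5");        // ground truth R
        System.out.println(recallAt(ranking, relevant, 2));    // 0.5
        System.out.println(precisionAt(ranking, relevant, 2)); // 0.5
        System.out.println(recallAt(ranking, relevant, 3));    // 1.0
        System.out.println(precisionAt(ranking, relevant, 3)); // ~0.667
    }
}
```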
Programming Assignment
Different part each week
Goal: a web search engine
WWW Search Engine
[Architecture diagram: a WWW client sends a query to the search engine and receives a result list; the search engine answers queries from the index; the robot sends requests to WWW servers and receives files; the retrieved documents go to the indexer, which stores the index in a DB]
Assignment Part 1
Program a web robot:
– Starts at a user-defined URL
– Navigates the web via hypertext links
– Speaks HTTP (see RFC 1945)
– Stores the path it took (URLs), preferably in a tree-like data structure
– Stores the result code & important header fields for every request to disk, in a format suitable for further processing
Assignment Part 1 (cont.)
– Implementation in Java
– Pure TCP socket communication
– No need to save documents in this assignment
– The robot shall identify itself via the HTTP User-Agent header
– Extensibility required for future assignments
Example HTTP session
telnet www 80                    <- open TCP connection
GET / HTTP/1.0                   <- HTTP request, terminated by an empty line (<CRLF>)

HTTP/1.0 200 Document follows    <- response status line and headers
Date: Tue, 10 Sep 1996 14:34:06 GMT
Server: NCSA/1.4.2
Content-type: image/gif
Last-modified: Tue, 10 Sep 1996 13:25:26 GMT
Content-length: 9755
                                 <- empty line (<CRLF>), then start of content
<HTML>
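The session above can be sketched as a Java robot using a raw TCP socket, as the assignment requires. The User-Agent value is a placeholder, and a real robot would go on to parse status line, headers, and body:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Minimal HTTP/1.0 exchange over a plain TCP socket (no java.net.URL helpers).
public class MiniRobot {
    // Build an HTTP/1.0 GET request; headers end with an empty CRLF line.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "User-Agent: lecture2-robot/0.1\r\n" // robot identifies itself
             + "\r\n";                              // blank line ends the headers
    }

    public static void main(String[] args) throws IOException {
        System.out.print(buildRequest("www", "/"));
        if (args.length > 0) {                      // pass a host name to do a live fetch
            String host = args[0];
            try (Socket socket = new Socket(host, 80)) {
                socket.getOutputStream().write(
                    buildRequest(host, "/").getBytes(StandardCharsets.US_ASCII));
                BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.US_ASCII));
                System.out.println(in.readLine()); // status line, e.g. "HTTP/1.0 200 ..."
            }
        }
    }
}
```

The status line and header fields read from the socket are exactly what Part 1 asks the robot to log to disk for every request.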