Introduction to Digital Libraries
Information Retrieval
Sample Statistics of Text Collections
• Dialog: claims to have >12 terabytes of data in >600 databases, >800 million unique records
• LEXIS/NEXIS: claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases; >200,000 searches per day; 9 mainframes, 300 Unix servers, 200 NT servers
Information Retrieval
• Motivation
– the larger the holdings of the archive, the more useful it is
– however, it is harder to find what you want
Simple IR Model
[Diagram: the user poses a Query that goes through Pre-Processing (stemming, stoplist) and Searching (Boolean, vector) against Storage (flat files, inverted files, signature files, PAT trees) built by Collection & Processing (stemming, thesaurus, signature); Post-Processing (ranking, clustering, weighting, feedback) produces the Results returned to the user.]
IR problem
• In libraries:
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>
• External attributes and internal attributes (content)
• Search by external attributes = search in a DB
• IR: search by content
Basic concepts
• Document is described by a set of representative keywords (index terms)
• Keywords may have binary weights or weights calculated from statistics of their frequency in text
• Retrieval is a ‘matching’ process between document keywords and words in queries
IR Outline
• Index storage
– flat files, inverted files, signature files, PAT trees
• Processing
– stemming, stop-words
• Searching & queries
– Boolean, vector (including ranking, weighting, feedback)
• Results
– clustering
Flat Files Index
• Simple files, no additional processing or storage needed
• Worst case keyword search time: O(DW)
– D = number of documents
– W = number of words per document
– linear search
• Clearly only acceptable for small collections
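A minimal sketch of what flat-file search amounts to (the `documents` dict and the `flat_file_search` helper are illustrative assumptions, not part of any particular system): every query re-scans every word of every document, which is where the O(DW) bound comes from.

```python
def flat_file_search(documents, keyword):
    """Linear scan over every word of every document: O(D * W).

    `documents` maps a document id to its raw text (toy data).
    Returns the ids of documents containing `keyword`.
    """
    hits = []
    for doc_id, text in documents.items():      # D documents
        for word in text.lower().split():       # W words per document
            if word == keyword.lower():
                hits.append(doc_id)
                break
    return hits

# Example usage with toy data
docs = {1: "parallel computer systems", 2: "distributed computing system"}
print(flat_file_search(docs, "system"))   # -> [2]
```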
Inverted Files
• All input files are read, and a list of which words appear in what documents (records) is made
• Extra space required can be up to 100% of original input files
• Worst case keyword search time is now O(log(DW))
• Almost all indexing systems in popular usage use inverted files
Sample Inverted File
Term         Record   Frequency
computer     1        3
computer     3        5
computing    2        1
distributed  2        1
parallel     1        2
system       2        1
...          ...      ...
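A minimal sketch of how such (term, record, frequency) postings could be produced; the toy documents and the whitespace tokenization are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def build_inverted_file(documents):
    """Map each term to a list of (record id, frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())       # term frequencies in this record
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return index

docs = {
    1: "parallel computer computer computer parallel",
    2: "distributed computing system",
    3: "computer computer computer computer computer",
}
index = build_inverted_file(docs)
print(index["computer"])   # -> [(1, 3), (3, 5)]
```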
Structure of inverted index
• May be a hierarchical set of addresses, e.g.
word number within sentence number, within paragraph number, within chapter number, within volume number, within document number
• Consider as a vector (d,v,c,p,s,w)
Inverted File Index
Store appearance of terms in documents (like index of a book)
alphabet        (15,42); (26,186); (31,86)
database        (41,10)
index           (15,76); (51,164); (76,641); (81,64)
information     (16,76)
retrieval       (16,88)
semistructured  (5,61); (15,174); (25,41)
XML             (1,108); (2,65); (15,741); (21,421)
XPath           (5,90); (21,301)
(document-ID,position in the doc)
• Answers queries like "xml and index" or "information near retrieval"
• But: not suitable for evaluating path expressions
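A minimal sketch, under the assumption of whitespace tokenization and an arbitrary 3-word window, of how (document-ID, position) postings can answer a "near" query; the function names and data are hypothetical.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each term to a list of (doc_id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

def near(index, a, b, window=3):
    """Documents where term `a` occurs within `window` words of term `b`."""
    hits = set()
    for doc_a, pos_a in index.get(a, []):
        for doc_b, pos_b in index.get(b, []):
            if doc_a == doc_b and abs(pos_a - pos_b) <= window:
                hits.add(doc_a)
    return hits

docs = {16: "notes on information storage and retrieval systems",
        41: "relational database design"}
idx = build_positional_index(docs)
print(near(idx, "information", "retrieval"))   # -> {16}
```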
An Inverted File
• Search for
– "databases"
– "microsoft"

term         docURL
data         http://www-inst.eecs.berkeley.edu/~cs186
database     http://www-inst.eecs.berkeley.edu/~cs186
date         http://www-inst.eecs.berkeley.edu/~cs186
day          http://www-inst.eecs.berkeley.edu/~cs186
dbms         http://www-inst.eecs.berkeley.edu/~cs186
decision     http://www-inst.eecs.berkeley.edu/~cs186
demonstrate  http://www-inst.eecs.berkeley.edu/~cs186
description  http://www-inst.eecs.berkeley.edu/~cs186
design       http://www-inst.eecs.berkeley.edu/~cs186
desire       http://www-inst.eecs.berkeley.edu/~cs186
developer    http://www.microsoft.com
differ       http://www-inst.eecs.berkeley.edu/~cs186
disability   http://www.microsoft.com
discussion   http://www-inst.eecs.berkeley.edu/~cs186
division     http://www-inst.eecs.berkeley.edu/~cs186
do           http://www-inst.eecs.berkeley.edu/~cs186
document     http://www-inst.eecs.berkeley.edu/~cs186
Other indexing structures
• Signature files
– Each document has an associated signature, generated by hashing each term it contains
– Leads to possible matches; further processing is needed to resolve them
• Bitmaps
– One-to-one hash function; each distinct term in the collection has a bit vector with one bit for each document
– Special case of a signature file; storage expensive
Signature Files
• Signature size: the number of bits in a signature, F.
• Word signature: a bit pattern of size F with exactly m bits set to 1 and the others 0.
• Block: a sequence of text that contains D distinct words.
• Block signature: the logical OR of all the word signatures in a block of text.
Signature File
• Each document is divided into "logical blocks": pieces of text that contain a constant number D of distinct, non-common words
• Each word yields a "word signature": a bit pattern of size F, with m bits set to 1 and the rest to 0
– F and m are design parameters
Sample Signature File

Word              Signature
free              001 000 110 010
text              000 010 101 001
block signature   001 010 111 011
(D = 2, F = 12, m = 4)

Word              Signature
data              0000 0000 0000 0010 0000
base              0000 0001 0000 0000 0000
management        0000 1000 0000 0000 0000
system            0000 0000 0000 0000 1000
block signature   0000 1001 0000 0010 1000
(D = 4, F = 20, m = 1)
Signature File
• Searching
– Examine each block signature for 1s in those bit positions where the signature of the search word has a 1
• False drop
– The probability that the signature test will "fail", creating a "false hit" or "false drop"
– A word's signature may match the block signature even though the word is not in the block: this is a false hit
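A minimal sketch of the signature-file idea (the hashing scheme and the choices F = 16, m = 3 are arbitrary assumptions for illustration): word signatures are ORed into a block signature, and a query word is tested by checking that all of its 1 bits appear in the block signature, which admits false drops but never misses a word that is actually in the block.

```python
import hashlib

F = 16   # signature size in bits (design parameter)
M = 3    # bits set per word signature (design parameter)

def word_signature(word):
    """Set up to M pseudo-random bit positions derived from the word (collisions may overlap)."""
    sig = 0
    for i in range(M):
        h = hashlib.md5(f"{word}:{i}".encode()).digest()
        bit = int.from_bytes(h[:4], "big") % F
        sig |= 1 << bit
    return sig

def block_signature(words):
    """Logical OR of the word signatures in the block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(block_sig, word):
    """True = the word *may* be in the block (false drops possible);
    False = the word is definitely absent."""
    ws = word_signature(word)
    return (block_sig & ws) == ws

block = ["data", "base", "management", "system"]
sig = block_signature(block)
print(maybe_contains(sig, "data"))       # True  (a word in the block always passes)
print(maybe_contains(sig, "retrieval"))  # usually False; True here would be a false drop
```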
Sistrings
• Original text:
“The traditional approach for searching a regular expression…”
• Sistrings:
1. “The traditional approach for searching …”
2. “he traditional approach for searching a…”
3. “e traditional approach for searching a …”
4. “onal approach for searching a regular …”
Sistrings
• Once upon a time, in a far away land ...
– sistring 1: Once upon a time ...
– sistring 2: nce upon a time ...
– sistring 8: on a time, in a ...
– sistring 11: a time, in a far ...
– sistring 22: a far away land ...
PAT Trees
• PAT Tree:
– a Patricia tree constructed over all the possible sistrings of a document
– bits of the key decide the branching:
• 0 branches to the left subtree
• 1 branches to the right subtree
• an internal node decides which bit of the key to use
• at a leaf node, check any skipped bits
• PAT (Suffix) tree of a string S is a compacted trie that represents all substrings of S or semi-infinite string (sistring).
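A minimal sketch of the bit-level branching idea using a plain binary trie (a real PAT/PATRICIA tree additionally collapses single-child chains and records which bit each internal node inspects); the three-bit keys are the same ones used in the example that follows.

```python
def insert(trie, bits):
    """Insert a bit-string key into a plain binary trie.
    Each node is a dict with optional '0'/'1' children and a 'key' marker."""
    node = trie
    for b in bits:                   # 0 -> left subtree, 1 -> right subtree
        node = node.setdefault(b, {})
    node["key"] = bits

def lookup_prefix(trie, bits):
    """Return all keys in the trie that start with the given bit prefix."""
    node = trie
    for b in bits:
        if b not in node:
            return []
        node = node[b]
    found, stack = [], [node]        # collect every key below this node
    while stack:
        n = stack.pop()
        if "key" in n:
            found.append(n["key"])
        stack.extend(child for k, child in n.items() if k in ("0", "1"))
    return found

trie = {}
for key in ("010", "011", "101"):
    insert(trie, key)
print(lookup_prefix(trie, "01"))     # -> ['011', '010'] (both keys with prefix 01)
```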
PATRICIA TREE
• A particular type of “trie”
• Example: a trie and a PATRICIA tree over the keys ‘010’, ‘011’, and ‘101’.
[Figure: the binary trie (levels Lv0–Lv2) for the keys 010, 011, and 101, and the corresponding PATRICIA tree in which single-child chains are collapsed so only levels Lv0 and Lv2 remain]
PAT Tree
[Figure: a PAT tree over the text 01100100010111… (positions 1–9 shown); sistrings 1–8 are already indexed, leaves hold sistring positions, internal nodes give the bit position to check, and the query 00101 is traced through the tree]
Try to build the Patricia tree
A 00001
S 10011
E 00101
R 10010
C 00011
H 01000
I 01001
N 01110
G 00111
X 11000
M 01101
P 10000
PAT Tree
[Figure: the resulting Patricia tree over the twelve keys, with leaves A, C, E, G, H, I, M, N, P, R, S, X]
Example
Text        01100100010111 …
sistring 1  01100100010111 …
sistring 2  1100100010111 …
sistring 3  100100010111 …
sistring 4  00100010111 …
sistring 5  0100010111 …
sistring 6  100010111 …
sistring 7  00010111 …
sistring 8  0010111 ...
[Figure: step-by-step construction of the Patricia tree as sistrings 1–8 are inserted; external nodes hold the sistring (integer displacement), internal nodes hold the skip counter & pointer and the total displacement of the bit to be inspected]
SISTRING
• Bit level is too abstract and application dependent; we rarely apply this at the bit level. Character level is a better idea!
– e.g. CUHK
– The corresponding sistrings would be:
• CUHK000…
• UHK000…
• HK000…
• K000…
– We require each sistring to be at least 4 characters long.
– (Why do we pad 0/NULL at the end of each sistring?)
SISTRING (USAGE)
• We may instead store the sistrings of ‘CUHK’, which requires O(n²) storage.
– CUHK <- represents C, CU, CUH, CUHK at the same time
– UHK0 <- represents U, UH, UHK at the same time
– HK00 <- represents H, HK at the same time
– K000 <- represents K only
• A prefix match on sistrings is equivalent to an exact match on the substrings.
• Conclusion: sistrings are a better representation for storing substring information.
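A minimal sketch of the "prefix match on sistrings = exact match on substrings" idea. Instead of a tree it uses a sorted array of sistring start positions (a suffix-array-style structure, which is an assumption of this sketch, not the PAT tree construction itself); only O(n) positions are stored because each sistring is just an offset into the single stored text.

```python
def sistring_index(text):
    """All sistring start positions, sorted by the sistring text: O(n) positions stored."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, index, pattern):
    """Exact substring search = prefix match on some sistring (binary search)."""
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        sistring = text[index[mid]:]
        if sistring.startswith(pattern):
            return True                  # prefix match found
        if sistring < pattern:
            lo = mid + 1                 # any match lies to the right
        else:
            hi = mid                     # any match lies to the left
    return False

text = "CUHK"
idx = sistring_index(text)               # positions of CUHK, HK, K, UHK in sorted order
print(contains(text, idx, "UH"))         # True  ("UH" is a prefix of sistring "UHK")
print(contains(text, idx, "HU"))         # False
```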
PAT Tree (Example)
• By digitizing the string, we can manually visualize what the PAT Tree will look like.
• The actual bit patterns of the four sistrings are:
CUHK 01000011 01010101 01001000 01001011
UHK0 01010101 01001000 01001011 00000000
HK00 01001000 01001011 00000000 00000000
K000 01001011 00000000 00000000 00000000
[Figure: PAT tree over the four sistrings; the root inspects bit 3 (UHK0 branches right), then bit 4 separates CUHK from HK00/K000, then bit 6 separates HK00 from K000; 0 branches left, 1 branches right]
PAT Tree (Example)
• This works! BUT…
– We still need O(n²) memory for storing those sistrings
• We may reduce the memory to O(n) by making use of pointers.

Hello This document is simple   01001000 …
This document is simple         01010100 …
document is simple              01100100 …
is simple                       01101001 …
simple                          01110011 …
PAT Tree of a REAL (but very simple) document
[Figure: PAT tree of the document "Hello. This document is simple.", branching on bits 2, 3, and 4 of the word-level sistrings, with leaves pointing back into the document text]
Space/Time Tradeoffs
[Figure: space vs. time trade-off comparing flat files, inverted files, signature files, and PAT trees]
Stemming
• Reason:
– Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
• Stemming:
– Removing some endings of words
computer, compute, computes, computing, computed, computation → comput
Inverted File, Stemmed
Term       Record   Frequency
comput     1        3
comput     3        5
comput     2        1
distribut  2        1
parallel   1        2
system     2        1
...        ...      ...
Stemming
• am, are, is → be
• car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be differ color
Stemming
• Manual or Automatic
• Can reduce index file size by up to 50%
• Effectiveness studies of stemming are mixed, but in general it has either no effect or a positive effect when measuring both precision and recall
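A minimal suffix-stripping sketch (not the Porter algorithm; the suffix list and minimum stem length are illustrative assumptions) showing how the compute/computer/computation family conflates to the stem "comput".

```python
# Illustrative suffix list, ordered so longer suffixes are tried first
SUFFIXES = ["ational", "ation", "ing", "ers", "er", "ed", "es", "e", "s"]

def crude_stem(word):
    """Strip the longest matching suffix, keeping a stem of at least 3 characters.
    A stand-in for a real stemmer such as Porter's."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["computer", "compute", "computes", "computing", "computed", "computation"]:
    print(w, "->", crude_stem(w))   # every form maps to "comput"
```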
Stopwords
• Stopwords exist in stoplists or negative dictionaries
• Idea: remove words with low semantic content
– the index should only have “important stuff”
• What not to index is domain dependent, but often includes:
– “small” words: a, and, the, but, of, an, very, etc.
– case is removed
– punctuation
Stop words
• Very common words that have no discriminatory power
• e.g. in Arabic: من (from), إلى (to), في (in), …
Normalization
• Token normalization
– Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– U.S.A vs USA
– Anti-discriminatory vs antidiscriminatory
– Car vs automobile?
Capitalization/case folding
• Good for
– Allowing instances of Automobile at the beginning of a sentence to match a query of automobile
– Helping a search engine when most users type ferrari when they are interested in a Ferrari car
• Bad for
– Proper names vs common nouns: General Motors, Associated Press, Black
• Heuristic solution: lowercase only words at the beginning of the sentence; true casing via machine learning
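A minimal tokenization sketch combining case folding, crude punctuation stripping, and stoplist filtering; the stoplist here is a tiny illustrative sample, not a real negative dictionary, and no true casing is attempted.

```python
# Tiny illustrative stoplist (a real negative dictionary is much larger)
STOPLIST = {"a", "an", "and", "are", "at", "be", "but", "of", "the", "very"}

def tokenize(text):
    """Lowercase (case folding), strip surrounding punctuation, drop stopwords."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()").lower()   # crude punctuation removal + case folding
        if token and token not in STOPLIST:
            tokens.append(token)
    return tokens

print(tokenize("The boy's cars are very different colors."))
# -> ["boy's", 'cars', 'different', 'colors']
```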
Performance of search
• 3 major classes of performance measures
– precision / recall
• TREC conference series, http://trec.nist.gov/
– space / time
• see Esler & Nelson, JNCA for an example
• http://techreports.larc.nasa.gov/ltrs/PDF/1997/jp/NASA-97-jnca-sle.pdf
– usability
• probably the most important measure, but largely ignored
Precision and Recall
• Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved)
• Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the database)
Standard Evaluation Measures
Starts with a CONTINGENCY table:

                retrieved    not retrieved
relevant            w              x          n1 = w + x
not relevant        y              z
                n2 = w + y                         N
Precision and Recall
Recall = w / (w + x)
From all the documents that are relevant out there, how many did the IR system retrieve?

Precision = w / (w + y)
From all the documents that are retrieved by the IR system, how many are relevant?
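A minimal sketch of computing precision and recall from sets of retrieved and relevant document ids, using the w / (w + y) and w / (w + x) definitions above; the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    w = len(retrieved & relevant)                            # relevant AND retrieved
    precision = w / len(retrieved) if retrieved else 0.0     # w / (w + y)
    recall = w / len(relevant) if relevant else 0.0          # w / (w + x)
    return precision, recall

# Toy example: 4 documents retrieved, 5 documents are actually relevant, 2 overlap
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)   # -> 0.5 0.4
```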
User-Centered IR Evaluation
• More user-oriented measures
– Satisfaction, informativeness
• Other types of measures
– Time, cost-benefit, error rate, task analysis
• Evaluation of user characteristics
• Evaluation of interface
• Evaluation of process or interaction