Introduction to Digital Libraries
Information Retrieval
Sample Statistics of Text Collections
• Dialog: claims to have >12 terabytes of data in >600 databases, >800 million unique records
• LEXIS/NEXIS: claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases; >200,000 searches per day; 9 mainframes, 300 Unix servers, 200 NT servers
Information Retrieval
• Motivation
– the larger the holdings of the archive, the more useful it is
– however, it is harder to find what you want
Simple IR Model
[Diagram: the user poses a Query that goes through Pre-Processing (stemming, stoplist) and Searching (Boolean, vector) against Storage (flat files, inverted files, signature files, PAT trees) built by Collection & Processing (stemming, thesaurus, signature); Post-Processing (ranking, clustering, weighting, feedback) produces the Results returned to the user.]
IR problem
• In libraries:
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>
• External attributes and internal attributes (content)
• Search by external attributes = search in a DB
• IR: search by content
Basic concepts
• Document is described by a set of representative keywords (index terms)
• Keywords may have binary weights or weights calculated from statistics of their frequency in text
• Retrieval is a ‘matching’ process between document keywords and words in queries
IR Outline
• Index storage
– flat files, inverted files, signature files, PAT trees
• Processing
– stemming, stop-words
• Searching & queries
– Boolean, vector (including ranking, weighting, feedback)
• Results
– clustering
Flat Files Index
• Simple files, no additional processing or storage needed
• Worst case keyword search time: O(DW)
– D = number of documents
– W = number of words per document
– linear search
• Clearly only acceptable for small collections
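A minimal sketch of what flat-file search amounts to (the `documents` dict and the `flat_file_search` helper are illustrative assumptions, not part of any particular system): every query re-scans every word of every document, which is where the O(DW) bound comes from.

```python
def flat_file_search(documents, keyword):
    """Linear scan over every word of every document: O(D * W).

    `documents` maps a document id to its raw text (toy data).
    Returns the ids of documents containing `keyword`.
    """
    hits = []
    for doc_id, text in documents.items():      # D documents
        for word in text.lower().split():       # W words per document
            if word == keyword.lower():
                hits.append(doc_id)
                break
    return hits

# Example usage with toy data
docs = {1: "parallel computer systems", 2: "distributed computing system"}
print(flat_file_search(docs, "system"))   # -> [2]
```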
Inverted Files
• All input files are read, and a list of which words appear in what documents (records) is made
• Extra space required can be up to 100% of original input files
• Worst case keyword search time is now O(log(DW))
• Almost all indexing systems in popular usage use inverted files
Sample Inverted File
Term         Record   Frequency
computer     1        3
computer     3        5
computing    2        1
distributed  2        1
parallel     1        2
system       2        1
...          ...      ...
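A minimal sketch of how such (term, record, frequency) postings could be produced; the toy documents and the whitespace tokenization are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def build_inverted_file(documents):
    """Map each term to a list of (record id, frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())       # term frequencies in this record
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return index

docs = {
    1: "parallel computer computer computer parallel",
    2: "distributed computing system",
    3: "computer computer computer computer computer",
}
index = build_inverted_file(docs)
print(index["computer"])   # -> [(1, 3), (3, 5)]
```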
Structure of inverted index
• May be a hierarchical set of addresses, e.g.
word number within sentence number, within paragraph number, within chapter number, within volume number, within document number
• Consider as a vector (d,v,c,p,s,w)
Inverted File Index
Store appearance of terms in documents (like index of a book)
alphabet        (15,42); (26,186); (31,86)
database        (41,10)
index           (15,76); (51,164); (76,641); (81,64)
information     (16,76)
retrieval       (16,88)
semistructured  (5,61); (15,174); (25,41)
XML             (1,108); (2,65); (15,741); (21,421)
XPath           (5,90); (21,301)
(document-ID,position in the doc)
• Answers queries like "xml and index" or "information near retrieval"
• But: not suitable for evaluating path expressions
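A minimal sketch, under the assumption of whitespace tokenization and an arbitrary 3-word window, of how (document-ID, position) postings can answer a "near" query; the function names and data are hypothetical.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each term to a list of (doc_id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

def near(index, a, b, window=3):
    """Documents where term `a` occurs within `window` words of term `b`."""
    hits = set()
    for doc_a, pos_a in index.get(a, []):
        for doc_b, pos_b in index.get(b, []):
            if doc_a == doc_b and abs(pos_a - pos_b) <= window:
                hits.add(doc_a)
    return hits

docs = {16: "notes on information storage and retrieval systems",
        41: "relational database design"}
idx = build_positional_index(docs)
print(near(idx, "information", "retrieval"))   # -> {16}
```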
An Inverted File
• Search for
– "databases"
– "microsoft"

term         docURL
data         http://www-inst.eecs.berkeley.edu/~cs186
database     http://www-inst.eecs.berkeley.edu/~cs186
date         http://www-inst.eecs.berkeley.edu/~cs186
day          http://www-inst.eecs.berkeley.edu/~cs186
dbms         http://www-inst.eecs.berkeley.edu/~cs186
decision     http://www-inst.eecs.berkeley.edu/~cs186
demonstrate  http://www-inst.eecs.berkeley.edu/~cs186
description  http://www-inst.eecs.berkeley.edu/~cs186
design       http://www-inst.eecs.berkeley.edu/~cs186
desire       http://www-inst.eecs.berkeley.edu/~cs186
developer    http://www.microsoft.com
differ       http://www-inst.eecs.berkeley.edu/~cs186
disability   http://www.microsoft.com
discussion   http://www-inst.eecs.berkeley.edu/~cs186
division     http://www-inst.eecs.berkeley.edu/~cs186
do           http://www-inst.eecs.berkeley.edu/~cs186
document     http://www-inst.eecs.berkeley.edu/~cs186
Other indexing structures
• Signature files
– Each document has an associated signature, generated by hashing each term it contains
– Leads to possible matches; further processing is needed to resolve them
• Bitmaps
– One-to-one hash function; each distinct term in the collection has a bit vector with one bit for each document
– Special case of a signature file; storage expensive
Signature Files
• Signature size: the number of bits in a signature, F.
• Word signature: a bit pattern of size F with exactly m bits set to 1 and the others 0.
• Block: a sequence of text that contains D distinct words.
• Block signature: the logical OR of all the word signatures in a block of text.
Signature File
• Each document is divided into "logical blocks": pieces of text that contain a constant number D of distinct, non-common words
• Each word yields a "word signature": a bit pattern of size F, with m bits set to 1 and the rest to 0
– F and m are design parameters
Sample Signature File

Word              Signature
free              001 000 110 010
text              000 010 101 001
block signature   001 010 111 011
(D = 2, F = 12, m = 4)

Word              Signature
data              0000 0000 0000 0010 0000
base              0000 0001 0000 0000 0000
management        0000 1000 0000 0000 0000
system            0000 0000 0000 0000 1000
block signature   0000 1001 0000 0010 1000
(D = 4, F = 20, m = 1)
Signature File
• Searching
– Examine each block signature for 1s in those bit positions where the signature of the search word has a 1
• False drop
– The probability that the signature test will "fail", creating a "false hit" or "false drop"
– A word's signature may match the block signature even though the word is not in the block: this is a false hit
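A minimal sketch of the signature-file idea (the hashing scheme and the choices F = 16, m = 3 are arbitrary assumptions for illustration): word signatures are ORed into a block signature, and a query word is tested by checking that all of its 1 bits appear in the block signature, which admits false drops but never misses a word that is actually in the block.

```python
import hashlib

F = 16   # signature size in bits (design parameter)
M = 3    # bits set per word signature (design parameter)

def word_signature(word):
    """Set up to M pseudo-random bit positions derived from the word (collisions may overlap)."""
    sig = 0
    for i in range(M):
        h = hashlib.md5(f"{word}:{i}".encode()).digest()
        bit = int.from_bytes(h[:4], "big") % F
        sig |= 1 << bit
    return sig

def block_signature(words):
    """Logical OR of the word signatures in the block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(block_sig, word):
    """True = the word *may* be in the block (false drops possible);
    False = the word is definitely absent."""
    ws = word_signature(word)
    return (block_sig & ws) == ws

block = ["data", "base", "management", "system"]
sig = block_signature(block)
print(maybe_contains(sig, "data"))       # True  (a word in the block always passes)
print(maybe_contains(sig, "retrieval"))  # usually False; True here would be a false drop
```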
Sistrings
• Original text:
“The traditional approach for searching a regular expression…”
• Sistrings:
1. “The traditional approach for searching …”
2. “he traditional approach for searching a…”
3. “e traditional approach for searching a …”
4. “onal approach for searching a regular …”
Sistrings
• Once upon a time, in a far away land ...
– sistring 1: Once upon a time ...
– sistring 2: nce upon a time ...
– sistring 8: on a time, in a ...
– sistring 11: a time, in a far ...
– sistring 22: a far away land ...
PAT Trees
• PAT Tree:
– a Patricia tree constructed over all the possible sistrings of a document
– bits of the key decide the branching:
• 0 branches to the left subtree
• 1 branches to the right subtree
• an internal node decides which bit of the key to use
• at a leaf node, check any skipped bits
• PAT (Suffix) tree of a string S is a compacted trie that represents all substrings of S or semi-infinite string (sistring).
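A minimal sketch of the bit-level branching idea using a plain binary trie (a real PAT/PATRICIA tree additionally collapses single-child chains and records which bit each internal node inspects); the three-bit keys are the same ones used in the example that follows.

```python
def insert(trie, bits):
    """Insert a bit-string key into a plain binary trie.
    Each node is a dict with optional '0'/'1' children and a 'key' marker."""
    node = trie
    for b in bits:                   # 0 -> left subtree, 1 -> right subtree
        node = node.setdefault(b, {})
    node["key"] = bits

def lookup_prefix(trie, bits):
    """Return all keys in the trie that start with the given bit prefix."""
    node = trie
    for b in bits:
        if b not in node:
            return []
        node = node[b]
    found, stack = [], [node]        # collect every key below this node
    while stack:
        n = stack.pop()
        if "key" in n:
            found.append(n["key"])
        stack.extend(child for k, child in n.items() if k in ("0", "1"))
    return found

trie = {}
for key in ("010", "011", "101"):
    insert(trie, key)
print(lookup_prefix(trie, "01"))     # -> ['011', '010'] (both keys with prefix 01)
```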
PATRICIA TREE
• A particular type of “trie”
• Example: a trie and a PATRICIA tree over the keys ‘010’, ‘011’, and ‘101’.
[Figure: the binary trie (levels Lv0–Lv2) for the keys 010, 011, and 101, and the corresponding PATRICIA tree in which single-child chains are collapsed so only levels Lv0 and Lv2 remain]
PAT Tree
[Figure: a PAT tree over the text 01100100010111… (positions 1–9 shown); sistrings 1–8 are already indexed, leaves hold sistring positions, internal nodes give the bit position to check, and the query 00101 is traced through the tree]
Try to build the Patricia tree
A 00001
S 10011
E 00101
R 10010
C 00011
H 01000
I 01001
N 01110
G 00111
X 11000
M 01101
P 10000
PAT Tree
[Figure: the resulting Patricia tree over the twelve keys, with leaves A, C, E, G, H, I, M, N, P, R, S, X]
Example
Text        01100100010111 …
sistring 1  01100100010111 …
sistring 2  1100100010111 …
sistring 3  100100010111 …
sistring 4  00100010111 …
sistring 5  0100010111 …
sistring 6  100010111 …
sistring 7  00010111 …
sistring 8  0010111 ...
[Figure: step-by-step construction of the Patricia tree as sistrings 1–8 are inserted; external nodes hold the sistring (integer displacement), internal nodes hold the skip counter & pointer and the total displacement of the bit to be inspected]
SISTRING
• Bit level is too abstract and application dependent; we rarely apply this at the bit level. Character level is a better idea!
– e.g. CUHK
– The corresponding sistrings would be:
• CUHK000…
• UHK000…
• HK000…
• K000…
– We require each sistring to be at least 4 characters long.
– (Why do we pad 0/NULL at the end of each sistring?)
SISTRING (USAGE)
• We may instead store the sistrings of ‘CUHK’, which requires O(n²) storage.
– CUHK <- represents C, CU, CUH, CUHK at the same time
– UHK0 <- represents U, UH, UHK at the same time
– HK00 <- represents H, HK at the same time
– K000 <- represents K only
• A prefix match on sistrings is equivalent to an exact match on the substrings.
• Conclusion: sistrings are a better representation for storing substring information.
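A minimal sketch of the "prefix match on sistrings = exact match on substrings" idea. Instead of a tree it uses a sorted array of sistring start positions (a suffix-array-style structure, which is an assumption of this sketch, not the PAT tree construction itself); only O(n) positions are stored because each sistring is just an offset into the single stored text.

```python
def sistring_index(text):
    """All sistring start positions, sorted by the sistring text: O(n) positions stored."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, index, pattern):
    """Exact substring search = prefix match on some sistring (binary search)."""
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        sistring = text[index[mid]:]
        if sistring.startswith(pattern):
            return True                  # prefix match found
        if sistring < pattern:
            lo = mid + 1                 # any match lies to the right
        else:
            hi = mid                     # any match lies to the left
    return False

text = "CUHK"
idx = sistring_index(text)               # positions of CUHK, HK, K, UHK in sorted order
print(contains(text, idx, "UH"))         # True  ("UH" is a prefix of sistring "UHK")
print(contains(text, idx, "HU"))         # False
```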
PAT Tree (Example)
• By digitizing the string, we can manually visualize what the PAT Tree will look like.
• The actual bit patterns of the four sistrings are:
CUHK 01000011 01010101 01001000 01001011
UHK0 01010101 01001000 01001011 00000000
HK00 01001000 01001011 00000000 00000000
K000 01001011 00000000 00000000 00000000
[Figure: PAT tree over the four sistrings; the root inspects bit 3 (UHK0 branches right), then bit 4 separates CUHK from HK00/K000, then bit 6 separates HK00 from K000; 0 branches left, 1 branches right]
PAT Tree (Example)
• This works! BUT…
– We still need O(n²) memory for storing those sistrings
• We may reduce the memory to O(n) by making use of pointers.

Hello This document is simple   01001000 …
This document is simple         01010100 …
document is simple              01100100 …
is simple                       01101001 …
simple                          01110011 …
PAT Tree of a REAL (but very simple) document
[Figure: PAT tree of the document "Hello. This document is simple.", branching on bits 2, 3, and 4 of the word-level sistrings, with leaves pointing back into the document text]
Space/Time Tradeoffs
[Figure: space vs. time trade-off comparing flat files, inverted files, signature files, and PAT trees]
Stemming
• Reason:
– Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
• Stemming:
– Removing some endings of words
computer, compute, computes, computing, computed, computation → comput
Inverted File, Stemmed
Term       Record   Frequency
comput     1        3
comput     3        5
comput     2        1
distribut  2        1
parallel   1        2
system     2        1
...        ...      ...
Stemming
• am, are, is → be
• car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be differ color
Stemming
• Manual or Automatic
• Can reduce index file size by up to 50%
• Effectiveness studies of stemming are mixed, but in general it has either no effect or a positive effect when measuring both precision and recall
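A minimal suffix-stripping sketch (not the Porter algorithm; the suffix list and minimum stem length are illustrative assumptions) showing how the compute/computer/computation family conflates to the stem "comput".

```python
# Illustrative suffix list, ordered so longer suffixes are tried first
SUFFIXES = ["ational", "ation", "ing", "ers", "er", "ed", "es", "e", "s"]

def crude_stem(word):
    """Strip the longest matching suffix, keeping a stem of at least 3 characters.
    A stand-in for a real stemmer such as Porter's."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["computer", "compute", "computes", "computing", "computed", "computation"]:
    print(w, "->", crude_stem(w))   # every form maps to "comput"
```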
Stopwords
• Stopwords exist in stoplists or negative dictionaries
• Idea: remove words with low semantic content
– the index should only have “important stuff”
• What not to index is domain dependent, but often includes:
– “small” words: a, and, the, but, of, an, very, etc.
– case is removed
– punctuation
Stop words
• Very common words that have no discriminatory power
• e.g. in Arabic: من (from), إلى (to), في (in), …
Normalization
• Token normalization
– Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– U.S.A vs USA
– Anti-discriminatory vs antidiscriminatory
– Car vs automobile?
Capitalization/case folding
• Good for
– Allowing instances of Automobile at the beginning of a sentence to match a query of automobile
– Helping a search engine when most users type ferrari when they are interested in a Ferrari car
• Bad for
– Proper names vs common nouns: General Motors, Associated Press, Black
• Heuristic solution: lowercase only words at the beginning of the sentence; true casing via machine learning
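A minimal tokenization sketch combining case folding, crude punctuation stripping, and stoplist filtering; the stoplist here is a tiny illustrative sample, not a real negative dictionary, and no true casing is attempted.

```python
# Tiny illustrative stoplist (a real negative dictionary is much larger)
STOPLIST = {"a", "an", "and", "are", "at", "be", "but", "of", "the", "very"}

def tokenize(text):
    """Lowercase (case folding), strip surrounding punctuation, drop stopwords."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()").lower()   # crude punctuation removal + case folding
        if token and token not in STOPLIST:
            tokens.append(token)
    return tokens

print(tokenize("The boy's cars are very different colors."))
# -> ["boy's", 'cars', 'different', 'colors']
```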
Performance of search
• 3 major classes of performance measures
– precision / recall
• TREC conference series, http://trec.nist.gov/
– space / time
• see Esler & Nelson, JNCA for an example
• http://techreports.larc.nasa.gov/ltrs/PDF/1997/jp/NASA-97-jnca-sle.pdf
– usability
• probably the most important measure, but largely ignored
Precision and Recall
• Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved)
• Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the database)
Standard Evaluation Measures
Starts with a CONTINGENCY table:

                retrieved    not retrieved
relevant            w              x          n1 = w + x
not relevant        y              z
                n2 = w + y                         N
Precision and Recall
Recall = w / (w + x)
From all the documents that are relevant out there, how many did the IR system retrieve?

Precision = w / (w + y)
From all the documents that are retrieved by the IR system, how many are relevant?
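A minimal sketch of computing precision and recall from sets of retrieved and relevant document ids, using the w / (w + y) and w / (w + x) definitions above; the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    w = len(retrieved & relevant)                            # relevant AND retrieved
    precision = w / len(retrieved) if retrieved else 0.0     # w / (w + y)
    recall = w / len(relevant) if relevant else 0.0          # w / (w + x)
    return precision, recall

# Toy example: 4 documents retrieved, 5 documents are actually relevant, 2 overlap
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)   # -> 0.5 0.4
```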
User-Centered IR Evaluation
• More user-oriented measures
– Satisfaction, informativeness
• Other types of measures
– Time, cost-benefit, error rate, task analysis
• Evaluation of user characteristics
• Evaluation of interface
• Evaluation of process or interaction