Special Topics in Computer Science

Advanced Topics in Information Retrieval

Lecture 4 (book chapter 8): Indexing and Searching

Alexander Gelbukh

www.Gelbukh.com

Previous Chapter: Conclusions

Main measures: Precision & Recall
o For sets
o Rankings are evaluated through initial subsets

There are measures that combine them into one
o Involve user-defined preferences

Many (other) characteristics
o An algorithm can be good at some and bad at others
o Averages are used, but are not always meaningful

Reference collections with known answers exist for evaluating new algorithms

Previous Chapter: Research topics

Different types of interfaces
Interactive systems:
o What measures to use?
o Such as informativeness

Types of searching

Indexed
o Semi-static
o Space overhead

Sequential
o Small texts
o Volatile, or space limited

Combined
o Index into large portions, then sequential inside portion
o Best combination of speed / overhead

Inverted files

Vocabulary: sqrt(n), by Heaps’ law. 1 GB -> 5M
Occurrences: n * 40% (stopwords)
o positions (word, char), files, sections...
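A minimal sketch of the vocabulary/occurrences split, assuming word positions as the addressing unit and a simple `\w+` tokenizer (both are illustrative choices, and `build_inverted_index` is a hypothetical name, not from the lecture):

```python
import re
from collections import defaultdict


def build_inverted_index(text):
    """Map each lowercased word (vocabulary) to its word positions (occurrences)."""
    index = defaultdict(list)
    # Assumption: a "word" is a maximal run of \w characters.
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        index[match.group()].append(position)
    return dict(index)


index = build_inverted_index("to be or not to be")
# index["to"] == [0, 4]; the vocabulary is the set of keys.
```

Character offsets, file numbers, or section numbers could be stored instead of word positions by changing what is appended.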

Compression: Block addressing

Block addressing: 5% overhead
o 256, 64K, ... blocks (pointers of 1, 2, ... bytes)
o Equal size (faster search) or logical sections (retrieval units)
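Block addressing can be sketched as follows, assuming equal-size blocks measured in words rather than bytes (a simplification for readability). The index stores only block numbers, and exact positions are recovered by a sequential scan inside each candidate block. The function names are hypothetical.

```python
from collections import defaultdict


def build_block_index(words, words_per_block):
    """Split the word list into equal-size blocks; map word -> block numbers."""
    blocks = [words[i:i + words_per_block]
              for i in range(0, len(words), words_per_block)]
    index = defaultdict(list)
    for number, block in enumerate(blocks):
        for word in set(block):          # one entry per block, not per occurrence
            index[word].append(number)
    return blocks, dict(index)


def find_positions(word, blocks, index, words_per_block):
    """Use the index to pick blocks, then scan each block sequentially."""
    positions = []
    for number in index.get(word, []):
        for offset, candidate in enumerate(blocks[number]):
            if candidate == word:
                positions.append(number * words_per_block + offset)
    return positions
```

Larger blocks shrink the index (fewer, shorter pointers) at the cost of more sequential scanning per query.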

Searching in inverted files

Vocabulary search
o Separate file
o Many searching techniques
o Lexicographic: log V (voc. size) = ½ log n (Heaps)
o Hashing is not good for prefix search

Retrieval of occurrences
Manipulation of occurrences: ~sqrt(n) (Heaps, Zipf)
o Boolean operations. Context search
o Merging occurrences
o For AND: one list is usually shorter (Zipf law), so sublinear!

Only inverted files allow both space and time to be sublinear
o Suffix trees and signature files don’t
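The point about lexicographic search versus hashing can be illustrated with a sorted vocabulary and binary search. This is a sketch assuming the vocabulary fits in a Python list; `lookup` and `prefix_matches` are hypothetical names.

```python
from bisect import bisect_left


def lookup(vocabulary, word):
    """Binary search in a sorted vocabulary: O(log V) comparisons."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1


def prefix_matches(vocabulary, prefix):
    """All words starting with prefix; they are contiguous in sorted order."""
    start = bisect_left(vocabulary, prefix)
    end = start
    while end < len(vocabulary) and vocabulary[end].startswith(prefix):
        end += 1
    return vocabulary[start:end]


vocabulary = sorted(["to", "be", "or", "not", "bee", "bed"])
# prefix_matches(vocabulary, "be") == ["be", "bed", "bee"]
```

A hash table would answer `lookup` but gives no ordering, so the prefix query would need a full scan of the vocabulary.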

Building inverted file: 1

Infinite memory? Use trie to store vocabulary. O(n)
o append positions

Finite memory? Build in chunks, merge. Almost O(n)
Insertion: index + merge. Deleting: O(n). Very fast.
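The finite-memory case can be sketched as: index one chunk at a time, then merge partial indexes. This sketch keeps everything in memory and merges sequentially, whereas a real implementation would write partial indexes to disk and might merge them pairwise. A plain dict stands in for the trie, and all names are hypothetical.

```python
from collections import defaultdict


def index_chunk(words, start):
    """Index one chunk; positions are global (offset by the chunk start)."""
    partial = defaultdict(list)
    for offset, word in enumerate(words):
        partial[word].append(start + offset)
    return partial


def merge_indexes(left, right):
    """Merge two partial indexes; right holds later positions, so lists stay sorted."""
    merged = {word: list(positions) for word, positions in left.items()}
    for word, positions in right.items():
        merged.setdefault(word, []).extend(positions)
    return merged


def build_in_chunks(words, chunk_size):
    index = {}
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        index = merge_indexes(index, index_chunk(chunk, start))
    return index
```

Insertion of new text follows the same pattern: index the new portion, then merge it into the existing index.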

Suffix trees

Text as one long string. No words.
o Genetic databases
o Complex queries
o Compacted trie structure
o Problem: space

For text retrieval, inverted files are better


Info for tree comes from the text itself

Suffix array

All suffixes (by position) in lexicographic order
Allows binary search
Much less space: 40% n
Supra-index: sampling, for better disk access
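A small sketch of a suffix array and its binary search, assuming the whole text fits in memory. Construction here is a naive sort of suffix start positions, which is fine for illustration but not for large texts. Function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary-search the range of suffixes that start with pattern."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "abracadabra"
suffix_array = build_suffix_array(text)
# find_occurrences(text, suffix_array, "abra") == [0, 7]
```

The pattern need not be a word: any substring of the text can be searched this way.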

Suffix tree and suffix array: Searching. Construction

Searching
Patterns, prefixes, phrases. Not only words
Suffix tree: O(m), but: space (m = query size)
Suffix array: O(log n) (n = database size)

Construction of arrays: sorting
o Large text: n^2 log(M)/M, more than for inverted files
o Skip details

Addition: n n' log(M)/M (n' is the size of new portion)
Deletion: n

Signature files

Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all bits of its pattern are set
Sequential search for blocks
False drops!
o Design of the hash function
o Have to traverse the block

Good to search ANDs or proximity queries
o bit patterns are ORed
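The mechanism can be sketched as follows. The signature width, the number of bits per word, and the use of SHA-256 as the hash are all assumptions made for this illustration, not choices from the lecture; the names are hypothetical.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word (fewer on collisions)


def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for byte in digest[:BITS_PER_WORD]:
        signature |= 1 << (byte % SIGNATURE_BITS)
    return signature


def block_signature(words):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature


def search(blocks, signatures, word):
    """Scan all signatures; verify candidates to discard false drops."""
    pattern = word_signature(word)
    hits = []
    for number, signature in enumerate(signatures):
        if (signature & pattern) == pattern:      # candidate block
            if word in blocks[number]:            # traverse the block to confirm
                hits.append(number)
    return hits
```

The signature test never misses a block that contains the word, but it can accept a block that does not (a false drop), which is why the confirming traversal is needed.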


False drop: letters in 2nd block

Boolean operations

Merging file (occurrences) lists
o AND: to find repetitions

According to query syntax tree
Complexity linear in intermediate results
o Can be slow if they are huge

There are optimization techniques
o E.g.: merge small list with a big one by searching
o This is a usual case (Zipf)
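Both ways of computing AND can be sketched on sorted occurrence lists: a linear merge, and the optimization of binary-searching each element of the short list in the long one. Function names are hypothetical.

```python
from bisect import bisect_left


def intersect_merge(a, b):
    """AND by merging two sorted lists: O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_by_search(small, big):
    """AND by searching: O(len(small) * log(len(big)))."""
    result = []
    for item in small:
        k = bisect_left(big, item)
        if k < len(big) and big[k] == item:
            result.append(item)
    return result
```

When one list is much shorter than the other, the second variant touches only a small part of the long list.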

Sequential search

Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst-case, O(n) on average
MANY faster algorithms, but more complicated
o See the book
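For reference, the brute-force method looks like this (a sketch; `brute_force_search` is a hypothetical name). It tries every start position in the text and compares the pattern character by character.

```python
def brute_force_search(text, pattern):
    """Return all start positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        k = 0
        # Compare until a mismatch or the end of the pattern.
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            matches.append(start)
    return matches
```

The inner loop usually stops after one or two comparisons on natural text, which is where the O(n) average comes from; the O(nm) worst case needs inputs like a long run of one letter.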

Approximate string matching

Match with k errors, select the one with min k
Levenshtein distance between strings s1 and s2
o The minimum number of editing operations to make one from another
o Symmetric for standard sets of operations
o Operations: deletion, addition, change
o Sometimes weighted

Solution: dynamic programming. O(mn), O(kn)
o m, n are lengths of the two strings
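The O(mn) dynamic programming solution can be sketched as follows, assuming unit cost for each of the three operations (the unweighted case). Only two rows of the table are kept at a time.

```python
def levenshtein(s1, s2):
    """Minimum number of deletions, additions, and changes turning s1 into s2."""
    previous = list(range(len(s2) + 1))      # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]                        # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,             # delete c1
                current[j - 1] + 1,          # add c2
                previous[j - 1] + (c1 != c2) # change c1 into c2 (free if equal)
            ))
        previous = current
    return previous[-1]


# levenshtein("kitten", "sitting") == 3
```

Weighted variants replace the three `+ 1` costs with per-operation weights.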

Regular expressions

Regular expressions
o Automaton: O(m 2^m) + O(n), bad for long patterns
o There are better methods, see book

Using indices to search for words with errors
o Inverted files: search in vocabulary
o Suffix trees and suffix arrays: the same algorithms as for search without errors! Just allow deviations from the path

Search over compression

Improves both space AND time (fewer disk operations)
Compress query and search
o Huffman compression, words as symbols, bytes (frequencies: most frequent shorter)
o Look up each word in the vocabulary to get its code
o More sophisticated algorithms

Compressed inverted files: less disk, less time

Text and index compression can be combined
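A rough sketch of "compress the query and search", using word-based Huffman coding. For readability the codes are strings of "0"/"1" characters rather than the byte-oriented codes the slide mentions, and the search walks codeword boundaries instead of using a fast string matcher. All names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(words):
    """Return {word: bit string}; frequent words never get longer codes than rarer ones."""
    counts = Counter(words)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, tie-breaker, {word: code suffix so far}).
    heap = [(count, order, {word: ""})
            for order, (word, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    order = len(heap)
    while len(heap) > 1:
        count1, _, codes1 = heapq.heappop(heap)
        count2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (count1 + count2, order, merged))
        order += 1
    return heap[0][2]


def compress(words, codes):
    return "".join(codes[word] for word in words)


def find_word(compressed, codes, word):
    """Search for the word's code in the compressed text without decoding words."""
    target = codes.get(word)
    if target is None:                 # word not in vocabulary: cannot occur
        return []
    code_set = set(codes.values())
    positions, start, index = [], 0, 0
    for end in range(1, len(compressed) + 1):
        piece = compressed[start:end]
        if piece in code_set:          # prefix-free: this is a whole codeword
            if piece == target:
                positions.append(index)
            start, index = end, index + 1
    return positions
```

The query word is translated to its code once, via the vocabulary, and the text is compared in compressed form only.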

...compression

Suffix trees can be compressed almost to size of suffix arrays

Suffix arrays can’t be compressed (almost random), but can be constructed over compressed text
o instead of Huffman, use a code that respects alphabetic order
o almost the same compression

Signature files are sparse, so can be compressed
o ratios up to 70%

Research topics

Perhaps, new details in integration of compression and search

“Linguistic” indexing: allowing linguistic variations
o Search in plural or only singular
o Search with or without synonyms

Conclusions

Inverted files seem to be the best option
Other structures are good for specific cases
o Genetic databases

Sequential searching is an integral part of many indexing-based search techniques
o Many methods to improve sequential searching

Compression can be integrated with search

Thank you!

Till April 26, 6 pm