
The Power of Prefix Search

(with a nice open problem)

Holger Bast, Max-Planck-Institut für Informatik

Saarbrücken, Germany

Talk at ADS 2007 in Bertinoro, October 3rd

Overview

Part 1

– Definition of our prefix search problem

– Applications

– Demos of our search engine

Part 2

– Problem definition again

– One way to solve it

– Another way to solve it

– Your way to solve it

Part 1

Definition, Applications, Demos

Problem Definition — Formal

Context-Sensitive Prefix Search

Preprocess

– a given collection of text documents such that queries of the following kind can be processed efficiently

Given

– an arbitrary set of documents D

– and a range of words W

Compute

– all word-in-document pairs (w, d) such that w ∈ W and d ∈ D
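
To make the definition concrete, here is a minimal C++ sketch of the query semantics. The types and the brute-force loop over the collection are made up for illustration (no preprocessing at all); it only pins down what the answer contains, not how an index computes it.

#include <set>
#include <utility>
#include <vector>

// Hypothetical integer ids; the talk's D1, D2, ... and A, B, C, ... become ints.
using DocId  = int;
using WordId = int;

// One document: its id and the word ids it contains.
struct Document {
  DocId id;
  std::vector<WordId> words;
};

// Reference semantics of context-sensitive prefix search:
// all word-in-document pairs (w, d) with w in [wLow, wHigh] and d in D.
std::vector<std::pair<WordId, DocId>> prefixSearch(
    const std::vector<Document>& collection,
    const std::set<DocId>& D, WordId wLow, WordId wHigh) {
  std::vector<std::pair<WordId, DocId>> result;
  for (const Document& doc : collection) {
    if (D.count(doc.id) == 0) continue;        // d must be in D
    for (WordId w : doc.words)
      if (wLow <= w && w <= wHigh)             // w must be in the range W
        result.emplace_back(w, doc.id);
  }
  return result;
}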


Problem Definition — Visual

[Figure: example collection of documents (D1, D2, D3, …), each shown with the word ids it contains; the example query is the sorted doc-id list D13 D17 D88 … and the word-id range C D E F G H]

Data is given as

– documents containing words

– documents have ids (D1, D2, …)

– words have ids (A, B, C, …)

Query

– given a sorted list of doc ids

– and a range of word ids


Answer

– all matching word-in-doc pairs

– with scores

– and positions

[Figure: answer for the example query, the matching word-in-doc pairs (D13, E), (D88, E), (D88, G) with scores 0.5, 0.2, 0.7 and positions 5, 7, 1]


Application 1: Autocompletion

After each keystroke

– display completions of the last query word that lead to the best hits, together with the best such hits

– e.g., for the query probabilistic alg, display algorithm and algebra, and show hits for both

Application 2: Error Correction

As before, but also …

– … display spelling variants of completions that would lead to a hit

– e.g., for the query probabilistic algorithm also consider a document containing probalistic aigorithm

Implementation

– if, say, aigorithm occurs as a misspelling of algorithm, then for every occurrence of aigorithm in the index

aigorithm Doc. 17

also add

algorithm::aigorithm Doc. 17
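
A small C++ sketch of this indexing trick. The entry format algorithm::aigorithm is from the slide; the IndexEntry type and the helper function are hypothetical, only meant to show where the extra postings come from.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical posting: an index word together with a document id.
struct IndexEntry {
  std::string word;
  int doc;
};

// For every occurrence of a known misspelling, add an artificial word
// "correct::misspelled" with the same doc id, so that a prefix query for
// the correctly spelled word also finds the misspelled occurrences.
void addErrorCorrectionEntries(std::vector<IndexEntry>& index,
                               const std::string& misspelled,  // e.g. "aigorithm"
                               const std::string& correct) {   // e.g. "algorithm"
  const std::size_t n = index.size();          // scan only the original entries
  for (std::size_t i = 0; i < n; ++i)
    if (index[i].word == misspelled)
      index.push_back({correct + "::" + misspelled, index[i].doc});
}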

Application 3: Query Expansion

As before, but also …

– … display words related to completions that would lead to a hit

– e.g., for the query russia metal also consider documents containing russia aluminium

Implementation

– for, say, every occurrence of aluminium in the index

aluminium Doc. 17

also add (once for every occurrence)

s:67:aluminium Doc. 17

and (once for the whole collection)

s:aluminium:67 Doc. 00
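
The same kind of sketch for the query-expansion entries; 67 is the id of the cluster of related words from the slide, Doc. 00 is rendered as doc id 0, and the IndexEntry type and the function are again hypothetical.

#include <cstddef>
#include <string>
#include <vector>

struct IndexEntry {   // same hypothetical posting type as in the sketch above
  std::string word;
  int doc;
};

// For a word that belongs to a cluster of related words (here cluster 67):
// add "s:67:word" once per occurrence and "s:word:67" once per collection.
void addExpansionEntries(std::vector<IndexEntry>& index,
                         const std::string& word,  // e.g. "aluminium"
                         int clusterId) {          // e.g. 67
  const std::size_t n = index.size();
  for (std::size_t i = 0; i < n; ++i)
    if (index[i].word == word)
      index.push_back({"s:" + std::to_string(clusterId) + ":" + word,
                       index[i].doc});
  // One global entry per (word, cluster), attached to the dummy document 0.
  index.push_back({"s:" + word + ":" + std::to_string(clusterId), 0});
}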

Application 4: Faceted Search

As before, but also …

– … along with the completions and hits, display a breakdown of the result set by various categories

– e.g., for the query algorithm show (prominent) authors of articles containing these words

Implementation

– for, say, an article by Camil Demetrescu that appeared in SODA 2006, add

author:Camil_Demetrescu Doc. 17
venue:SODA Doc. 17
year:2006 Doc. 17

– also add

camil:author:Camil_Demetrescu Doc. 17
demetrescu:author:Camil_Demetrescu Doc. 17
etc.
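
And a sketch for the faceted-search entries of this SODA 2006 example; the entry strings are taken verbatim from the slide, while the IndexEntry type and the function are illustrative.

#include <string>
#include <vector>

struct IndexEntry {   // same hypothetical posting type as in the sketches above
  std::string word;
  int doc;
};

// Add one artificial word per facet value of the article in document 'doc',
// plus prefix-searchable variants for the tokens of the author name.
void addFacetEntries(std::vector<IndexEntry>& index, int doc) {
  index.push_back({"author:Camil_Demetrescu", doc});
  index.push_back({"venue:SODA", doc});
  index.push_back({"year:2006", doc});
  index.push_back({"camil:author:Camil_Demetrescu", doc});
  index.push_back({"demetrescu:author:Camil_Demetrescu", doc});
}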

Application 5: Semantic Search

As before, but also …

– … display “semantic” completions

– e.g., for the query beatles musician, display instances of the class musician that occur together with the word beatles

Implementation

– cannot simply duplicate index entries of an entity for each category it belongs to, e.g. John Lennon is a

singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …

– tricky combination of completions and joins SIGIR’07

and still more applications …

Part 2

Solutions and Open Problem

Solution 1: Inverted Index

For example, probab* alg*

given the documents: D13, D17, D88, … (ids of hits for probab*)

and the word range: C D E F G (ids for alg*)

Iterate over all words from the given range

C (algae) D8, D23, D291, ...

D (algarve) D24, D36, D165, ...

E (algebra) D13, D24, D88, ...

F (algol) D56, D129, D251, ...

G (algorithm) D3, D15, D88, ...

Intersect each list with the given one and merge the results

doc ids:  D13 D88 D88 …
word ids: E   E   G   …

running time ~ |D| ∙ |W| + log |W| ∙ merge volume
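
A C++ sketch of this INV-based query: iterate over the words in the range, intersect each inverted list with D, and combine the partial results (here by a final sort, standing in for the multiway merge that gives the log |W| factor). The in-memory std::map index is a stand-in for illustration; real inverted lists are stored compressed.

#include <algorithm>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

using DocId  = int;
using WordId = int;

// Hypothetical in-memory inverted index: word id -> sorted list of doc ids.
using InvertedIndex = std::map<WordId, std::vector<DocId>>;

// D is sorted; the word range W is [wLow, wHigh].
std::vector<std::pair<DocId, WordId>> queryInv(const InvertedIndex& inv,
                                               const std::vector<DocId>& D,
                                               WordId wLow, WordId wHigh) {
  std::vector<std::pair<DocId, WordId>> result;
  // Iterate over all words from the given range ...
  for (auto it = inv.lower_bound(wLow);
       it != inv.end() && it->first <= wHigh; ++it) {
    // ... intersect each list with the given doc ids ...
    std::vector<DocId> common;
    std::set_intersection(it->second.begin(), it->second.end(),
                          D.begin(), D.end(), std::back_inserter(common));
    for (DocId d : common) result.emplace_back(d, it->first);
  }
  // ... and merge the per-word results into one list sorted by doc id.
  std::sort(result.begin(), result.end());
  return result;
}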

A General Idea

Precompute inverted lists for ranges of words

list for A-D:
doc ids:  1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
word ids: D A C A B A C A D A A  B  C  A  C  A

Note

– each prefix corresponds to a word range

– ideally precompute list for each possible prefix

– too much space

– but lots of redundancy


Solution 2: AutoTree SPIRE’06 / JIR’07

Trick 1: Relative bit vectors

– the i-th bit of the root node corresponds to the i-th doc

– the i-th bit of any other node corresponds to the i-th set bit of its parent node

aachen-zyskowski  1 1 1 1 1 1 1 1 1 1 1 1 1 …
maakeb-zyskowski  1 0 0 1 0 0 0 1 1 1 1 0 1 …
maakeb-stream     1 0 0 1 1 1 0 …

(example from the figure: a bit of the root corresponds to doc 5, the corresponding set bit of maakeb-zyskowski also corresponds to doc 5, and a set bit further down corresponds to doc 10)
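
A tiny C++ sketch of how such relative bit vectors are navigated: translating a position in the parent's bit vector into the position in a child's bit vector is a rank query over the parent's bits. The rank is computed naively here; AutoTree uses constant-time bit-rank structures for this.

#include <vector>

// Trick 1 navigation: bit i of a node refers to the i-th set bit of its
// parent (at the root, bit i refers to the i-th doc).

// Naive rank: how many set bits of 'bits' lie strictly before position 'pos'?
static int rank1(const std::vector<bool>& bits, int pos) {
  int r = 0;
  for (int i = 0; i < pos; ++i)
    if (bits[i]) ++r;
  return r;
}

// Map a (0-based) position in the parent's bit vector to the corresponding
// position in a child's bit vector; -1 means the doc does not occur under
// that child, because its bit in the parent is 0.
int parentPosToChildPos(const std::vector<bool>& parentBits, int parentPos) {
  if (!parentBits[parentPos]) return -1;
  return rank1(parentBits, parentPos);
}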

Solution 2: AutoTree SPIRE’06 / JIR’07

Trick 2: Push up the words

– For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node

[Figure: the tree from Trick 1 with words pushed up. Bit vectors
1 1 1 1 1 1 1 1 1 1 …
1 0 0 0 1 0 0 1 1 1 …
1 0 0 1 1 …
and, next to each set bit, words such as aachen, advance, algol, algorithm, art, manner, manning, maximal, maximum, maple, …]

Example query: D = 5, 7, 10; W = max*
D = 5, 10 (→ 2, 5); report: maximum
D = 5; report: Ø → STOP

Solution 2: AutoTree SPIRE’06 / JIR’07

Trick 3: divide into blocks

– and build a tree over each block as shown before


Theorem:

– query processing time O(|D| + |output|)

– uses no more space than an inverted index

AutoTree Summary:

+ output-sensitive

– not IO-efficient (heavy use of bit-rank operations)

– compression not optimal

Parenthesis

Despite its quadratic worst-case complexity, the inverted index is hard to beat in practice

– very simple code

– lists are highly compressible

– perfect locality of access

Number of operations is a deceptive measure

– 100 disk seeks take about half a second

– in that time one can read 200 MB of contiguous data (if stored compressed)

– main memory: 100 non-local accesses ≈ reading a 10 KB block of data

Solution 3: HYB

Flat division of the word range into blocks

SIGIR’06 / IR’07

list for A-D:
doc ids:  1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
word ids: D A C A B A C A D A A  B  C  A  C  A

list for E-J:
doc ids:  2 2 3 3 4 4 7 7 8 8 9 9 11
word ids: E F G J H I I E F G H J I

list for K-N:
doc ids:  1 1 2 3 4 5 6 6 6 8 9 9 9 10 10
word ids: L N M N N K L M N M K L M K L

Solution 3: HYB

Flat division of word range into blocks

Replace doc ids by gaps and words by frequency ranks:

doc ids:  1   3   5   5   5   6   7   8   8   9   11  11  11  12  13  15
word ids: D   A   C   A   B   A   C   A   D   A   A   B   C   A   C   A

gaps:     +1  +2  +0  +2  +0  +1  +1  +1  +0  +1  +2  +0  +0  +1  +1  +2
ranks:    3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st

Encode both gaps and ranks such that a value x gets ~ log2 x bits:
gaps:  +0 → 0, +1 → 10, +2 → 110
ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110

encoded gaps:  10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
encoded ranks: 111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0

An actual block of HYB

SIGIR’06 / IR’07
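
A C++ sketch of encoding one such block with exactly the codes from the slide, gaps first and then ranks. The two code tables are hard-wired to reproduce the example; a real encoder would derive (near) entropy-optimal codes from the observed gap and rank frequencies.

#include <map>
#include <string>
#include <vector>

// Encode one HYB block: the sequence of doc-id gaps followed by the sequence
// of word frequency ranks, each written with a prefix-free code.
std::string encodeHybBlock(const std::vector<int>& gaps,
                           const std::vector<int>& ranks) {
  const std::map<int, std::string> gapCode  = {{0, "0"}, {1, "10"}, {2, "110"}};
  const std::map<int, std::string> rankCode = {{1, "0"}, {2, "10"},
                                               {3, "111"}, {4, "110"}};
  std::string bits;
  for (int g : gaps)  bits += gapCode.at(g) + " ";   // all gaps ...
  for (int r : ranks) bits += rankCode.at(r) + " ";  // ... then all ranks
  return bits;
}

Called with the example gaps +1 +2 +0 +2 … and ranks 3rd 1st 2nd 1st …, this reproduces the encoded block shown above.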

Solution 3: HYB

Flat division of word range into blocks

Theorem:

– Let n = number of documents, m = number of words

– If blocks are chosen of equal volume ε ∙ n

– Then the query time is ~ ε ∙ n and the empirical entropy is H_HYB ~ (1 + ε) ∙ H_INV


SIGIR’06 / IR’07

HYB Summary:

+ IO-efficient (mere scans of data)

+ very good compression

– not output-sensitive

Open Problem

A solution for context-sensitive prefix search which is both output-sensitive and IO-efficient

– Note: the interesting queries are those with large D and W but small result set

Similar situation for substring search / suffix arrays

– all algorithms with good compression have poor locality of access

But prefix search is easier …

– … and more relevant for text search

Thank you!

INV vs. HYB — Space Consumption

Theorem: The empirical entropy of INV is Σ ni ∙ (1/ln 2 + log2(n/ni))

Theorem: The empirical entropy of HYB with block size ε ∙ n is Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))

            HOMEOPATHY        WIKIPEDIA         TREC .GOV
docs        44,015            2,866,503         25,204,013
words       263,817           6,700,119         25,263,176
positions   with positions    with positions    no positions
raw size    452 MB            7.4 GB            426 GB
INV         13 MB             0.48 GB           4.6 GB
HYB         14 MB             0.51 GB           4.9 GB

Nice match of theory and practice

(ni = number of documents containing the i-th word, n = number of documents)

INV vs. HYB — Query Time

            HOMEOPATHY            WIKIPEDIA             TREC .GOV
docs        44,015                2,866,503             25,204,013
words       263,817               6,700,119             25,263,176
queries     5,732 real queries    100 random queries    50 TREC queries
            with proximity        with proximity        no proximity
INV         avg 0.03 secs         avg 0.17 secs         avg 0.58 secs
            max 0.38 secs         max 2.27 secs         max 16.83 secs
HYB         avg 0.003 secs        avg 0.05 secs         avg 0.11 secs
            max 0.06 secs         max 0.49 secs         max 0.86 secs

Experiment: type ordinary queries from left to right

db , dbl , dblp , dblp un , dblp uni , dblp univ , dblp unive , ...

HYB beats INV by an order of magnitude

Engineering

Careful implementation in C++

– Experiment: sum over array of 10 million 4-byte integers (on a Linux PC with an approx. 2 GB/sec memory bandwidth)

With HYB, every query is essentially one block scan

– perfect locality of access, no sorting or merging, etc.

– balanced ratio of read, decompression, processing, etc.

C++           Java          MySQL        Perl
1800 MB/sec   300 MB/sec    16 MB/sec    2 MB/sec

read    decomp.    intersect    rank    history
21%     18%        11%          15%     35%
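
For reference, a minimal modern C++ version of that micro-benchmark: sum 10 million 4-byte integers and report the effective scan speed. The absolute numbers of course depend on machine, compiler, and optimization flags.

#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  // 10 million 4-byte integers, as in the experiment on the slide.
  std::vector<std::int32_t> a(10000000, 1);

  const auto start = std::chrono::steady_clock::now();
  const std::int64_t sum =
      std::accumulate(a.begin(), a.end(), std::int64_t{0});
  const auto stop = std::chrono::steady_clock::now();

  const double secs = std::chrono::duration<double>(stop - start).count();
  const double mbPerSec =
      static_cast<double>(a.size() * sizeof(std::int32_t)) / secs / (1 << 20);
  std::cout << "sum = " << sum << ", scan speed ~ " << mbPerSec << " MB/sec\n";
}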


System Design — High Level View

Debugging such an application is hell!

Compute Server (C++)

Web Server (PHP)

User Client (JavaScript)

Basic Problem Definition

Definition: Context-sensitive prefix search and completion

Given a query consisting of

– sorted list D of doc ids: Doc15 Doc183 Doc185 Doc17351 …

– range W of word ids: Word1893 – Word7329

Compute as a result

– all (w, d) with w ∈ W, d ∈ D, sorted by doc id:
  doc ids:  Doc15    Doc15    Doc17351 ...
  word ids: Word7014 Word5112 Word2011 …

Refinements

– positions: Pos12 Pos73 Pos44 ...

– scores: 0.7 0.3 0.5 ...

Basic Problem Definition

For example, dblp uni

– set D = document ids from result for dblp

– range W = word ids of all words starting with uni

→ multi-dimensional query processed as sequence of 1½ dimensional queries

For example, intersect completions of results for conf:sigir author: and conf:sigmod author:

→ efficient, because the completions are from a small range

[Figure: the two result lists with their completions, e.g. doc ids D11 D25 D57 D91 with word ids W25 W23 W24 W24, and doc ids D23 D54 D56 D58 D69 with word ids W27 W27 W23 W23 W27]


Conclusions

Context-sensitive prefix search and completion

– is a fundamental operation

supports autocompletion search, semantic search, faceted search, DB-style selects and joins, ontology search, …

– efficient support via HYB index

very good compression properties

perfect locality of access

Some open issues

– integrate top-k query processing

– what else can we do with it?

– very short prefixes