Efficient Search in Very Large Text Collections, Databases, and Ontologies Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany DFG Priority Programme “Algorithm Engineering” Kickoff Meeting in Karlsruhe, December 2 – 3, 2007


Efficient Search in Very Large Text Collections, Databases, and

Ontologies

Holger Bast

Max-Planck-Institut für Informatik, Saarbrücken, Germany

DFG Priority Programme “Algorithm Engineering”, Kickoff Meeting in Karlsruhe, December 2 – 3, 2007


General theme of this project

Search engines

– large variety of challenging algorithmic problems with high practical relevance

– algorithm engineering is absolutely essential

Focus on scalability

– terabytes of data, hundreds of millions of documents

– query times in a fraction of a second

Focus on advanced queries

– beyond Google-style keyword search

– but still as efficient in time and space

Fancy Searches, yet Fast

efficiency is often a secondary issue in DB, AI, CL, or ML research


Problems encountered in this project

Indexing: fast queries, succinct index, fast construction

– Index structures for advanced queries (beyond keyword search)

– How to build them fast

Learning from text: scalable, yet effective

– large-scale spelling correction

– large-scale synonymy detection

– large-scale entity annotation

“Basic Toolbox” (for search)

– fast intersection of (sorted) sequences

– efficient (de)compression

I will give a few glimpses in the following

algorythm → algorithm

web ≈ internet

Einstein the physicist? the physical unit? the musicologist?

possible synergies with Peter Sanders’ project


Prefix Completion

Fundamental search problem

– definition on next slide

– many notoriously difficult search problems can be reduced to it

– for example, faceted search:

for, say, an article by Peter Sanders that appeared in WEA 2007, add

author:Peter Sanders Doc. 17
venue:WEA Doc. 17
year:2007 Doc. 17


Prefix Completion — Problem Definition

Data is given as

– documents containing words

– documents have ids (D1, D2, …)

– words have ids (A, B, C, …)

Query

– given a sorted list of doc ids

– and a range of word ids

Answer

– all matching word-in-doc pairs

– with scores

– and positions

[Figure: the document collection, one box per document showing its doc id and word ids]

D98: E B A S
D78: K L S
D53: J D E A
D2: B F A
D4: K L K A B
D9: E E R
D27: K L D F
D92: P U D E M
D43: D Q
D32: I L S D H
D1: A O E W H
D88: P A E G Q
D3: Q D A
D17: B W U K A
D74: J W Q
D13: A O E W H

Query: doc ids D13 D17 D88 …, word range C D E F G

Answer:

doc   word   score   position
D13   E      0.5     5
D88   E      0.2     7
D88   G      0.7     1


Prefix Completion — via the Inverted Index

For example, algor* eng*

given the documents: D13, D17, D88, … (ids of hits for algor*)

and the word range : C D E F G (ids for eng*)

Iterate over all words from the given range

C (engage) D8, D23, D291, ...

D (engel) D24, D36, D165, ...

E (engine) D13, D24, D88, ...

F (engines) D56, D129, D251, ...

G (engineering) D3, D15, D88, ...

Intersect each list with the given one and merge the results

D13 D88 D88 …
E   E   G   …

running time |D|∙ |W| + log |W|∙ merge volume
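The procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's C++ code: the dictionary `INVERTED` holds only the doc ids visible on the slide (the "..." tails are left out), and the function names are made up for the example. Each list of the word range is intersected with the given doc list, and the per-word results are merged by doc id with a heap, which is where the log |W| factor in the running time comes from.

```python
import heapq

# Toy inverted index from the slide: word id -> (word, sorted doc ids).
# Only the doc ids shown on the slide are included.
INVERTED = {
    "C": ("engage", [8, 23, 291]),
    "D": ("engel", [24, 36, 165]),
    "E": ("engine", [13, 24, 88]),
    "F": ("engines", [56, 129, 251]),
    "G": ("engineering", [3, 15, 88]),
}


def intersect(a, b):
    """Intersect two sorted doc id lists with a linear two-pointer scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def prefix_completion(docs, word_range):
    """Return all (doc id, word id) pairs, sorted by doc id."""
    per_word = []
    for w in word_range:
        hits = intersect(docs, INVERTED[w][1])  # one intersection per word
        per_word.append([(d, w) for d in hits])
    return list(heapq.merge(*per_word))         # k-way merge of sorted lists


result = prefix_completion([13, 17, 88], ["C", "D", "E", "F", "G"])
print(result)  # [(13, 'E'), (88, 'E'), (88, 'G')]
```

The output matches the merged result on the slide (D13 E, D88 E, D88 G). Every word in the range costs one pass over the doc list even if it contributes nothing, which is the source of the |D| ∙ |W| term.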


Prefix Completion — Status Quo & Problems

The inverted index

– highly compressible

– perfect locality of access (T operations cost about T / block size IOs)

– but quadratic worst-case complexity

AutoTree [Bast, Weber, Mortensen, SPIRE’06]

– output-sensitive (query time linear in size of output)

– but poor locality of access (heavy use of bit rank operations)

The half-inverted index [Bast, Weber, SIGIR’06]

– highly compressible + perfect locality of access

– query time linear in the number of docs, with small constant

Major open problem: output-sensitive and IO-efficient

Note: time for 100 disk seeks = time for reading 200 MB of compressed data

99% correlation with actual running times

perfect prediction of time & space consumption


Error-Tolerant Search

With prefix search available, reduces to the following

– Problem: Given a set of distinct words (lexicon), find all clusters of words that are spelling variants of each other

algorithm

algorytm

alogrithm

logaythm

logarithm

mahcine

machine

maschine

Challenges

– find appropriate measure of distance between words

– algorithm that scales in theory as well as in practice

Master thesis of Marjan Celikik (talk on Wednesday)

possible synergies with Ernst Mayr’s project
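To make the clustering problem concrete, here is a naive baseline in Python. It is not the method of the thesis mentioned above: it assumes plain Levenshtein distance with a fixed threshold (`max_dist`, a made-up parameter), compares all pairs of words, and links close pairs with union-find. On the words from this slide it shows both challenges at once: the all-pairs loop is quadratic in the lexicon size, and the distance measure is too crude, because "alogrithm" is within distance 2 of "logarithm", so the algorithm and logarithm groups collapse into one cluster.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cluster_variants(lexicon, max_dist=2):
    """Single-linkage clusters: words are linked if their distance <= max_dist."""
    parent = {w: w for w in lexicon}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in combinations(lexicon, 2):  # quadratic: does not scale
        if edit_distance(a, b) <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for w in lexicon:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())


LEXICON = ["algorithm", "algorytm", "alogrithm", "logaythm", "logarithm",
           "mahcine", "machine", "maschine"]
clusters = cluster_variants(LEXICON)
print(sorted(len(c) for c in clusters))  # [3, 5]
```

The three machine variants end up together as intended, while the other five words form a single cluster that mixes two different correct words.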


Semantic Search — Problems

Problem 1: how to index

– previous engines built on top of DBMS (e.g., Oracle)

– DBMSs are hard to control (opposite of algorithm engineering)

– ongoing work: reduction to prefix search and join

Problem 2: integrate an ontology

– relate words / phrases in text to entities from ontology

– no time for deep parsing, reasoning etc.

– learn from neighboring words

– numerous algorithmic and engineering problems to make it scale to something like Wikipedia (> 10,000,000,000 words)

DBMS = Database Management System


Semantic Search — Entity Recognition

Recognize entities by looking at neighboring words

Quantum inequalities

Einstein's theory of General Relativity amounts to a description …

Albert Einstein, the physicist

is a: physicist, mathematician, vegetarian, person, entity, …

born in: 1879

Violin Sonata No. 5

…, according to Einstein's Mozart: His Character, His Work.

Alfred Einstein, the musicologist

is a: musicologist, scholar, intellectual, person, entity, …

born in: 1880


Software

Enhance our prototype

– improve source code, documentation, …

– integrate our results into the system

Make available to others

– public demonstrators

– as a platform for experimentation

– as a fancy search engine construction toolkit

Thank you!


General theme of this project

Project title

Efficient Search in Very Large Text Collections, Databases, and Ontologies

In short

Fancy searches, yet fast

– advanced search, yet highly scalable

– quality is an issue

– but must not sacrifice performance

(as often happens in AI, CL, ML)

General

“Search engines are a fascinating, multi-faceted field of research giving rise to a multitude of challenging algorithmic problems with a strong algorithm engineering component and of high practical relevance.”


Overview [just for myself not for the talk]

An Index for prefix search

– inverted index + our + open problem + top-k

Building such an index

– INV = sorting, HYB = semi-sorting

Error-tolerant search

– reduce to spelling variants clustering, define problem

Semantic Search

– point out entity annotation problem


Prefix Search

Show demo

– first explain prefix search

– then how to use it for faceted search

– use DBLP + show dblp.mpi-inf.mpg.de

Explain inverted index

– show for example prefix query

– point out IO-efficiency

– point out compressibility

– but quadratic worst-case complexity


Problems encountered in this project

Indexing: fast queries, succinct index, fast construction

– Index structures for advanced queries (beyond keyword search)

– How to build them fast

Learning from text: scalable, yet effective

– large-scale spelling correction

– large-scale synonymy detection

– large-scale entity annotation

Fundamental problems

– fast intersection of (sorted) sequences

– efficient (de)compression

I will explain each of these in detail in the following

algorythm → algorithm

web ≈ internet

Einstein the physicist? the physical unit? the musicologist?


Problems encountered in this project

Indexing: fast queries, succinct index, fast construction

– Index structures for advanced queries (beyond keyword search)

– How to build them fast

Learning from text: scalable, yet effective

– large-scale spelling correction

– large-scale synonymy detection

– large-scale entity annotation

Fundamental problems

– fast intersection of (sorted) sequences

– efficient (de)compression

just kidding

algorythm → algorithm

web ≈ internet

Einstein the physicist? the physical unit? the musicologist?


Problems encountered in this project

Indexing: fast queries, succinct index, fast construction

– Index structures for advanced queries (beyond keyword search)

– How to build them fast

Learning from text: scalable, yet effective

– large-scale spelling correction

– large-scale synonymy detection

– large-scale entity annotation

Fundamental problems

– fast intersection of (sorted) sequences

– efficient (de)compression

I will give you a glimpse of some of these in the following

algorythm → algorithm

web ≈ internet

Einstein the physicist? the physical unit? the musicologist?

Example: prefix search

Demo + problem definition

Demo


Overview

Part 1

– Definition of our prefix search problem

– Applications

– Demos of our search engine

Part 2

– Problem definition again

– One way to solve it

– Another way to solve it

– Your way to solve it


Part 1

Definition, Applications, Demos


Problem Definition — Formal

Context-Sensitive Prefix Search

Preprocess

– a given collection of text documents such that queries of the following kind can be processed efficiently

Given

– an arbitrary set of documents D

– and a range of words W

Compute

– all word-in-document pairs (w, d) such that w ∈ W and d ∈ D
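The definition above can be pinned down with a brute-force reference in Python that does no preprocessing at all. It is only meant to state what a correct answer is, not how to compute it efficiently. The three documents are taken from the figure on the "Problem Definition" slides; splitting "B WU K A" into single letters is an assumption, and `prefix_search_bruteforce` is a name made up for this sketch.

```python
# Documents from the example figure: doc id -> word ids in the document.
# The contents of D17 are assumed to be the single letters B, W, U, K, A.
DOCS = {
    13: ["A", "O", "E", "W", "H"],
    17: ["B", "W", "U", "K", "A"],
    88: ["P", "A", "E", "G", "Q"],
}


def prefix_search_bruteforce(docs, doc_ids, lo, hi):
    """Return all (w, d) with d in doc_ids and lo <= w <= hi, by full scan."""
    pairs = []
    for d in sorted(doc_ids):
        for w in docs[d]:
            if lo <= w <= hi:  # the word range W is an interval of word ids
                pairs.append((w, d))
    return pairs


pairs = prefix_search_bruteforce(DOCS, [13, 17, 88], "C", "G")
print(pairs)  # [('E', 13), ('E', 88), ('G', 88)]
```

The scan touches every word of every document in D, so its cost is proportional to the total size of those documents regardless of how small the output is.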


Problem Definition — Visual

Data is given as

– documents containing words

– documents have ids (D1, D2, …)

– words have ids (A, B, C, …)

Query

– given a sorted list of doc ids

– and a range of word ids

Answer

– all matching word-in-doc pairs

– with scores

– and positions

[Figure: the document collection, one box per document showing its doc id and word ids]

D98: E B A S
D78: K L S
D53: J D E A
D2: B F A
D4: K L K A B
D9: E E R
D27: K L D F
D92: P U D E M
D43: D Q
D32: I L S D H
D1: A O E W H
D88: P A E G Q
D3: Q D A
D17: B W U K A
D74: J W Q
D13: A O E W H

Query: doc ids D13 D17 D88 …, word range C D E F G

Answer:

doc   word   score   position
D13   E      0.5     5
D88   E      0.2     7
D88   G      0.7     1


Application 1: Autocompletion

After each keystroke

– display completions of the last query word that lead to the best hits, together with the best such hits

– e.g., for the query google amp display amphitheatre and the corresponding hits


Application 2: Error Correction

As before, but also …

– … display spelling variants of completions that would lead to a hit

– e.g., for the query probabilistic algorithm also consider a document containing probalistic aigorithm

Implementation

– if, say, aigorithm occurs as a misspelling of algorithm, then for every occurrence of aigorithm in the index

aigorithm Doc. 17

also add

algorithm::aigorithm Doc. 17
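A small Python sketch of this trick, under stated assumptions: the index is modelled as a sorted list of (word, doc id) postings, the misspelling table `corrections` is given, and both function names are invented for the example. After the artificial words are added, an ordinary prefix lookup for algorithm also finds the document that only contains aigorithm.

```python
from bisect import bisect_left


def add_spelling_variants(postings, corrections):
    """For each misspelled occurrence, also add 'correct::misspelled'."""
    out = list(postings)
    for word, doc in postings:
        if word in corrections:
            out.append((f"{corrections[word]}::{word}", doc))
    return sorted(out)


def prefix_matches(sorted_postings, prefix):
    """All postings whose word starts with prefix (binary search, then scan)."""
    i = bisect_left(sorted_postings, (prefix,))
    hits = []
    while i < len(sorted_postings) and sorted_postings[i][0].startswith(prefix):
        hits.append(sorted_postings[i])
        i += 1
    return hits


index = add_spelling_variants(
    [("aigorithm", 17), ("probalistic", 17)],
    {"aigorithm": "algorithm", "probalistic": "probabilistic"},
)
print(prefix_matches(index, "algorithm"))  # [('algorithm::aigorithm', 17)]
```

Because the correct spelling is a prefix of the artificial word, no change to the query processing is needed; the cost is one extra posting per misspelled occurrence.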


Application 3: Query Expansion

As before, but also …

– … display words related to completions that would lead to a hit

– e.g., for the query russia metal also consider documents containing russia aluminium

Implementation

– for, say, every occurrence of aluminium in the index

aluminium Doc. 17

also add (once for every occurrence)

s:67:aluminium Doc. 17

and (only once for the whole collection)

s:aluminium:67 Doc. 00
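The two kinds of entries can be generated as in the following Python sketch. It assumes that 67 is the id of a cluster of related words, that the table `cluster_of` mapping words to cluster ids is given, and that Doc. 00 is a special document with id 0; these readings, like the function name, are assumptions made for the illustration.

```python
def expansion_postings(postings, cluster_of):
    """Add s:<cluster>:<word> per occurrence and s:<word>:<cluster> once."""
    out = list(postings)
    seen = set()
    for word, doc in postings:
        cid = cluster_of.get(word)
        if cid is None:
            continue
        out.append((f"s:{cid}:{word}", doc))     # once for every occurrence
        if word not in seen:
            seen.add(word)
            out.append((f"s:{word}:{cid}", 0))   # once for the whole collection
    return sorted(out)


index = expansion_postings(
    [("aluminium", 17), ("metal", 23), ("aluminium", 42)],
    {"aluminium": 67, "metal": 67},
)
for entry in index:
    print(entry)
```

One plausible way to use these entries, again an assumption: the prefix s:metal: in document 0 reveals the cluster id 67, and the prefix s:67: then matches every occurrence of a related word, including aluminium in documents 17 and 42.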


Application 4: Faceted Search

As before, but also …

– … along with the completions and hits, display a breakdown of the result set by various categories

– e.g., for the query algorithm show (prominent) authors of articles containing these words

Implementation

– for, say, an article by Thomas Hofmann that appeared in NIPS 2004, add

author:Thomas_Hofmann Doc. 17
venue:NIPS Doc. 17
year:2004 Doc. 17

– also add

thomas:author:Thomas_Hofmann Doc. 17
hofmann:author:Thomas_Hofmann Doc. 17
etc.
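The entries listed above could be produced by a helper like the one below. This is an illustrative Python sketch; the function name and the metadata arguments are assumptions, only the format of the generated words follows the slide.

```python
def facet_postings(doc_id, authors, venue, year):
    """Artificial index words for one article, in the format of the slide."""
    entries = []
    for author in authors:
        facet = "author:" + author.replace(" ", "_")
        entries.append((facet, doc_id))
        # One extra word per name part, so that typing "hof" can complete
        # to hofmann:author:Thomas_Hofmann.
        for token in author.lower().split():
            entries.append((f"{token}:{facet}", doc_id))
    entries.append((f"venue:{venue}", doc_id))
    entries.append((f"year:{year}", doc_id))
    return entries


for word, doc in facet_postings(17, ["Thomas Hofmann"], "NIPS", 2004):
    print(word, "Doc.", doc)
```

With these words in the index, the breakdown by author for a query such as algorithm is just the prefix query algorithm author:*, and likewise for venue:* and year:*.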


Application 5: Semantic Search

As before, but also …

– … display “semantic” completions

– e.g., for the query beatles musician display instances of the class musician that occur together with the word beatles

Implementation

– cannot simply duplicate index entries of an entity for each category it belongs to, e.g. John Lennon is a

singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …

– tricky combination of completions and joins SIGIR’07

and still more applications …


Part 2

Solutions and Open Problem


Solution 1: Inverted Index

For example, probab* alg*

given the documents: D13, D17, D88, … (ids of hits for probab*)

and the word range : C D E F G (ids for alg*)

Iterate over all words from the given range

C (algae) D8, D23, D291, ...

D (algarve) D24, D36, D165, ...

E (algebra) D13, D24, D88, ...

F (algol) D56, D129, D251, ...

G (algorithm) D3, D15, D88, ...

Intersect each list with the given one and merge the results

D13 D88 D88 …
E   E   G   …

running time |D|∙ |W| + log |W|∙ merge volume


A General Idea

Precompute inverted lists for ranges of words

1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15

D A C A B A C A D A A B C A C A

Note

– each prefix corresponds to a word range

– ideally precompute list for each possible prefix

– too much space

– but lots of redundancy

list for A-D


Solution 2: AutoTree SPIRE’06 / JIR’07

Trick 1: Relative bit vectors

– the i-th bit of the root node corresponds to the i-th doc

– the i-th bit of any other node corresponds to the i-th set bit of its parent node

aachen-zyskowski   1111111111111…   (5th bit corresponds to doc 5)

maakeb-zyskowski   1001000111101…   (5th bit corresponds to doc 5)

maakeb-stream      1001110…         (5th bit corresponds to doc 10)
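The mapping from a bit in some node back to a document id can be sketched as follows. The Python below uses the three bit vectors from this slide (only the bits shown, without the "…" tails) and a linear-scan `select`; a real implementation would presumably use constant-time rank/select structures, and the function names are made up for the example.

```python
def select(bits, k):
    """1-based position of the k-th set bit in a string of '0'/'1'."""
    seen = 0
    for pos, bit in enumerate(bits, 1):
        if bit == "1":
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k set bits")


def doc_of_bit(path, i):
    """Doc id for the i-th bit of the last node on a root-to-node path."""
    # The i-th bit of a node corresponds to the i-th set bit of its parent,
    # so walk up the path, translating the position at each level.
    for parent in reversed(path[:-1]):
        i = select(parent, i)
    return i  # position in the root = doc id


root = "1111111111111"   # aachen-zyskowski
child = "1001000111101"  # maakeb-zyskowski
grand = "1001110"        # maakeb-stream

print(doc_of_bit([root], 5))                # 5
print(doc_of_bit([root, child], 5))         # 5
print(doc_of_bit([root, child, grand], 5))  # 10
```

The three results reproduce the annotations on the slide: the 5th bit stands for doc 5 in the first two nodes and for doc 10 in the third.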


Solution 2: AutoTree SPIRE’06 / JIR’07

Trick 2: Push up the words

– For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node

[Figure: three levels of the tree; each bit vector is shown with the words stored by its set bits]

1 1 1 1 1 1 1 1 1 1 …   aachen aachen advance algol algorithm advance aachen art advance advance

1 0 0 0 1 0 0 1 1 1 …   manner manning maximal maximal maximum

1 0 0 1 1 …   maple mazza middle

D = 5, 7, 10, W = max*

D = 5, 10 (→ 2, 5), report: maximum

D = 5, report: Ø → STOP


Solution 2: AutoTree SPIRE’06 / JIR’07

Trick 3: divide into blocks

– and build a tree over each block as shown before

Theorem:

– query processing time O(|D| + |output|)

– uses no more space than an inverted index

AutoTree Summary:

+ output-sensitive

– not IO-efficient (heavy use of bit-rank operations)

– compression not optimal

99% correlation with actual running times


Parenthesis

Despite its quadratic worst-case complexity, the inverted index is hard to beat in practice

– very simple code

– lists are highly compressible

– perfect locality of access

Number of operations is a deceptive measure

– 100 disk seeks take about half a second

– in that time can read 200 MB of contiguous data (if stored compressed)

– main memory: 100 non-local accesses ≈ reading a 10 KB data block


Solution 3: HYB SIGIR’06 / IR’07

Flat division of word range into blocks

list for A-D
1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
D A C A B A C A D A A B C A C A

list for E-J
2 2 3 3 4 4 7 7 8 8 9 9 11
E F G J H I I E F G H J I

list for K-N
1 1 2 3 4 5 6 6 6 8 9 9 9 10 10
L N M N N K L M N M K L M K L


Solution 3: HYB SIGIR’06 / IR’07

Flat division of word range into blocks

Replace doc ids by gaps and words by frequency ranks:

1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
D A C A B A C A D A A B C A C A

+1 +2 +0 +2 +0 +1 +1 +1 +0 +1 +2 +0 +0 +1 +1 +2
3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st

Encode both gaps and ranks such that x takes about log2 x bits

+0 → 0, +1 → 10, +2 → 110

1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110

10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0

An actual block of HYB
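The encoding of this block can be reproduced with a short Python sketch. Two things are assumptions here: the gap code is taken to be unary (g ones followed by a zero), which agrees with the three codes shown on the slide but may not be the code used in the actual system, and the rank code table is copied from the slide rather than derived.

```python
from collections import Counter

DOC_IDS = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9, 11, 11, 11, 12, 13, 15]
WORDS = list("DACABACADAABCACA")

# Prefix code for frequency ranks, copied from the slide.
RANK_CODE = {"A": "0", "C": "10", "D": "111", "B": "110"}


def unary(gap):
    """Assumed gap code: +0 -> 0, +1 -> 10, +2 -> 110."""
    return "1" * gap + "0"


def encode_block(doc_ids, words):
    """Return (gaps, encoded gaps, encoded words) for one HYB block."""
    gaps = [d - prev for prev, d in zip([0] + doc_ids[:-1], doc_ids)]
    gap_bits = " ".join(unary(g) for g in gaps)
    word_bits = " ".join(RANK_CODE[w] for w in words)
    return gaps, gap_bits, word_bits


# Words ordered by frequency: A (8 times), C (4), D (2), B (2).
by_rank = [w for w, _ in Counter(WORDS).most_common()]

gaps, gap_bits, word_bits = encode_block(DOC_IDS, WORDS)
print(by_rank)
print(gap_bits)
print(word_bits)
```

Frequent words and small gaps get the shortest codes, which is why a block compresses well even though it mixes the lists of several words.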


Solution 3: HYB

Flat division of word range into blocks

Theorem:

– Let n = number of documents, m = number of words

– If blocks are chosen of equal volume ~ n

– Then query time ~ n and empirical entropy H_HYB ~ (1 + ε) ∙ H_INV

1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
D A C A B A C A D A A B C A C A

SIGIR’06 / IR’07

HYB Summary:

+ IO-efficient (mere scans of data)

+ very good compression

– not output-sensitive

experimental results match perfectly


Conclusion

Context-sensitive prefix search

– core mechanism of the CompleteSearch engine

– simple enough to allow efficient realization

– powerful enough to support many advanced search features

Open problems

– solution which is both output-sensitive and IO-efficient

– implement the whole thing using MapReduce

– support yet more features

– …

Thank you!


Processing the query “beatles musician”

Gitanes

… legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …

John Lennon

0 entity:john_lennon 1 relation:is_a 2 class:musician 2 class:singer …

(the number before each word is its position)

two prefix queries

beatles entity:* → entity:john_lennon, entity:1964, entity:liverpool, etc.

entity:* . relation:is_a . class:musician → entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.

one join

entity:john_lennon, etc.
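The "two prefix queries, one join" scheme can be illustrated with a toy Python example. The two documents about John Lennon follow the slide; the Mozart ontology document is a hypothetical third entry added so that the join has something to filter out. The sketch ignores word positions, so the "." operators of the second query are treated as plain co-occurrence, which is a simplification.

```python
# Toy collection: a text document with an entity annotation, plus
# ontology documents (the Mozart one is hypothetical).
DOCS = {
    "Gitanes": ("legend says that john lennon entity:john_lennon "
                "of the beatles smoked gitanes").split(),
    "John Lennon": ["entity:john_lennon", "relation:is_a",
                    "class:musician", "class:singer"],
    "W. A. Mozart": ["entity:wolfgang_amadeus_mozart", "relation:is_a",
                     "class:musician"],
}


def completions(docs, required, prefix):
    """Words starting with prefix in docs that contain all required words."""
    found = set()
    for words in docs.values():
        if all(r in words for r in required):
            found.update(w for w in words if w.startswith(prefix))
    return found


# Prefix query 1: entities mentioned together with "beatles".
near_beatles = completions(DOCS, ["beatles"], "entity:")
# Prefix query 2: entities that are musicians according to the ontology.
musicians = completions(DOCS, ["relation:is_a", "class:musician"], "entity:")
# Join: entities that appear in both result lists.
answer = sorted(near_beatles & musicians)
print(answer)  # ['entity:john_lennon']
```

The join removes entities that satisfy only one of the two conditions, such as Mozart, who is a musician but does not occur near "beatles".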


Processing the query “beatles musician”

Problem: entity:* has a huge number of occurrences

– ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences

– prefix search efficient only for up to ≈ 1% (explanation follows)

Solution: frontier classes

– classes at “appropriate” level in the hierarchy

– e.g.: artist, believer, worker, vegetable, animal, …

Gitanes

… legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …

John Lennon

0 entity:john_lennon 1 relation:is_a 2 class:musician 2 class:singer …

(the number before each word is its position)

beatles entity:* entity:* . relation:is_a . class:musician


Processing the query “beatles musician”

first figure out: musician → artist (easy)

Gitanes

… legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …

John Lennon

0 artist:john_lennon 0 believer:john_lennon 1 relation:is_a 2 class:musician …

(the number before each word is its position)

two prefix queries

beatles artist:* → artist:john_lennon, artist:graham_greene, artist:pete_best, etc.

artist:* . relation:is_a . class:musician → artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.

one join

artist:john_lennon, etc.


INV vs. HYB — Space Consumption

Theorem: The empirical entropy of INV is

Σ_i n_i ∙ (1/ln 2 + log2(n/n_i))

Theorem: The empirical entropy of HYB with block size ε∙n is

Σ_i n_i ∙ ((1+ε)/ln 2 + log2(n/n_i))

n_i = number of documents containing i-th word, n = number of documents

            HOMEOPATHY       WIKIPEDIA        TREC .GOV
docs        44,015           2,866,503        25,204,013
words       263,817          6,700,119        25,263,176
positions   with positions   with positions   no positions
raw size    452 MB           7.4 GB           426 GB
INV         13 MB            0.48 GB          4.6 GB
HYB         14 MB            0.51 GB          4.9 GB

Nice match of theory and practice
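The two entropy formulas above are easy to evaluate numerically. The Python sketch below does so for a made-up list of document frequencies n_i (the values are illustrative only, not taken from any of the three collections), and shows that the HYB bound exceeds the INV bound by exactly ε ∙ Σ n_i / ln 2 bits.

```python
import math


def entropy_inv(n, doc_freqs):
    """Sum over words of n_i * (1/ln 2 + log2(n / n_i)), in bits."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def entropy_hyb(n, doc_freqs, eps):
    """Same sum with (1 + eps)/ln 2 in place of 1/ln 2."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni))
               for ni in doc_freqs)


# Illustrative numbers only: 1000 documents, five words.
n = 1000
doc_freqs = [500, 200, 100, 10, 1]
eps = 0.1

h_inv = entropy_inv(n, doc_freqs)
h_hyb = entropy_hyb(n, doc_freqs, eps)
print(f"INV: {h_inv:.1f} bits, HYB: {h_hyb:.1f} bits, "
      f"overhead: {100 * (h_hyb / h_inv - 1):.1f}%")
```

The overhead grows linearly with ε, which is consistent with the small gap between the INV and HYB rows in the table.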


INV vs. HYB — Query Time

Experiment: type ordinary queries from left to right

db , dbl , dblp , dblp un , dblp uni , dblp univ , dblp unive , ...

            HOMEOPATHY           WIKIPEDIA            TREC .GOV
docs        44,015               2,866,503            25,204,013
words       263,817              6,700,119            25,263,176
queries     5,732 real queries   100 random queries   50 TREC queries
proximity   with proximity       with proximity       no proximity
INV avg     0.03 secs            0.17 secs            0.58 secs
INV max     0.38 secs            2.27 secs            16.83 secs
HYB avg     0.003 secs           0.05 secs            0.11 secs
HYB max     0.06 secs            0.49 secs            0.86 secs

HYB beats INV by an order of magnitude


Engineering

Careful implementation in C++

– Experiment: sum over array of 10 million 4-byte integers (on a Linux PC with an approx. 2 GB/sec memory bandwidth)

With HYB, every query is essentially one block scan

– perfect locality of access, no sorting or merging, etc.

– balanced ratio of read, decompression, processing, etc.

Language     C++           Java         MySQL       Perl
Throughput   1800 MB/sec   300 MB/sec   16 MB/sec   2 MB/sec

Step            read   decomp.   intersect   rank   history
Share of time   21%    18%       11%         15%    35%
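The summing experiment can be repeated in any language. The Python sketch below only shows how such a throughput figure is measured; it is not the code behind the numbers in the table, Python is not one of the four languages compared there, and the function name is made up. It assumes that the `array` type code "i" uses 4 bytes per element, which is typical but platform-dependent, so the byte count is taken from `itemsize`.

```python
import time
from array import array


def sum_throughput(n):
    """Sum n machine integers and return (sum, throughput in MB/sec)."""
    data = array("i", range(n))  # contiguous block of C ints
    start = time.perf_counter()
    total = sum(data)
    elapsed = time.perf_counter() - start
    megabytes = n * data.itemsize / 1e6
    return total, megabytes / elapsed


# The slide uses 10 million integers; a smaller n keeps this example quick.
total, mb_per_sec = sum_throughput(1_000_000)
print(f"sum = {total}, throughput = {mb_per_sec:.0f} MB/sec")
```

Comparing the measured figure with the memory bandwidth of the machine shows how far a given implementation is from a pure scan.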


System Design — High Level View

Debugging such an application is hell!

Compute Server (C++)

Web Server (PHP)

User Client (JavaScript)