
Fast and Intelligent Search

In Very Large Amounts of Data

Hannah Bast

Max-Planck-Institute for Informatics

Saarbrücken

Kick-off meeting for Cluster of Excellence

Multimodal Computing and Interaction

November 13th, 2008

General theme of my group

Searching for information

Fancy and Fast, On Lots of Data

Terabytes of data, hundreds of millions of documents

Query times in a fraction of a second

Beyond Google-style keyword search

+ always open for other real-world algorithmic problems

currently: route planning in large transportation networks

Searching for Information

Problems we have recently worked on

– efficient prefix search

– efficient faceted search

– efficient error-tolerant search

– efficient semantic search

– efficient snippet generation

– efficient index construction

– efficient 3D shape retrieval

Our system: the CompleteSearch engine

– efficient

– does all of the above (not the shapes though)

There is a demo this afternoon at 2.30 pm

joint work with the graphics people

joint work with the database people

planned joint work with the CL people

planned: efficient music retrieval

Recent Output

Installations

– CompleteSearch DBLP (several million hits / month)

– www.absolventa.de uses CompleteSearch (job search)

– many more: mailing list archives, library search, …

Publications

– Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …

– Journals: IR, TWEB, TOIS, VLDB Journal, …

Awards

– Jan’08: Meyer-Struckmann Award 15,000 €

– Oct’08: Alcatel-Lucent Award 20,000 €

– big press coverage (e.g., it was on the Heise newsticker)

http://www.absolventa.de/

Faceted Search

Problem

– Data: objects with ids and labels

– Query: set of object ids

– Answer: multi-set of labels of the respective objects

– This talk: exactly one label per object

id:    1          2          3          4          5

label: year:2001  year:1997  year:2003  year:2001  year:2008

Query: I = {1, 3, 4} Answer: {year:2001, year:2003, year:2001}
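The query above can be reproduced with a few lines of Python. This is only a minimal in-memory sketch of the problem statement, not of the CompleteSearch implementation; the name `facet_counts` is made up for illustration.

```python
from collections import Counter

# Labels from the example above: object id -> label (exactly one label per object).
labels = {
    1: "year:2001",
    2: "year:1997",
    3: "year:2003",
    4: "year:2001",
    5: "year:2008",
}

def facet_counts(ids, labels):
    """Return the multi-set of labels of the given object ids as a Counter."""
    return Counter(labels[i] for i in ids)

# Query I = {1, 3, 4}
counts = facet_counts({1, 3, 4}, labels)
print(counts)
```

The counter holds year:2001 twice and year:2003 once, which is the multi-set {year:2001, year:2003, year:2001} from the slide.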

Faceted Search

Problem

– Data: objects with ids and labels

– Query: set of object ids

– Answer: multi-set of labels of the respective objects

– This talk: exactly one label per object

a1 a2 a3 a4 a5

Query: I = {1, 3, 4} Answer: {a1, a3, a4}

Trivial if labels are in an array in main memory

– but if data is on disk, we have block access to the data

– each read gives us a whole block of B labels

– we have to minimize the number of reads / IO operations

typical: B=10,000
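To make the cost model concrete, here is a small sketch that counts the reads for a query. It assumes the labels are stored in id order and that blocks are aligned (block 0 holds a1,…,aB, and so on); the function name `blocks_touched` is hypothetical.

```python
def blocks_touched(ids, B):
    """Number of distinct blocks that must be read for the given 1-based ids,
    assuming labels are stored in id order in aligned blocks of B labels."""
    return len({(i - 1) // B for i in ids})

B = 10_000  # the typical block size mentioned above

print(blocks_touched({1, 3, 4}, B))            # ids close together
print(blocks_touched({1, 10_001, 20_001}, B))  # ids spread over the array
```

With this layout, ids that lie close together share a block, while ids that are spread out cost one read each.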

IO-efficient Faceted Search

Precomputation:

– given n elements a1,…,an

– organize in array of size N ≥ n

Query:

– given I = {i_1, …, i_m} ⊆ {1, …, n}

– return elements a_{i_1}, …, a_{i_m}

using as few IOs as possible

Extreme solutions:

– space: n #IOs: min{n / B, |I|} (optimal space)

– space: B ∙ (n choose B) #IOs: |I| / B (optimal #IOs)

How much space is needed for which IO-efficiency?

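Example: n = 8, N = 24, B = 4 (three permuted copies of the array, each stored as two blocks of B = 4 elements)

a1 a2 a3 a4 | a5 a6 a7 a8

a4 a7 a5 a3 | a1 a8 a2 a6

a3 a6 a4 a2 | a7 a1 a8 a5

I = {1, 6, 8}: get a1, a6, a8 with 1 IO by reading the block a1 a8 a2 a6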
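The example can be checked by brute force. The sketch below assumes aligned blocks of B = 4 and simply tries all combinations of blocks, which is fine for a toy layout but is not the space-efficient algorithm referred to in this talk; `min_ios` is a made-up name.

```python
from itertools import combinations

B = 4
# The three permuted copies from the example, written as indices i of a_i.
layout = [1, 2, 3, 4, 5, 6, 7, 8,
          4, 7, 5, 3, 1, 8, 2, 6,
          3, 6, 4, 2, 7, 1, 8, 5]

# Aligned blocks: positions 0..3, 4..7, ...
blocks = [frozenset(layout[p:p + B]) for p in range(0, len(layout), B)]

def min_ios(query, blocks):
    """Smallest number of block reads that covers all ids in the query
    (exhaustive search over block combinations)."""
    query = set(query)
    for k in range(1, len(query) + 1):
        for chosen in combinations(blocks, k):
            if query <= set().union(*chosen):
                return k
    raise ValueError("some id in the query is not stored in any block")

print(min_ios({1, 6, 8}, blocks))
print(min_ios({1, 2, 5}, blocks))
```

The query {1, 6, 8} needs a single read, whereas {1, 2, 5} is not contained in any one block of this layout and needs two.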

???

???

A simple lower bound

Theorem:

– if we want < |I| IOs for every query I

– we need ≥ n² / (4∙B) space

Proof:

1. construct graph G with n vertices

edge {i, j} iff ai and aj can be read in one IO

m ≤ 2B ∙ N, where m is the number of edges of G

2. by assumption, every I = {i, j} can be read with 1 IO, hence edge {i, j} exists

m ≥ (n choose 2) ≈ n² / 2

The short queries alone make the problem hard
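Spelled out, the two bounds on the number of edges m combine as follows. The upper bound is read here under the assumption that one IO fetches B consecutive array cells, so that each of the N cells can share a read with fewer than 2B other cells.

```latex
\binom{n}{2} \;\le\; m \;\le\; 2B \cdot N
\quad\Longrightarrow\quad
N \;\ge\; \frac{1}{2B}\binom{n}{2} \;=\; \frac{n(n-1)}{4B} \;\approx\; \frac{n^2}{4B}
```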

Example: n = 4, N = 8, B = 2

a1 a2 a3 a4

a1 a4 a2 a3

[Figure: graph G on the vertices a1, a2, a3, a4]
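The graph from the proof can be built mechanically for this small example. The sketch assumes that the eight cells are stored as one array a1 a2 a3 a4 a1 a4 a2 a3 and that any B consecutive cells can be fetched with one IO; both are readings of the figure, not statements from the slide, and `co_readable_edges` is a hypothetical name.

```python
from itertools import combinations

def co_readable_edges(layout, B):
    """Edges {i, j} such that a_i and a_j occur together in some window
    of B consecutive array cells (one IO under the stated assumption)."""
    edges = set()
    for start in range(len(layout) - B + 1):
        window = set(layout[start:start + B])
        for i, j in combinations(sorted(window), 2):
            edges.add((i, j))
    return edges

n, B = 4, 2
layout = [1, 2, 3, 4, 1, 4, 2, 3]  # assumed order of the N = 8 cells
edges = co_readable_edges(layout, B)

# Pairs that cannot be answered with a single IO in this layout.
missing = [p for p in combinations(range(1, n + 1), 2) if p not in edges]

print(sorted(edges))
print(missing)
print(len(edges) <= 2 * B * len(layout))  # the bound m <= 2B * N
```

Under these assumptions the layout covers five of the six pairs; the query {1, 3} still needs two IOs.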

Restrict to large queries

Theorem:

– if we want < |I| IOs for all queries with |I| ≥ M

– we need ≥ n² / (4∙B∙M) space

Proof sketch:

1. construct graph G as before

m ≤ 2B ∙ N

2. Consider arbitrary I with |I| ≥ M

I not independent in G (otherwise |I| IOs necessary)

no independent set larger than M

3. Turán’s theorem implies m ≥ (n choose 2) / M
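Combining step 3 with the upper bound on m from step 1 gives the space bound of the theorem. The last line plugs in M = c∙n for a constant c, which is an illustrative choice and not taken from the slide; for such M the bound drops to about n / (4∙B∙c) and no longer rules out linear space.

```latex
\frac{1}{M}\binom{n}{2} \;\le\; m \;\le\; 2B \cdot N
\quad\Longrightarrow\quad
N \;\ge\; \frac{n(n-1)}{4BM} \;\approx\; \frac{n^2}{4BM}

M = c \cdot n
\quad\Longrightarrow\quad
\frac{n^2}{4BM} \;=\; \frac{n}{4Bc}
```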

Example: n = 4, N = 8, B = 2

a1 a2 a3 a4

a1 a4 a2 a3

[Figure: graph G on the vertices a1, a2, a3, a4]

so there is hope for queries of size linear in n

and we indeed have a space-efficient algorithm for that case

(but no time to explain it here, sorry)

Turán numbers (extremal set theory)

Definition: for n ≥ k ≥ r

T(n, k, r) = the minimal number of r-subsets of {1,…,n} such that every k-subset of {1,…,n} contains one of the r-subsets

For r = 2: minimal number of edges in an n-vertex graph, where all independent sets have size < k
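For r = 2 and very small n, the definition can be evaluated by exhaustive search. The following is a brute-force sketch meant only to illustrate the definition (it tries edge sets in order of increasing size); the function name `turan_number_r2` is made up.

```python
from itertools import combinations

def turan_number_r2(n, k):
    """T(n, k, 2) by brute force: the minimal number of edges on n vertices
    such that every k-subset of the vertices contains at least one edge."""
    if not 2 <= k <= n:
        raise ValueError("need n >= k >= 2")
    vertices = range(n)
    all_edges = list(combinations(vertices, 2))
    k_subsets = list(combinations(vertices, k))
    for size in range(len(all_edges) + 1):
        for edges in combinations(all_edges, size):
            edge_set = set(edges)
            # Every k-subset must contain an edge, i.e. no independent set of size k.
            if all(any(pair in edge_set for pair in combinations(s, 2))
                   for s in k_subsets):
                return size

print(turan_number_r2(5, 3))
print(turan_number_r2(4, 3))
```

For n = 5 and k = 3 the search returns 4 edges (a triangle plus a disjoint edge is one such graph), and for n = 4 and k = 3 it returns 2.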

Turán’s theorem:

– lim_{n→∞} T(n, k, r) / (n choose r) exists

– exact value of limit unknown for r ≥ 3

Lower bound

– T(n, k, r) ≥ (r / k)^(r-1) ∙ (n choose r)

Paul (Pál) Turán

* 1910 in Budapest

† 1976 in Budapest

Erdős number 1

Very natural application in the context of faceted search!

Route Planning

Route planning in road networks

– from a single source to a single target (point-to-point)

– weighted graph, edge costs = travel times
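As a baseline for the point-to-point problem, here is a sketch of plain Dijkstra search on a graph with travel times as edge costs. This is the textbook method, not transit node routing; the toy graph and the name `shortest_travel_time` are made up for illustration.

```python
import heapq

def shortest_travel_time(graph, source, target):
    """Dijkstra's algorithm from a single source to a single target.
    graph maps a node to a list of (neighbor, travel_time) pairs with
    non-negative travel times. Returns None if the target is unreachable."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # outdated queue entry
        for v, cost in graph.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return None

# Hypothetical toy network, travel times in minutes.
roads = {
    "A": [("B", 5), ("C", 10)],
    "B": [("C", 3)],
    "C": [],
}
print(shortest_travel_time(roads, "A", "C"))
```

On the toy network the route A, B, C with 8 minutes beats the direct edge with 10 minutes.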

Transit Node Routing

We invented transit node routing

– 100 times faster than previous best scheme

– Oct’08 SaarLB Award 25,000 €

(together with Stefan Funke, now University of Greifswald)

– integration with previous best scheme published in Science

(joint work with P. Sanders and D. Schultes, Uni Karlsruhe)

– big press coverage

– we are currently trying to market the idea

(via Algorithmic Solutions, a spin-off from MPII D1)

There is a demo this afternoon at 2.00 pm

Google Transit

I am currently @ Google in Zürich

– as “visiting scientist”

– great experience; I can highly recommend it

– one of my projects there is Google Transit

– public transportation networks are completely different from road networks

they can both be modeled as graphs

and that’s about it with the similarity

– the scale is an even bigger challenge there

one node per arrival / departure event

– will publish what I have done at the end of the year
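One common way to read "one node per arrival / departure event" is a time-expanded graph. The sketch below is an assumption about that modeling idea and says nothing about how Google Transit is implemented; the timetable, the tuple format and the name `build_event_graph` are hypothetical.

```python
def build_event_graph(trips):
    """Build a time-expanded graph from elementary connections.
    trips: list of (from_station, dep_time, to_station, arr_time), times in minutes.
    Returns (nodes, edges); each node is (station, time, kind) and each
    edge is (from_node, to_node, duration)."""
    nodes = set()
    edges = []
    # Ride edges: one departure event and one arrival event per connection.
    for src, dep, dst, arr in trips:
        dep_node = (src, dep, "dep")
        arr_node = (dst, arr, "arr")
        nodes.update([dep_node, arr_node])
        edges.append((dep_node, arr_node, arr - dep))
    # Waiting edges: from an arrival to every later departure at the same station.
    arrivals = [v for v in nodes if v[2] == "arr"]
    departures = [v for v in nodes if v[2] == "dep"]
    for a in arrivals:
        for d in departures:
            if a[0] == d[0] and d[1] >= a[1]:
                edges.append((a, d, d[1] - a[1]))
    return nodes, edges

# Hypothetical timetable with three connections.
timetable = [("A", 0, "B", 10), ("B", 15, "C", 30), ("B", 5, "C", 20)]
nodes, edges = build_event_graph(timetable)
print(len(nodes), len(edges))
```

Three connections already give six event nodes, which hints at why the scale is a bigger challenge than in road networks, where a junction is a single node.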

Thank you!

http://www.google.com/transit

Precomputation of the transit nodes

From distances to paths

[Figure: example route from start to destination; travel times 24 min, 20 min, 23 min]

Overview

How I work

Information retrieval

– overview of problems & results

– our CompleteSearch engine

– recent result: faceted search

Route planning

– ultrafast routing in road networks

– public transportation routing @ Google

Recent Output

Installations

– CompleteSearch DBLP (several million hits / month)

– www.absolventa.de uses CompleteSearch (job search)

– many more: mailing list archives, library search, …

Publications

– Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …

– Journals: IR, TWEB, TOIS, VLDB Journal, …

Awards

– Jan’08: Meyer-Struckmann Award 15,000 €

– Oct’08: Alcatel-Lucent Award 20,000 €

– Jul’09 : ...... 25,000 €

How I work

I grew up in theoretical computer science

– well-defined, standard problems

– the goal is theorems

– the more difficult / original, the better

– often art for art’s sake

– good to learn the art of clear & precise thinking

Then I moved to more applied problems

– work starts with a real problem

– finding the right abstraction is half of the challenge

– think about it, but keep in mind the real problem

– implement + experiment

– build a system and use it / let it be used

necessity is the mother of all inventions
