Fast and Intelligent Search
In Very Large Amounts of Data
Hannah Bast
Max-Planck-Institute for Informatics
Saarbrücken
Kick-off meeting for the Cluster of Excellence
Multimodal Computing and Interaction
November 13th, 2008
General theme of my group
Searching for information
Fancy and Fast, On Lots of Data
Terabytes of data, hundreds of millions of documents
Query times in a fraction of a second
Beyond Google-style keyword search
+ always open for other real-world algorithmic problems
currently: route planning in large transportation networks
Searching for Information
Problems we have recently worked on
– efficient prefix search
– efficient faceted search
– efficient error-tolerant search
– efficient semantic search
– efficient snippet generation
– efficient index construction
– efficient 3D shape retrieval
Our system: the CompleteSearch engine
– efficient
– does all of the above (not the shapes though)
There is a demo this afternoon at 2.30 pm
joint work with the graphics people
joint work with the database people
planned joint work with the CL people
planned: efficient music retrieval
Recent Output
Installations
– CompleteSearch DBLP (several million hits / month)
– www.absolventa.de uses CompleteSearch (job search)
– many more: mailing list archives, library search, …
Publications
– Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
– Journals: IR, TWEB, TOIS, VLDB Journal, …
Awards
– Jan’08: Meyer-Struckmann Award 15,000 €
– Oct’08: Alcatel-Lucent Award 20,000 €
– big press coverage (e.g., it was on the Heise newsticker)
Faceted Search
Problem
– Data: objects with ids and labels
– Query: set of object ids
– Answer: multi-set of labels of the respective objects
– This talk: exactly one label per object
id:    1          2          3          4          5
label: year:2001  year:1997  year:2003  year:2001  year:2008
Query: I = {1, 3, 4} Answer: {year:2001, year:2003, year:2001}
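When all labels fit in main memory, the query above is a plain lookup. A minimal sketch, using the five example objects from this slide; the function name `facet_counts` is made up for illustration and is not taken from CompleteSearch:

```python
from collections import Counter

# Example data from the slide: object id -> label (exactly one label per object).
LABELS = {1: "year:2001", 2: "year:1997", 3: "year:2003", 4: "year:2001", 5: "year:2008"}


def facet_counts(labels, query_ids):
    """Return the multi-set of labels of the queried objects as a Counter."""
    return Counter(labels[i] for i in query_ids)


print(facet_counts(LABELS, {1, 3, 4}))
```

For I = {1, 3, 4} the counter holds year:2001 twice and year:2003 once, which is the multi-set given in the answer.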
a1 a2 a3 a4 a5
Query: I = {1, 3, 4} Answer: {a1, a3, a4}
Trivial if labels are in an array in main memory
– but if data is on disk, we have block access to the data
– each read gives us a whole block of B labels
– we have to minimize the number of reads / IO operations
typical: B=10,000
IO-efficient Faceted Search
Precomputation:
– given n elements a1,…,an
– organize in array of size N ≥ n
Query:
– given I = {i1,…, im} ⊆ {1,…,n}
– return elements a_{i1},…, a_{im}
using as few IOs as possible
Extreme solutions:
– space: n #IOs: min{n / B, |I|} (optimal space)
– space: B ∙ (n choose B) #IOs: |I| / B (optimal #IOs)
How much space is needed for which IO-efficiency?
Example: n = 8, N = 24, B = 4
a1 a2 a3 a4 a5 a6 a7 a8
a4 a7 a5 a3 a1 a8 a2 a6
a3 a6 a4 a2 a7 a1 a8 a5
I = {1, 6, 8}: get a1, a6, a8 with 1 IO by reading the block a1 a8 a2 a6
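The example above can be checked mechanically. The sketch below assumes that one IO may read any B consecutive cells of the array (the slides do not say whether blocks must be aligned); the function name `single_io_window` is hypothetical.

```python
def single_io_window(layout, block_size, query_ids):
    """Return the 0-based start of a window of block_size consecutive cells
    that contains every queried id, or None if no such window exists.

    Assumption: a single IO can read any block_size consecutive cells.
    """
    wanted = set(query_ids)
    for start in range(len(layout) - block_size + 1):
        if wanted <= set(layout[start:start + block_size]):
            return start
    return None


# The array of size N = 24 from the example (numbers stand for a1..a8).
LAYOUT = [1, 2, 3, 4, 5, 6, 7, 8,
          4, 7, 5, 3, 1, 8, 2, 6,
          3, 6, 4, 2, 7, 1, 8, 5]

start = single_io_window(LAYOUT, 4, {1, 6, 8})
print(start, LAYOUT[start:start + 4])
```

With only the first eight cells (space n) the same query has no such window, which is why the redundant copies are stored.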
A simple lower bound
Theorem:
– if we want < |I| IOs for every query I with |I| ≥ 2
– we need ≥ n^2 / (4∙B) space
Proof:
1. construct graph G with n vertices
edge {i, j} iff ai and aj can be read in one IO
m ≤ 2B ∙ N
2. by assumption, every I = {i, j} can be
read with 1 IO, hence edge {i, j} exists
m ≥ (n choose 2) ≈ n^2 / 2
3. combining both: 2B ∙ N ≥ m ≥ n^2 / 2, so N ≥ n^2 / (4∙B)
The short queries alone make the problem hard
Example: n = 4, N = 8, B = 2
a1 a2 a3 a4
a1 a4 a2 a3
[Figure: the graph G on the vertices a1, a2, a3, a4]
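The graph G from the proof can be built directly from a layout. The sketch below again assumes that one IO reads any B consecutive cells; `co_readable_pairs` is a name made up for this illustration. It is run on the small example with n = 4, N = 8, B = 2.

```python
from itertools import combinations
from math import comb


def co_readable_pairs(layout, block_size):
    """Edges {i, j} of G: ids that occur together in some window of
    block_size consecutive cells (assumed to be readable with one IO)."""
    edges = set()
    for start in range(len(layout) - block_size + 1):
        window = set(layout[start:start + block_size])
        edges.update(frozenset(pair) for pair in combinations(sorted(window), 2))
    return edges


LAYOUT = [1, 2, 3, 4, 1, 4, 2, 3]  # numbers stand for a1..a4
B = 2
edges = co_readable_pairs(LAYOUT, B)
missing = [pair for pair in combinations(range(1, 5), 2)
           if frozenset(pair) not in edges]

print(len(edges), "edges; upper bound 2*B*N =", 2 * B * len(LAYOUT))
print("edges needed for all pairs:", comb(4, 2))
print("pairs that still need 2 IOs:", missing)
```

Under this assumption the layout misses one of the six pairs, so it does not yet answer every two-element query with a single IO.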
Restrict to large queries
Theorem:
– if we want < |I| IOs for all queries with |I| ≥ M
– we need ≥ n^2 / (4∙B∙M) space
Proof sketch:
1. construct graph G as before
m ≤ 2B ∙ N
2. Consider arbitrary I with |I| ≥ M
I not independent in G (otherwise |I| IOs necessary)
no independent set of size ≥ M
3. Turán’s theorem implies m ≥ (n choose 2) / M
4. combining: 2B ∙ N ≥ m ≥ (n choose 2) / M ≈ n^2 / (2∙M), so N ≥ n^2 / (4∙B∙M)
Example as before: n = 4, N = 8, B = 2
so there is hope for queries of size linear in n
and we indeed have a space-efficient algorithm for that case
(but no time to explain it here, sorry)
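To get a feeling for the two bounds, they can be evaluated numerically. In the sketch below, B = 10,000 is the typical block size mentioned earlier; n = 10^8 and the values of M are illustrative assumptions, not figures from the talk. With M = 1 the formula coincides with the bound of the first theorem.

```python
def space_lower_bound(n, block_size, min_query_size=1):
    """Evaluate the lower bound n^2 / (4*B*M) on the array size N."""
    return n * n / (4 * block_size * min_query_size)


n, B = 10**8, 10**4  # n is an illustrative assumption
for M in (1, 10**3, 10**6):
    bound = space_lower_bound(n, B, M)
    print(f"M = {M:>7}: N >= {bound:.3g}  ({bound / n:.3g} x n)")
```

Once M is a constant fraction of n (here 1 %), the bound falls below n and no longer rules out linear space, which is the hope expressed above.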
Turán numbers (extremal set theory)
Definition: for n ≥ k ≥ r
T(n, k, r) = the minimal number of r-subsets of {1,…,n} such that every k-subset of {1,…,n} contains one of the r-subsets
For r = 2: minimal number of edges in an n-vertex graph, where all independent sets have size < k
Turán’s theorem:
– lim_{n→∞} T(n, k, r) / (n choose r) exists
– exact value of limit unknown for r ≥ 3
Lower bound
– T(n, k, r) ≥ (r / k)^(r-1) ∙ (n choose r)
Paul (Pál) Turán
*1910 in Budapest
†1976 in Budapest
Erdős number 1
Very natural application in the context of faceted search!
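For r = 2 the Turán number is known exactly: by Turán's theorem, applied to the complement graph, the fewest edges are achieved by a disjoint union of k - 1 cliques of nearly equal size. A minimal sketch of that count (the function name `turan_min_edges` is made up here):

```python
from math import comb


def turan_min_edges(n, k):
    """T(n, k, 2): fewest edges in an n-vertex graph whose independent
    sets all have size < k.

    The optimum is a disjoint union of k - 1 cliques whose sizes differ
    by at most one (complement of the Turán graph).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    q, rem = divmod(n, k - 1)
    # rem cliques of size q + 1, the remaining k - 1 - rem cliques of size q
    return rem * comb(q + 1, 2) + (k - 1 - rem) * comb(q, 2)


for n, k in [(4, 3), (6, 3), (7, 4)]:
    print(n, k, turan_min_edges(n, k))
```

In the faceted-search setting, k plays the role of the query size M and the vertices are the n elements.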
Route Planning
Route planning in road networks
– from a single source to a single target (point-to-point)
– weighted graph, edge costs = travel times
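As a point of reference for this problem statement, here is a minimal sketch of a point-to-point query with plain Dijkstra. This is the textbook baseline, not the transit node routing scheme; the toy graph and the name `shortest_travel_time` are assumptions for illustration.

```python
import heapq


def shortest_travel_time(graph, source, target):
    """Dijkstra's algorithm; graph maps node -> list of (neighbor, travel_time).

    Returns the minimal total travel time, or None if target is unreachable.
    """
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was found already
        for neighbor, travel_time in graph.get(node, []):
            candidate = dist + travel_time
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return None


# Hypothetical toy road network; edge costs are travel times in minutes.
ROADS = {
    "A": [("B", 7), ("C", 3)],
    "B": [("D", 4)],
    "C": [("B", 2), ("D", 8)],
    "D": [],
}

print(shortest_travel_time(ROADS, "A", "D"))
```

The search stops as soon as the target is taken from the queue, which is what makes it a point-to-point query rather than a full single-source computation.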
Transit Node Routing
We invented transit node routing
– 100 times faster than previous best scheme
– Oct’08: SaarLB Award 25,000 €
(together with Stefan Funke, now University of Greifswald)
– integration with previous best scheme published in Science
(joint work with P. Sanders and D. Schultes, Uni Karlsruhe)
– big press coverage
– we are currently trying to market the idea
(via Algorithmic Solutions, a spin-off from MPII D1)
There is a demo this afternoon at 2.00 pm
Google Transit
I am currently @ Google in Zürich
– as “visiting scientist”
– great experience; I can highly recommend it
– one of my projects there is Google Transit
– public transportation networks are completely different from road networks
they can both be modeled as graphs
and that’s about it with the similarity
– the scale is an even bigger challenge there
one node per arrival / departure event
– will publish what I have done at the end of the year
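One common way to read "one node per arrival / departure event" is the time-expanded graph model. The sketch below is an illustration of that model only, under the assumption of a tiny made-up timetable; it says nothing about how Google Transit is actually implemented.

```python
# Hypothetical timetable: (from_station, departure_minute, to_station, arrival_minute)
CONNECTIONS = [
    ("A", 480, "B", 495),
    ("A", 510, "B", 525),
    ("B", 500, "C", 520),
]


def build_time_expanded_graph(connections):
    """Return {event: [(next_event, minutes), ...]} where an event is a
    (station, minute) pair, one per arrival or departure."""
    graph = {}
    events_at = {}
    for dep_station, dep, arr_station, arr in connections:
        # ride edge from the departure event to the arrival event
        graph.setdefault((dep_station, dep), []).append(((arr_station, arr), arr - dep))
        graph.setdefault((arr_station, arr), [])
        events_at.setdefault(dep_station, set()).add(dep)
        events_at.setdefault(arr_station, set()).add(arr)
    for station, times in events_at.items():
        ordered = sorted(times)
        # waiting edges between consecutive events at the same station
        for earlier, later in zip(ordered, ordered[1:]):
            graph[(station, earlier)].append(((station, later), later - earlier))
    return graph


graph = build_time_expanded_graph(CONNECTIONS)
print(len(graph), "event nodes,", sum(len(out) for out in graph.values()), "edges")
```

Even three connections between three stations produce six nodes, which hints at why the scale is a bigger challenge than in road networks.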
Thank you!
Precomputation of the transit nodes
From distances to paths
[Figure: routes from start to destination with travel times of 24 min, 20 min, and 23 min]
Overview
How I work
Information retrieval
– overview of problems & results
– our CompleteSearch engine
– recent result: faceted search
Route planning
– ultrafast routing in road networks
– public transportation routing @ Google
Recent Output
Installations
– CompleteSearch DBLP (several million hits / month)
– www.absolventa.de uses CompleteSearch (job search)
– many more: mailing list archives, library search, …
Publications
– Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
– Journals: IR, TWEB, TOIS, VLDB Journal, …
Awards
– Jan’08: Meyer-Struckmann Award 15,000 €
– Oct’08: Alcatel-Lucent Award 20,000 €
– Jul’09 : ...... 25,000 €
How I work
I grew up in theoretical computer science
– well-defined, standard problems
– the goal is to prove theorems
– the more difficult / original, the better
– often art for art’s sake
– good to learn the art of clear & precise thinking
Then I moved to more applied problems
– work starts with a real problem
– finding the right abstraction is half of the challenge
– think about it, but keep in mind the real problem
– implement + experiment
– build a system and use it / let it be used
necessity is the mother of all inventions