TRANSCRIPT
Efficient Search in Very Large Text Collections, Databases, and
Ontologies
Holger Bast
Max-Planck-Institut für Informatik, Saarbrücken, Germany
DFG Priority Programme “Algorithm Engineering”, Kickoff Meeting in Karlsruhe, December 2–3, 2007
General theme of this project
Search engines
– large variety of challenging algorithmic problems with high practical relevance
– algorithm engineering is absolutely essential
Focus on scalability
– terabytes of data, hundreds of millions of documents
– query times in a fraction of a second
Focus on advanced queries
– beyond Google-style keyword search
– but still as efficient in time and space
Fancy Searches, yet Fast
efficiency is often a secondary issue in DB, AI, CL, or ML research
Problems encountered in this project
Indexing: fast queries, succinct index, fast construction
– Index structures for advanced queries (beyond keyword search)
– How to build them fast
Learning from text: scalable, yet effective
– large-scale spelling correction
– large-scale synonymy detection
– large-scale entity annotation
“Basic Toolbox” (for search)
– fast intersection of (sorted) sequences
– efficient (de)compression
I will give a few glimpses in the following
algorythm → algorithm
web ≈ internet
Einstein the physicist? the physical unit? the musicologist?
possible synergies with Peter Sanders’ project
Prefix Completion
Fundamental search problem
– definition on next slide
– many notoriously difficult search problems can be reduced to it
– for example, faceted search:
for, say, an article by Peter Sanders that appeared in WEA 2007, add
author:Peter_Sanders Doc. 17
venue:WEA Doc. 17
year:2007 Doc. 17
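The trick above can be sketched in a few lines. This is only an illustration, not the actual engine code: the dict-based index layout and the helper names (`add_facets`, `prefix_hits`) are assumptions for the sketch.

```python
# Sketch: fold facet metadata into the index as artificial words, so that
# faceted search reduces to prefix search. A plain dict from word to a
# sorted doc-id list stands in for the real index (an assumption).

def add_facets(index, doc_id, author, venue, year):
    """Add artificial words like 'author:Peter_Sanders' for one document."""
    for word in (f"author:{author.replace(' ', '_')}",
                 f"venue:{venue}",
                 f"year:{year}"):
        index.setdefault(word, []).append(doc_id)

def prefix_hits(index, prefix):
    """All (word, doc) pairs whose word starts with the given prefix."""
    return sorted((w, d) for w, docs in index.items()
                  if w.startswith(prefix) for d in docs)

index = {}
add_facets(index, 17, "Peter Sanders", "WEA", 2007)
# A prefix query 'author:' now yields the author facet of the result set:
print(prefix_hits(index, "author:"))   # [('author:Peter_Sanders', 17)]
```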
Prefix Completion — Problem Definition
[Figure: a collection of documents, each shown with its word ids, e.g. D98: E B A S; D78: K L S; D53: J D E A; D2: B F A; …; query doc ids D13 D17 D88 … and word-id range C D E F G]
Data is given as
– documents containing words
– documents have ids (D1, D2, …)
– words have ids (A, B, C, …)
Query
– given a sorted list of doc ids
– and a range of word ids
Answer
– all matching word-in-doc pairs
– with scores
– and positions
[Figure: answer pairs, e.g. (D13, E) with scores/positions 0.5 0.2 0.7; (D88, E); …]
Prefix Completion — via the Inverted Index
For example, algor* eng*
given the documents: D13, D17, D88, … (ids of hits for algor*)
and the word range : C D E F G (ids for eng*)
Iterate over all words from the given range
C (engage) D8, D23, D291, ...
D (engel) D24, D36, D165, ...
E (engine) D13, D24, D88, ...
F (engines) D56, D129, D251, ...
G (engineering) D3, D15, D88, ...
Intersect each list with the given one and merge the results
result pairs: (D13, E), (D88, E), (D88, G), …
running time ~ |D| ∙ |W| + log |W| ∙ (merge volume)
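The iterate-intersect-merge procedure above can be sketched as follows. The toy index layout (a dict from word id to a sorted doc-id list) is an assumption for illustration, not the engine's data structure.

```python
# Sketch of prefix completion via the inverted index: for each word id in
# the given range, intersect its inverted list with the given doc list,
# then merge the per-word results into one doc-ordered list of pairs.

import heapq

def prefix_complete(inverted, doc_ids, word_range):
    """Return matching (doc, word) pairs, sorted by doc id."""
    given = set(doc_ids)                      # O(1) membership tests
    per_word = []
    for w in word_range:                      # |W| list intersections
        hits = [(d, w) for d in inverted.get(w, []) if d in given]
        per_word.append(hits)
    return list(heapq.merge(*per_word))       # k-way merge of sorted runs

inverted = {
    "C": ["D23", "D291", "D8"], "D": ["D165", "D24", "D36"],
    "E": ["D13", "D24", "D88"], "F": ["D129", "D251", "D56"],
    "G": ["D15", "D3", "D88"],
}
print(prefix_complete(inverted, ["D13", "D17", "D88"], "CDEFG"))
# → [('D13', 'E'), ('D88', 'E'), ('D88', 'G')]
```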
Prefix Completion — Status Quo & Problems
The inverted index
– highly compressible
– perfect locality of access (T operations → T / block-size I/Os)
– but quadratic worst-case complexity
AutoTree [Bast, Weber, Mortensen, SPIRE’06]
– output-sensitive (query time linear in size of output)
– but poor locality of access (heavy use of bit rank operations)
The half-inverted index [Bast, Weber, SIGIR’06]
– highly compressible + perfect locality of access
– query time linear in the number of docs, with small constant
Major open problem: output-sensitive and IO-efficient
Note: time for 100 disk seeks = time for reading 200 MB of compressed data
99% correlation with actual running times; perfect prediction of time & space consumption
Error-Tolerant Search
With prefix search available, reduces to the following
– Problem: Given a set of distinct words (lexicon), find all clusters of words that are spelling variants of each other
algorithm, algorytm, alogrithm
logarithm, logaythm
machine, mahcine, maschine
Challenges
– find appropriate measure of distance between words
– algorithm that scales in theory as well as in practice
Master's thesis of Marjan Celikik (talk on Wednesday)
possible synergies with Ernst Mayr’s project
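For the clustering problem above, a naive baseline can be written down directly; it is a sketch of the problem statement, not the thesis' scalable algorithm, and the choice of plain Levenshtein distance with threshold 2 is an assumption.

```python
# Naive baseline for spelling-variant clustering (quadratic in the
# lexicon size, so illustration only): put two words into the same
# cluster if their edit distance is at most 2, via union-find.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cluster(lexicon, max_dist=2):
    parent = {w: w for w in lexicon}
    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w
    for i, a in enumerate(lexicon):           # all pairs: O(n^2) distances
        for b in lexicon[i + 1:]:
            if edit_distance(a, b) <= max_dist:
                parent[find(a)] = find(b)
    groups = {}
    for w in lexicon:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())

print(cluster(["machine", "mahcine", "maschine", "algorithm", "algorytm"]))
# → [['algorithm', 'algorytm'], ['machine', 'mahcine', 'maschine']]
```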
Semantic Search — Problems
Problem 1: how to index
– previous engines built on top of DBMS (e.g., Oracle)
– DBMSs are hard to control (opposite of algorithm engineering)
– ongoing work: reduction to prefix search and join
Problem 2: integrate an ontology
– relate words / phrases in text to entities from ontology
– no time for deep parsing, reasoning etc.
– learn from neighboring words
– numerous algorithmic and engineering problems to make it scale to something like Wikipedia (> 10,000,000,000 words)
(DBMS = Database Management System)
Semantic Search — Entity Recognition
Recognize entities by looking at neighboring words
Quantum inequalities
Einstein's theory of General Relativity amounts to a description …
Albert Einstein, the physicist
is a: physicist, mathematician, vegetarian, person, entity, …
born in: 1879
Violin Sonata No. 5
…, according to Einstein's Mozart: His Character, His Work.
Alfred Einstein, the musicologist
is a: musicologist, scholar, intellectual, person, entity, …
born in: 1880
Software
Enhance our prototype
– improve source code, documentation, …
– integrate our results into the system
Make available to others
– public demonstrators
– as a platform for experimentation
– as a fancy search engine construction toolkit
Thank you!
General theme of this project
Project title
Efficient Search in Very Large Text Collections, Databases, and Ontologies
In short
Fancy searches, yet fast
– advanced search, yet highly scalable
– quality is an issue
– but must not sacrifice performance
(as often happens in AI, CL, ML)
General
“Search engines are a fascinating, multi-faceted field of research giving rise to a multitude of challenging algorithmic problems with a strong algorithm engineering component and of high practical relevance.”
Overview [just for myself not for the talk]
An Index for prefix search
– inverted index + our + open problem + top-k
Building such an index
– INV = sorting, HYB = semi-sorting
Error-tolerant search
– reduce to spelling variants clustering, define problem
Semantic Search
– point out entity annotation problem
Prefix Search
Show demo
– first explain prefix search
– then how to use it for faceted search
– use DBLP + show dblp.mpi-inf.mpg.de
Explain inverted index
– show for example prefix query
– point out IO-efficiency
– point out compressibility
– but quadratic worst-case complexity
Problems encountered in this project
Indexing: fast queries, succinct index, fast construction
– Index structures for advanced queries (beyond keyword search)
– How to build them fast
Learning from text: scalable, yet effective
– large-scale spelling correction
– large-scale synonymy detection
– large-scale entity annotation
Fundamental problems
– fast intersection of (sorted) sequences
– efficient (de)compression
I will explain each of these in detail in the following
algorythm → algorithm
web ≈ internet
Einstein the physicist? the physical unit? the musicologist?
just kidding
I will give you a glimpse of some of these in the following
Example: prefix search
Demo + problem definition
Demo
Overview
Part 1
– Definition of our prefix search problem
– Applications
– Demos of our search engine
Part 2
– Problem definition again
– One way to solve it
– Another way to solve it
– Your way to solve it
Part 1
Definition, Applications, Demos
Problem Definition — Formal
Context-Sensitive Prefix Search
Preprocess
– a given collection of text documents such that queries of the following kind can be processed efficiently
Given
– an arbitrary set of documents D
– and a range of words W
Compute
– all word-in-document pairs (w, d) such that w ∈ W and d ∈ D
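The definition can also be stated as executable code. The following brute-force scan is a specification-as-code only (far too slow in practice); the dict-based collection layout is an assumption for illustration.

```python
# Brute-force reference for context-sensitive prefix search: scan the
# documents in D and report every contained word whose id falls into the
# range W. Correct but linear in the total size of the documents in D.

def prefix_search(collection, D, W):
    """collection: dict doc_id -> list of word ids; W: (low, high) range."""
    low, high = W
    return [(w, d) for d in D for w in collection.get(d, ())
            if low <= w <= high]

collection = {"D1": ["A", "O", "E"], "D13": ["A", "O", "E", "W", "H"],
              "D17": ["B", "W", "U", "K", "A"]}
print(prefix_search(collection, ["D13", "D17"], ("C", "G")))  # [('E', 'D13')]
```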
Problem Definition — Visual
[Figure: a collection of documents, each shown with its word ids, e.g. D98: E B A S; D78: K L S; D53: J D E A; D2: B F A; …; query doc ids D13 D17 D88 … and word-id range C D E F G]
Data is given as
– documents containing words
– documents have ids (D1, D2, …)
– words have ids (A, B, C, …)
Query
– given a sorted list of doc ids
– and a range of word ids
Answer
– all matching word-in-doc pairs
– with scores
– and positions
[Figure: answer pairs, e.g. (D13, E) with scores/positions 0.5 0.2 0.7; (D88, E); …]
Application 1: Autocompletion
After each keystroke
– display completions of the last query word that lead to the best hits, together with the best such hits
– e.g., for the query google amp display amphitheatre and the corresponding hits
Application 2: Error Correction
As before, but also …
– … display spelling variants of completions that would lead to a hit
– e.g., for the query probabilistic algorithm also consider a document containing probalistic aigorithm
Implementation
– if, say, aigorithm occurs as a misspelling of algorithm, then for every occurrence of aigorithm in the index
aigorithm Doc. 17
also add
algorithm::aigorithm Doc. 17
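The augmentation above can be sketched as follows. The variant table and the dict-based index are assumptions for illustration, not the engine's actual data structures.

```python
# Sketch of the error-correction trick: for each occurrence of a known
# misspelling, also index the artificial word "correct::misspelling", so
# a prefix query for the correct word retrieves misspelled occurrences too.

VARIANTS = {"aigorithm": "algorithm"}   # misspelling -> correct word

def index_occurrence(index, word, doc_id):
    index.setdefault(word, []).append(doc_id)
    correct = VARIANTS.get(word)
    if correct is not None:             # also add algorithm::aigorithm
        index.setdefault(f"{correct}::{word}", []).append(doc_id)

index = {}
index_occurrence(index, "aigorithm", 17)
# A prefix query for "algorithm" now also finds Doc. 17:
hits = [d for w, docs in index.items()
        if w.startswith("algorithm") for d in docs]
print(hits)   # [17]
```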
Application 3: Query Expansion
As before, but also …
– … display words related to completions that would lead to a hit
– e.g., for the query russia metal also consider documents containing russia aluminium
Implementation
– for, say, every occurrence of aluminium in the index
aluminium Doc. 17
also add (once for every occurrence)
s:67:aluminium Doc. 17
and (once for the whole collection)
s:aluminium:67 Doc. 00
Application 4: Faceted Search
As before, but also …
– … along with the completions and hits, display a breakdown of the result set by various categories
– e.g., for the query algorithm show (prominent) authors of articles containing these words
Implementation
– for, say, an article by Thomas Hofmann that appeared in NIPS 2004, add
author:Thomas_Hofmann Doc. 17
venue:NIPS Doc. 17
year:2004 Doc. 17
– also add
thomas:author:Thomas_Hofmann Doc. 17
hofmann:author:Thomas_Hofmann Doc. 17
etc.
Application 5: Semantic Search
As before, but also …
– … display “semantic” completions
– e.g., for the query beatles musician display instances of the class musician that occur together with the word beatles
Implementation
– cannot simply duplicate index entries of an entity for each category it belongs to, e.g. John Lennon is a
singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …
– tricky combination of completions and joins SIGIR’07
and still more applications …
Part 2
Solutions and Open Problem
Solution 1: Inverted Index
For example, probab* alg*
given the documents: D13, D17, D88, … (ids of hits for probab*)
and the word range : C D E F G (ids for alg*)
Iterate over all words from the given range
C (algae) D8, D23, D291, ...
D (algarve) D24, D36, D165, ...
E (algebra) D13, D24, D88, ...
F (algol) D56, D129, D251, ...
G (algorithm) D3, D15, D88, ...
Intersect each list with the given one and merge the results
result pairs: (D13, E), (D88, E), (D88, G), …
running time ~ |D| ∙ |W| + log |W| ∙ (merge volume)
A General Idea
Precompute inverted lists for ranges of words
doc ids: 1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words:   D A C A B A C A D A A B C A C A
Note
– each prefix corresponds to a word range
– ideally precompute list for each possible prefix
– too much space
– but lots of redundancy
list for A–D
Solution 2: AutoTree SPIRE’06 / JIR’07
Trick 1: Relative bit vectors
– the i-th bit of the root node corresponds to the i-th doc
– the i-th bit of any other node corresponds to the i-th set bit of its parent node
aachen-zyskowski: 1111111111111…
maakeb-zyskowski: 1001000111101…
maakeb-stream: 1001110…
(arrows in the figure mark bits that correspond to doc 5 and doc 10)
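Trick 1 can be sketched in a few lines; plain Python lists of 0/1 stand in for the engine's packed bit vectors with constant-time rank/select (an assumption for the sketch).

```python
# Sketch of relative bit vectors: the i-th bit of a node corresponds to
# the i-th set bit of its parent, and the i-th bit of the root to the
# i-th document. Resolving a bit to a doc id walks up via "select".

def select(bits, i):
    """Position of the i-th (0-based) set bit in `bits`."""
    count = -1
    for pos, bit in enumerate(bits):
        count += bit
        if count == i:
            return pos
    raise IndexError(f"fewer than {i + 1} set bits")

def doc_of_bit(chain, i):
    """Resolve bit i of the last node in `chain` (root first) to a doc index."""
    for parent in reversed(chain[:-1]):   # climb from the node's parent to root
        i = select(parent, i)
    return i

root  = [1] * 10                          # one bit per document
child = [1, 0, 0, 0, 1, 0, 0, 1, 1, 1]   # one bit per set bit of root
grand = [1, 0, 0, 1, 1]                   # one bit per set bit of child
print(doc_of_bit([root, child, grand], 3))   # bit 3 of `grand` → doc 8
```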
Solution 2: AutoTree SPIRE’06 / JIR’07
Trick 2: Push up the words
– For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node
[Figure: tree of relative bit vectors; by each set bit one word is stored, e.g. aachen, advance, algol, algorithm, art, manner, manning, maximal, maximum, maple, …]
D = 5, 7, 10; W = max*
D = 5, 10 (→ 2, 5); report: maximum
D = 5; report: Ø → STOP
Solution 2: AutoTree SPIRE’06 / JIR’07
Trick 3: divide into blocks
– and build a tree over each block as shown before
Theorem:
– query processing time O(|D| + |output|)
– uses no more space than an inverted index
AutoTree Summary:
+ output-sensitive
– not IO-efficient (heavy use of bit-rank operations)
– compression not optimal
(99% correlation with actual running times)
Parenthesis
Despite its quadratic worst-case complexity, the inverted index is hard to beat in practice
– very simple code
– lists are highly compressible
– perfect locality of access
Number of operations is a deceptive measure
– 100 disk seeks take about half a second
– in that time can read 200 MB of contiguous data (if stored compressed)
– main memory: 100 non-local accesses ≈ scanning a 10 KB data block
Solution 3: HYB
Flat division of word range into blocks
doc ids: 1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words:   D A C A B A C A D A A B C A C A
(list for A–D)
SIGIR’06 / IR’07
doc ids: 2 2 3 3 4 4 7 7 8 8 9 9 11
words:   E F G J H I I E F G H J I
(list for E–J)
doc ids: 1 1 2 3 4 5 6 6 6 8 9 9 9 10 10
words:   L N M N N K L M N M K L M K L
(list for K–N)
Solution 3: HYB
Flat division of word range into blocks
Replace doc ids by gaps and words by frequency ranks:
doc ids: 1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words:   D A C A B A C A D A A B C A C A
gaps:    +1 +2 +0 +2 +0 +1 +1 +1 +0 +1 +2 +0 +0 +1 +1 +2
ranks:   3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st
Encode both gaps and ranks such that value x takes ~ log2 x bits
gap codes:  +0 → 0, +1 → 10, +2 → 110
rank codes: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110
encoded gaps:  10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
encoded ranks: 111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
An actual block of HYB
SIGIR’06 / IR’07
Solution 3: HYB
Flat division of word range into blocks
Theorem:
– Let n = number of documents, m = number of words
– If blocks are chosen of equal volume ~ n
– Then query time ~ n and empirical entropy H_HYB ~ (1 + ε) ∙ H_INV
SIGIR’06 / IR’07
HYB Summary:
+ IO-efficient (mere scans of data)
+ very good compression
– not output-sensitive
experimental results match perfectly
Conclusion
Context-sensitive prefix search
– core mechanism of the CompleteSearch engine
– simple enough to allow efficient realization
– powerful enough to support many advanced search features
Open problems
– solution which is both output-sensitive and IO-efficient
– implement the whole thing using MapReduce
– support yet more features
– …
Thank you!
Processing the query “beatles musician”
Gitanes
… legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …
John Lennon
0 entity:john_lennon 1 relation:is_a 2 class:musician 2 class:singer … (the numbers are word positions)
Query: beatles entity:* and entity:* . relation:is_a . class:musician
– two prefix queries, one join
– completions of the first: entity:john_lennon, entity:1964, entity:liverpool, etc.
– completions of the second: entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
– join result: entity:john_lennon, etc.
Processing the query “beatles musician”
Problem: entity:* has a huge number of occurrences
– ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences
– prefix search efficient only for up to ≈ 1% (explanation follows)
Solution: frontier classes
– classes at “appropriate” level in the hierarchy
– e.g.: artist, believer, worker, vegetable, animal, …
Processing the query “beatles musician”
Gitanes
… legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …
John Lennon
0 artist:john_lennon 0 believer:john_lennon 1 relation:is_a 2 class:musician …
Query: beatles artist:* and artist:* . relation:is_a . class:musician
– first figure out: musician → artist (easy)
– two prefix queries, one join
– completions of the first: artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
– completions of the second: artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
– join result: artist:john_lennon, etc.
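The join step can be sketched as an intersection of the two completion lists: it joins on the completion word itself (not on doc ids), keeping the entities that both occur near "beatles" and are musicians. The completion lists below are made-up illustrations.

```python
# Sketch of the join of two sorted completion lists with two pointers,
# the classic sorted-list intersection (a simplification of the engine's
# combined completion-and-join machinery).

def join(completions_a, completions_b):
    """Intersect two sorted completion lists."""
    out, i, j = [], 0, 0
    while i < len(completions_a) and j < len(completions_b):
        if completions_a[i] == completions_b[j]:
            out.append(completions_a[i]); i += 1; j += 1
        elif completions_a[i] < completions_b[j]:
            i += 1
        else:
            j += 1
    return out

near_beatles = ["entity:1964", "entity:john_lennon", "entity:liverpool"]
musicians    = ["entity:johann_sebastian_bach", "entity:john_lennon",
                "entity:wolfgang_amadeus_mozart"]
print(join(near_beatles, musicians))   # ['entity:john_lennon']
```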
INV vs. HYB — Space Consumption
Theorem: The empirical entropy of INV is
Σ ni ∙ (1/ln 2 + log2(n/ni))
Theorem: The empirical entropy of HYB with block size ε∙n is
Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))
           HOMEOPATHY    WIKIPEDIA     TREC .GOV
docs       44,015        2,866,503     25,204,013
words      263,817       6,700,119     25,263,176
positions  with          with          no
raw size   452 MB        7.4 GB        426 GB
INV        13 MB         0.48 GB       4.6 GB
HYB        14 MB         0.51 GB       4.9 GB
Nice match of theory and practice
ni = number of documents containing the i-th word, n = number of documents
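The two entropy bounds can be checked as executable formulas on a toy word distribution (the ni values below are invented): the HYB overhead over INV is exactly ε/ln 2 extra bits per posting.

```python
# The INV and HYB entropy bounds from the theorems above, for a toy
# distribution. Shows the per-posting overhead of HYB is eps/ln(2) bits.

import math

def h_inv(ni, n):
    return sum(x * (1 / math.log(2) + math.log2(n / x)) for x in ni)

def h_hyb(ni, n, eps):
    return sum(x * ((1 + eps) / math.log(2) + math.log2(n / x)) for x in ni)

ni, n, eps = [5000, 300, 40, 7], 10_000, 0.1
extra = h_hyb(ni, n, eps) - h_inv(ni, n)
print(round(extra, 2))   # equals eps * sum(ni) / ln(2), up to rounding
```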
INV vs. HYB — Query Time
HOMEOPATHY44,015 docs
263,817 words5,732 real queries
with proximity
avg : 0.03 secsmax: 0.38 secs
avg : .003 secsmax: 0.06 secs
INV
HYB
WIKIPEDIA2,866,503 docs
6,700,119 words100 random queries
with proximity
avg : 0.17 secsmax: 2.27 secs
avg : 0.05 secsmax: 0.49 secs
Experiment: type ordinary queries from left to right
db , dbl , dblp , dblp un , dblp uni , dblp univ , dblp unive , ...
TREC .GOV25,204,013 docs
25,263,176 words50 TREC queries
no proximity
avg : 0.58 secsmax: 16.83 secs
avg : 0.11 secsmax: 0.86 secs
HYB beats INV by an order of magnitude
Engineering
Careful implementation in C++
– Experiment: sum over array of 10 million 4-byte integers (on a Linux PC with an approx. 2 GB/sec memory bandwidth)
C++: 1800 MB/sec   Java: 300 MB/sec   MySQL: 16 MB/sec   Perl: 2 MB/sec
With HYB, every query is essentially one block scan
– perfect locality of access, no sorting or merging, etc.
– balanced ratio of read, decompression, processing, etc.
read 21%   decomp. 18%   intersect 11%   rank 15%   history 35%
System Design — High Level View
Debugging such an application is hell!
Compute Server (C++) ↔ Web Server (PHP) ↔ User Client (JavaScript)