space- and time-efficient data structures for massive...

92
Giulio Ermanno Pibiri [email protected] Supervisor Rossano Venturini Department of Computer Science University of Pisa 1 Space- and Time-Efficient Data Structures for Massive Datasets 15/11/2018

Upload: others

Post on 02-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Giulio Ermanno Pibiri [email protected]

Supervisor Rossano Venturini

Department of Computer Science University of Pisa

1

Space- and Time-EfficientData Structures

for Massive Datasets

15/11/2018

Page 2: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor
Page 3: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

3

Evidence

The increase of information does not scale with technology.

Page 4: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

3

Evidence

“Software is getting slower more rapidly than hardware becomes faster.”Niklaus Wirth, A Plea for Lean Software

The increase of information does not scale with technology.

Page 5: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

3

Evidence

“Software is getting slower more rapidly than hardware becomes faster.”Niklaus Wirth, A Plea for Lean Software

The increase of information does not scale with technology.

Even more relevant today!

Page 6: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

4

Scenario

time space

AlgorithmsEFFICIENCY

how much work is required by a program - less work

Data structuresPERFORMANCE

how quickly a program does its work - faster work

Page 7: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

4

Scenario

time space

AlgorithmsEFFICIENCY

how much work is required by a program - less work

Data structuresPERFORMANCE

how quickly a program does its work - faster work

?

Data compression space time

Page 8: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Small vs. fast?

The dichotomy problem

5

Page 9: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Small vs. fast?

The dichotomy problem

5

Choose one.

Page 10: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Small vs. fast?

NO

The dichotomy problem

5

Choose one.

Page 11: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

6

High level thesis

Data Structures + Data Compression Fast Algorithms

Design space-efficient ad-hoc data structures, both from a theoretical and practical perspective,

that support fast data extraction.

Data Compression & Fast Retrieval together.

Page 12: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

7

Achieved resultsJournal paper Clustered Elias-Fano Indexes

Giulio Ermanno Pibiri and Rossano Venturini ACM Transactions on Information Systems (TOIS) Full paper, 34 pages, 2017.

Conference paperGiulio Ermanno Pibiri and Rossano Venturini Annual Symposium on Combinatorial Pattern Matching (CPM) Full paper, 14 pages, 2017.

Dynamic Elias-Fano Representation

Conference paperGiulio Ermanno Pibiri and Rossano Venturini ACM Conference on Research and Development in Information Retrieval (SIGIR) Full paper, 10 pages, 2017.

Efficient Data Structures for Massive N-Gram Datasets

Conference paper

Giulio Ermanno Pibiri and Rossano Venturini arXiv (CoRR), April 2018. Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE) Full paper, 12 pages, 2018.

Variable-Byte Encoding is Now Space-Efficient Too

Giulio Ermanno Pibiri and Rossano Venturini ACM Transactions on Information Systems (TOIS), 2018. To appear. Full paper, 41 pages, 2018.

Handling Massive N-Gram Datasets Efficiently

Giulio Ermanno Pibiri, Matthias Petri and Alistair Moffat ACM Conference on Web Search and Data Mining (WSDM) Full paper, 9 pages, 2019.

Fast Dictionary-based Compression for Inverted Indexes

Journal paper

Journal paper

Page 13: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

7

Achieved resultsJournal paper Clustered Elias-Fano Indexes

Giulio Ermanno Pibiri and Rossano Venturini ACM Transactions on Information Systems (TOIS) Full paper, 34 pages, 2017.

Conference paperGiulio Ermanno Pibiri and Rossano Venturini Annual Symposium on Combinatorial Pattern Matching (CPM) Full paper, 14 pages, 2017.

Dynamic Elias-Fano Representation

Conference paperGiulio Ermanno Pibiri and Rossano Venturini ACM Conference on Research and Development in Information Retrieval (SIGIR) Full paper, 10 pages, 2017.

Efficient Data Structures for Massive N-Gram Datasets

Conference paper

Giulio Ermanno Pibiri and Rossano Venturini arXiv (CoRR), April 2018. Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE) Full paper, 12 pages, 2018.

Variable-Byte Encoding is Now Space-Efficient Too

Giulio Ermanno Pibiri and Rossano Venturini ACM Transactions on Information Systems (TOIS), 2018. To appear. Full paper, 41 pages, 2018.

Handling Massive N-Gram Datasets Efficiently

Giulio Ermanno Pibiri, Matthias Petri and Alistair Moffat ACM Conference on Web Search and Data Mining (WSDM) Full paper, 9 pages, 2019.

Fast Dictionary-based Compression for Inverted Indexes

Journal paper

Journal paper

integer sequences

Page 14: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

7

Achieved resultsJournal paper Clustered Elias-Fano Indexes

Giulio Ermanno Pibiri and Rossano Venturini ACM Transactions on Information Systems (TOIS) Full paper, 34 pages, 2017.

Conference paperGiulio Ermanno Pibiri and Rossano Venturini Annual Symposium on Combinatorial Pattern Matching (CPM) Full paper, 14 pages, 2017.

Dynamic Elias-Fano Representation

Conference paperGiulio Ermanno Pibiri and Rossano Venturini ACM Conference on Research and Development in Information Retrieval (SIGIR) Full paper, 10 pages, 2017.

Efficient Data Structures for Massive N-Gram Datasets

Conference paper

Giulio Ermanno Pibiri and Rossano Venturini arXiv (CoRR), April 2018. Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE) Full paper, 12 pages, 2018.

Variable-Byte Encoding is Now Space-Efficient Too

Giulio Ermanno Pibiri and Rossano Venturini ACM Transactions on Information Systems (TOIS), 2018. To appear. Full paper, 41 pages, 2018.

Handling Massive N-Gram Datasets Efficiently

Giulio Ermanno Pibiri, Matthias Petri and Alistair Moffat ACM Conference on Web Search and Data Mining (WSDM) Full paper, 9 pages, 2019.

Fast Dictionary-based Compression for Inverted Indexes

Journal paper

Journal paper

integer sequences

short strings

Page 15: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

8

Problem 1

Consider a sorted integer sequence.

Page 16: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

8

Problem 1

Consider a sorted integer sequence.

How to represent it as a bit-vector where each original integer is uniquely-decodable, using as few as possible

bits?

How to maintain fast decompression speed?

Page 17: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

8

Problem 1

Consider a sorted integer sequence.

How to represent it as a bit-vector where each original integer is uniquely-decodable, using as few as possible

bits?

How to maintain fast decompression speed?

This is a difficult problem that has been studied since the the ’60.

Page 18: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

9

Applications

Inverted indexes Databases

RDF indexing

Geo-spatial data Graph-compression

E-Commerce

Page 19: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

9

Applications

Inverted indexes Databases

RDF indexing

Geo-spatial data Graph-compression

E-Commerce

Page 20: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Inverted indexes

The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

10

Page 21: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Inverted indexes

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

10

Page 22: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Inverted indexes

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

{always, boy, good, house, hungry, is, red, the}t1 t2 t3 t4 t5 t6 t7 t8

The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

10

Page 23: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Inverted indexes

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

21

3

45

{always, boy, good, house, hungry, is, red, the}t1 t2 t3 t4 t5 t6 t7 t8

The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

10

Page 24: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Inverted indexes

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

21

3

45

{always, boy, good, house, hungry, is, red, the}t1 t2 t3 t4 t5 t6 t7 t8

Lt1=[1, 3]Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

10

Page 25: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Inverted indexes

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

21

3

45

{always, boy, good, house, hungry, is, red, the}t1 t2 t3 t4 t5 t6 t7 t8

Lt1=[1, 3]Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

10

Page 26: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

11

Inverted indexes

Inverted indexes owe their popularity to the efficient resolution of queries, such as:

“return all documents in which terms {t1,…,tk} occur”.

Page 27: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

{always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

11

Inverted indexes

Inverted indexes owe their popularity to the efficient resolution of queries, such as:

“return all documents in which terms {t1,…,tk} occur”.

Page 28: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

{always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

11

Inverted indexes

Inverted indexes owe their popularity to the efficient resolution of queries, such as:

“return all documents in which terms {t1,…,tk} occur”.

Q = {boy, is, the}

Page 29: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

{always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

11

Inverted indexes

Inverted indexes owe their popularity to the efficient resolution of queries, such as:

“return all documents in which terms {t1,…,tk} occur”.

Q = {boy, is, the}

Page 30: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Huge research corpora describing different space/time trade-offs.

• Elias Gamma and Delta • Variable-Byte Family • Binary Interpolative Coding • Simple Family • PForDelta • QMX • Elias-Fano • Partitioned Elias-Fano

Many solutions

12

‘70

2014

Page 31: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Huge research corpora describing different space/time trade-offs.

• Elias Gamma and Delta • Variable-Byte Family • Binary Interpolative Coding • Simple Family • PForDelta • QMX • Elias-Fano • Partitioned Elias-Fano

Many solutions

12

Space Time

Spectrum

~3X smaller ~4.5X faster

Binary Interpolative

Coding

Variable-ByteFamily

‘70

2014

Page 32: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Huge research corpora describing different space/time trade-offs.

• Elias Gamma and Delta • Variable-Byte Family • Binary Interpolative Coding • Simple Family • PForDelta • QMX • Elias-Fano • Partitioned Elias-Fano

Many solutions

12

Space Time

Spectrum

~3X smaller ~4.5X faster

Binary Interpolative

Coding

Variable-ByteFamily

‘70

2014

Page 33: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

13

Key research questions

Space Time

Spectrum

~3X smaller ~4.5X faster

Binary Interpolative

Coding

Variable-ByteFamily

Page 34: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

13

Key research questions

Space Time

Spectrum

~3X smaller ~4.5X faster

Binary Interpolative

Coding

Variable-ByteFamily

Is it possible to design an encoding that is as small as

BIC and much faster?1

Page 35: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

13

Key research questions

Space Time

Spectrum

~3X smaller ~4.5X faster

Binary Interpolative

Coding

Variable-ByteFamily

Is it possible to design an encoding that is as small as

BIC and much faster?1

Is it possible to design an encoding that is as fast as VByte and much smaller?

2

Page 36: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

13

Key research questions

Space Time

Spectrum

~3X smaller ~4.5X faster

Binary Interpolative

Coding

Variable-ByteFamily

Is it possible to design an encoding that is as small as

BIC and much faster?1

Is it possible to design an encoding that is as fast as VByte and much smaller?

2

What about both objectives at the same time?!

3

Page 37: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

14

Idea 1 - Clustered inverted indexes (TOIS ’17)

Every encoder represents each sequence individually.No exploitation of redundancy.

Page 38: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

14

Idea 1 - Clustered inverted indexes (TOIS ’17)

Every encoder represents each sequence individually.No exploitation of redundancy.

Page 39: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

14

Idea 1 - Clustered inverted indexes (TOIS ’17)

Every encoder represents each sequence individually.No exploitation of redundancy.

Encode clusters of inverted lists.

Page 40: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

14

Idea 1 - Clustered inverted indexes (TOIS ’17)

Every encoder represents each sequence individually.No exploitation of redundancy.

Encode clusters of inverted lists.

Always better than PEF (by up to 11%)and better than BIC

(by up to 6.25%)

Much faster than BIC (~103%)

Slightly slower than PEF (~20%)

Space Time

Spectrum

Page 41: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

15

Idea 2 - Optimally-partitioned VByte (TKDE ’18)

The majority of values are small (very small indeed).

VByte needs at least 8 bits per integer, that is sensibly far away from bit-level effectiveness (BIC: 3.54, PEF: 4.1 on Gov2).

Page 42: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

15

Idea 2 - Optimally-partitioned VByte (TKDE ’18)

The majority of values are small (very small indeed).

VByte needs at least 8 bits per integer, that is sensibly far away from bit-level effectiveness (BIC: 3.54, PEF: 4.1 on Gov2).

Page 43: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

15

Idea 2 - Optimally-partitioned VByte (TKDE ’18)

The majority of values are small (very small indeed).

VByte needs at least 8 bits per integer, that is sensibly far away from bit-level effectiveness (BIC: 3.54, PEF: 4.1 on Gov2).

Encode dense regions with unary codes, sparse

regions with VByte.

Page 44: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

15

Idea 2 - Optimally-partitioned VByte (TKDE ’18)

The majority of values are small (very small indeed).

VByte needs at least 8 bits per integer, that is sensibly far away from bit-level effectiveness (BIC: 3.54, PEF: 4.1 on Gov2).

Encode dense regions with unary codes, sparse

regions with VByte.

Optimal partitioning in linear time and

constant space.

Page 45: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

15

Idea 2 - Optimally-partitioned VByte (TKDE ’18)

The majority of values are small (very small indeed).

VByte needs at least 8 bits per integer, that is sensibly far away from bit-level effectiveness (BIC: 3.54, PEF: 4.1 on Gov2).

Encode dense regions with unary codes, sparse

regions with VByte.

Compression ratio improves by 2X.

Optimal partitioning in linear time and

constant space.

Page 46: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

15

Idea 2 - Optimally-partitioned VByte (TKDE ’18)

The majority of values are small (very small indeed).

VByte needs at least 8 bits per integer, that is sensibly far away from bit-level effectiveness (BIC: 3.54, PEF: 4.1 on Gov2).

Encode dense regions with unary codes, sparse

regions with VByte.

Compression ratio improves by 2X.

Query processing speed and sequential decoding

not affected.

Optimal partitioning in linear time and

constant space.

Page 47: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Idea 3 - Dictionary compression (WSDM ’19)

16

with M. Petri and A. Moffat (University of Melbourne)If we consider subsequences of d-gaps in inverted lists,

these are repetitive across the whole inverted index.

Page 48: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Idea 3 - Dictionary compression (WSDM ’19)

16

with M. Petri and A. Moffat (University of Melbourne)If we consider subsequences of d-gaps in inverted lists,

these are repetitive across the whole inverted index.

Put the top-k frequent patters in a dictionary of size k.

Then encode inverted lists as sequences of log k-bit codewords.

Page 49: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Idea 3 - Dictionary compression (WSDM ’19)

16

with M. Petri and A. Moffat (University of Melbourne)If we consider subsequences of d-gaps in inverted lists,

these are repetitive across the whole inverted index.

Put the top-k frequent patters in a dictionary of size k.

Then encode inverted lists as sequences of log k-bit codewords.

Close to the most space-efficient representation (~7% away from BIC).

Page 50: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Idea 3 - Dictionary compression (WSDM ’19)

16

with M. Petri and A. Moffat (University of Melbourne)If we consider subsequences of d-gaps in inverted lists,

these are repetitive across the whole inverted index.

Put the top-k frequent patters in a dictionary of size k.

Then encode inverted lists as sequences of log k-bit codewords.

Close to the most space-efficient representation (~7% away from BIC).

Almost as fast as the fastest SIMD-ized decoders.

Page 51: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

The bigger picture

17

Page 52: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

The bigger picture

17

Page 53: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

The bigger picture

17

Page 54: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Integer data structures

• van Emde Boas Trees• X/Y-Fast Tries• Fusion Trees• Exponential Search Trees• …

• EF(S(n,u)) = n log(u/n) + 2n bits to encode a sorted integer sequence S

• O(1) Access• O(1 + log(u/n)) Predecessor

space+ time -

dynamic+space+static-

+ time

Elias-Fano encoding

Problem 2

18

Page 55: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Integer data structures

• van Emde Boas Trees• X/Y-Fast Tries• Fusion Trees• Exponential Search Trees• …

• EF(S(n,u)) = n log(u/n) + 2n bits to encode a sorted integer sequence S

• O(1) Access• O(1 + log(u/n)) Predecessor

space+ time -

dynamic+space+static-

+ time

Can we grab the best from both?

Elias-Fano encoding

Problem 2

18

Page 56: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

19

Dynamic inverted indexes

Classic solution: use two indexes. One is big and cold; the other is small and hot.

Merge them periodically.

Append-only inverted indexes.

Page 57: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

20

For u = nγ, γ = (1):• EF(S(n,u)) + o(n) bits• O(1) Access• O(min{1+log(u/n), loglog n}) Predecessor

Integer dictionaries in succinct space (CPM ’17)

• EF(S(n,u)) + o(n) bits• O(1) Access• O(1) Append (amortized)• O(min{1+log(u/n), loglog n}) Predecessor

• EF(S(n,u)) + o(n) bits• O(log n / loglog n) Access

• O(log n / loglog n) Insert/Delete (amortized)• O(min{1+log(u/n), loglog n}) Predecessor

Result 1

Result 2

Result 3

Page 58: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

20

For u = nγ, γ = (1):• EF(S(n,u)) + o(n) bits• O(1) Access• O(min{1+log(u/n), loglog n}) Predecessor

Integer dictionaries in succinct space (CPM ’17)

• EF(S(n,u)) + o(n) bits• O(1) Access• O(1) Append (amortized)• O(min{1+log(u/n), loglog n}) Predecessor

• EF(S(n,u)) + o(n) bits• O(log n / loglog n) Access

• O(log n / loglog n) Insert/Delete (amortized)• O(min{1+log(u/n), loglog n}) Predecessor

Result 1

Result 2

Result 3

Optimal time bounds for all

operations using a sublunar

redundancy.

Page 59: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

21

Problem 3

Consider a large text.

Page 60: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

21

Problem 3

Consider a large text.

How to represent all its substrings of size 1 ≤ k ≤ N words for fixed N (e.g., N = 5), using as few as possible bits?

How to estimate the probability of occurrence of the patterns under a given probability model?

Fast Access to individual N-grams?

Page 61: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

21

Problem 3

Consider a large text.

How to represent all its substrings of size 1 ≤ k ≤ N words for fixed N (e.g., N = 5), using as few as possible bits?

How to estimate the probability of occurrence of the patterns under a given probability model?

Fast Access to individual N-grams?

This is problem is central to applications in IR, ML, NLP, WSE.

Page 62: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

22

Applications

Next word prediction.

Page 63: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

22

Applications

Next word prediction.

space and time-efficient ?

context

Page 64: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

22

Applications

Next word prediction.

algorithms

foo

data structures

bar

baz

1214

2

3647

3

1

frequency count

space and time-efficient ?

context

Page 65: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

22

Applications

Next word prediction.

algorithms

foo

data structures

bar

baz

1214

2

3647

3

1

frequency count

space and time-efficient ?

context

f (“space and time-efficient data structures”)

f (“space and time-efficient”)P(“data structures” | “space and time-efficient”) ≈

Page 66: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

What can I help you with?

Siri

Page 67: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

24

Applications

Page 68: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

24

Applications

Page 69: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

Indexing

25

Books~6% of the books ever published

n number of n-grams

1 24,359,4732 667,284,7713 7,397,041,9014 1,644,807,8965 1,415,355,596

More than 11 billion n-grams.

Page 70: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

26

Idea 1 - Context-based remapped tries (SIGIR ’17)

The number of words following a given context is small.

Page 71: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

k = 1

Map a word ID to the position it takes within its sibling IDs

(the IDs following a context of fixed length k).

26

Idea 1 - Context-based remapped tries (SIGIR ’17)

The number of words following a given context is small.

Page 72: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

k = 1

Map a word ID to the position it takes within its sibling IDs

(the IDs following a context of fixed length k).

26

Idea 1 - Context-based remapped tries (SIGIR ’17)

The number of words following a given context is small.

Page 73: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

k = 1

Map a word ID to the position it takes within its sibling IDs

(the IDs following a context of fixed length k).

26

Idea 1 - Context-based remapped tries (SIGIR ’17)

The number of words following a given context is small.

Page 74: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

k = 1

Map a word ID to the position it takes within its sibling IDs

(the IDs following a context of fixed length k).

26

Idea 1 - Context-based remapped tries (SIGIR ’17)

The number of words following a given context is small.

Page 75: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

k = 1

Map a word ID to the position it takes within its sibling IDs

(the IDs following a context of fixed length k).

26

Idea 1 - Context-based remapped tries (SIGIR ’17)

The number of words following a given context is small.

The (Elias-Fano) context-based remapped trie is as fast as the fastest competitor, but up to 65% smaller.

Page 76: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

k = 1

Map a word ID to the position it takes within its sibling IDs

(the IDs following a context of fixed length k).

26

Idea 1 - Context-based remapped tries (SIGIR ’17)

The number of words following a given context is small.

The (Elias-Fano) context-based remapped trie is even smaller than

the most space-efficient competitors, that are lossy and with false-positives

allowed, and up to 5X faster.

The (Elias-Fano) context-based remapped trie is as fast as the fastest competitor, but up to 65% smaller.

Page 77: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Page 78: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Page 79: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Page 80: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Page 81: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Page 82: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Page 83: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Using a scan of the block and O(|V|) space.

Page 84: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Using a scan of the block and O(|V|) space.

Page 85: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Using a scan of the block and O(|V|) space.

Rebuilding the last level of the trie.

Page 86: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Using a scan of the block and O(|V|) space.

Rebuilding the last level of the trie.

A 4B 2C 2X 4

Page 87: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Using a scan of the block and O(|V|) space.

Rebuilding the last level of the trie.

A 4B 2C 2X 4

A 1B 5C 7X 9

Page 88: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Using a scan of the block and O(|V|) space.

Rebuilding the last level of the trie.

A 4B 2C 2X 4

A 1B 5C 7X 9

Page 89: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Using a scan of the block and O(|V|) space.

Rebuilding the last level of the trie.

A 4B 2C 2X 4

A 1B 5C 7X 9

Page 90: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Using a scan of the block and O(|V|) space.

Rebuilding the last level of the trie.

A 4B 2C 2X 4

A 1B 5C 7X 9

Page 91: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

27

Idea 2 - Fast estimation in external memory (TOIS ’18)

To compute the modified Kneser-Ney probabilities of the n-grams,

the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order Context order

Computing the distinct left extensions.

Using a scan of the block and O(|V|) space.

Rebuilding the last level of the trie.

A 4B 2C 2X 4

A 1B 5C 7X 9

Estimation runs 4.5X faster with billions of strings.

Page 92: Space- and Time-Efficient Data Structures for Massive Datasetspages.di.unipi.it/pibiri/slides/final_PhD_presentation.pdf · Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor

28

Thanks for your attention, time, patience!

Any questions?