data science at scale - prashant pandeyapocrypha 4. hyrise 5. a data security startup theoretically...

123
Data Science at Scale: Scaling Up by Scaling Down and Out (to Disk) Prashant Pandey [email protected] Berkeley Lab/UC Berkeley 1

Upload: others

Post on 15-Mar-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Data Science at Scale:Scaling Up by Scaling Down and Out (to Disk)

Prashant [email protected]

Berkeley Lab/UC Berkeley

1

Page 2: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Sequence Read Archive (SRA) database growth

2

Total bases

Open access

Log

-sca

le

SRA contains a lot of diversity informationGoal: perform sequence searches on the database

Page 3: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Scalability is the bottleneck for data science

3

Total bases

Open access

Log

-sca

le

Data science applications only looking at a small portion of data

Current index size (few TBs)

Page 4: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Scalable data systems → Scalable data science

4

Total bases

Open access

Log

-sca

le

Goal: index data at Exa Byte scale

My goal as a researcher is to build scalable data systems to accelerate and scale data science applications

Current index size (few TBs)

Page 5: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Three approaches to handle massive data

5

Page 6: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Goal: make data smaller to fit in RAM

Techniques:● Compact &

succinct data structures

● Filters, e.g., Bloom, quotient, etc.

Three approaches to handle massive data

Shrink it

6

Page 7: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Goal: organize data in a disk-friendly way

Techniques:● B-tree● Bε-tree● LSM-tree

Goal: make data smaller to fit in RAM

Techniques:● Compact &

succinct data structures

● Filters, e.g., Bloom, quotient, etc.

Three approaches to handle massive data

Shrink it Organize it

7

Page 8: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Goal: partition and distribute data on multiple nodes

Techniques:● Distributed

hash table● Distributed

key-value store

Goal: organize data in a disk-friendly way

Techniques:● B-tree● Bε-tree● LSM-tree

Goal: make data smaller to fit in RAM

Techniques:● Compact &

succinct data structures

● Filters, e.g., Bloom, quotient, etc.

Three approaches to handle massive data

Shrink it Organize it Distribute it

8

Page 9: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Research outputData structures & Algorithms

(Counting) Quotient FilterSIGMOD ‘17,

arXiv ‘17

Buffered Count-Min

SketchESA ‘18

Order Min HashISMB ‘19

9

Page 10: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Research outputData structures & Algorithms

Buffered Count-Min

SketchESA ‘18

Order Min HashISMB ‘19

BεtrFS file systemFAST ‘15, TOS 15,

FAST ‘16, TOS 16, SPAA ‘19

Bε-tree

File systems

(Counting) Quotient FilterSIGMOD ‘17,

arXiv ‘17

10

Page 11: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Research outputData structures & Algorithms

Buffered Count-Min

SketchESA ‘18

Order Min HashISMB ‘19

BεtrFS file systemFAST ‘15, TOS 15,

FAST ‘16, TOS 16, SPAA ‘19

Bε-tree

Squeakr, deBGR, Mantis, Rainbowfish, MST-Mantis

ISMB ‘17, WABI ‘17, BIOINFORMATICS ‘17,

RECOMB ‘18, Cell Systems ‘18, RECOMB ‘19,

JCB ‘20

LSM-Mantis, VaraintStore bioRxiv ‘20, bioRxiv ‘21

Distributed k-mer countingIPDPS ‘21

File systemsComputational biology

(Counting) Quotient FilterSIGMOD ‘17,

arXiv ‘17

11

Page 12: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Research outputData structures & Algorithms

Buffered Count-Min

SketchESA ‘18

Order Min HashISMB ‘19

BεtrFS file systemFAST ‘15, TOS 15,

FAST ‘16, TOS 16, SPAA ‘19

Bε-tree

Squeakr, deBGR, Mantis, Rainbowfish, MST-Mantis

ISMB ‘17, WABI ‘17, BIOINFORMATICS ‘17,

RECOMB ‘18, Cell Systems ‘18, RECOMB ‘19,

JCB ‘20

LSM-Mantis, VaraintStore bioRxiv ‘20, bioRxiv ‘21

Distributed k-mer countingIPDPS ‘21

Stream processing

LERTsarXiv ‘19, SIGMOD ‘20

File systemsComputational biology

(Counting) Quotient FilterSIGMOD ‘17,

arXiv ‘17

12

Page 13: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

In this talkData structures & Algorithms

Buffered Count-Min

SketchESA ‘18

Order Min HashISMB ‘19

BεtrFS file systemFAST ‘15, TOS 15,

FAST ‘16, TOS 16, SPAA ‘19

Bε-tree

Squeakr, deBGR, Mantis, Rainbowfish, MST-Mantis

ISMB ‘17, WABI ‘17, BIOINFORMATICS ‘17,

RECOMB ‘18, Cell Systems ‘18, RECOMB ‘19,

JCB ‘20

LSM-Mantis, VaraintStore bioRxiv ‘20, bioRxiv ‘21

Distributed k-mer countingIPDPS ‘21

Stream processing

LERTsarXiv ‘19, SIGMOD ‘20

File systemsComputational biology

(Counting) Quotient FilterSIGMOD ‘17,

arXiv ‘17

Shrink it

13

Page 14: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

In this talkData structures & Algorithms

Buffered Count-Min

SketchESA ‘18

Order Min HashISMB ‘19

BεtrFS file systemFAST ‘15, TOS 15,

FAST ‘16, TOS 16, SPAA ‘19

Bε-tree

Squeakr, deBGR, Mantis, Rainbowfish, MST-Mantis

ISMB ‘17, WABI ‘17, BIOINFORMATICS ‘17,

RECOMB ‘18, Cell Systems ‘18, RECOMB ‘19,

JCB ‘20

LSM-Mantis, VaraintStore bioRxiv ‘20, bioRxiv ‘21

Distributed k-mer countingIPDPS ‘21

Stream processing

LERTsarXiv ‘19, SIGMOD ‘20

File systemsComputational biology

(Counting) Quotient FilterSIGMOD ‘17,

arXiv ‘17

14

Shrink it

Organize it

Page 15: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Dictionary data structure

ac

b

d

A dictionary maintains a set S from universe U.

A dictionary supports membership queries on S.

membership(a):

membership(b):

membership(c):

membership(d):

S

15

Page 16: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Filter data structure

ac

b

d

A filter is an approximate dictionary.

A filter supports approximate membership queries on S.

membership(a):

membership(b):

membership(c):

membership(d): false positive

S

16

Page 17: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

A filter guarantees a false-positive rate ε

if q ∈ S, return with probability 1

with probability ﹥ 1 - ε if q ∉ S, return with probability ≤ ε false positive

true negative

true positive

one-sided errors

17

Page 18: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

False-positive rate enables filters to be compact

DictionaryFilter

18

Page 19: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

False-positive rate enables filters to be compact

DictionaryFilter

Small

Large

For most practical purposes: ε = 2%, Bloom filter requires ≈ 8 bits/item

19

Page 20: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Classic filter: The Bloom filter [Bloom ‘70]

Bloom filter: a bit array + k hash functions

ac b

d

S

20

Page 21: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Classic filter: The Bloom filter [Bloom ‘70]

Bloom filter: a bit array + k hash functions (here k = 2)

ac b

d

S

h1(a) = 1h2(a) = 3

h1(c) = 5h2(c) = 3

21

Page 22: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Classic filter: The Bloom filter [Bloom ‘70]

Bloom filter: a bit array + k hash functions (here k=2)

ac b

d

S

h1(b) = 2h2(b) = 5 true

negative

22

Page 23: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Classic filter: The Bloom filter [Bloom ‘70]

Bloom filter: a bit array + k hash functions (here k=2)

ac b

d

S

h1(d) = 1h2(d) = 3 False

positive

23

Page 24: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Bloom filter are ubiquitous (> 4300 citations)

Storage systems

NetworkingStreaming applications

Computational biology

Databases

24

Page 25: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Bloom filter have suboptimal asymptotics

Bloom filter Optimal

Space

CPU cost

Data locality

25

Page 26: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Limitations Workarounds

No deletes Rebuild

No resizes Guess N, and rebuild if wrong

No filter merging or enumeration ???

No values associated with keys Combine with another data structure

Application often work around Bloom filter limitations

Bloom filter limitations increase system complexity, waste space, and slow down application performance

26

Page 27: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● Store fingerprints compactly in a hash table.○ Take a fingerprint h(x) for each element x.

● Only source of false positives:○ Two distinct elements x and y, where h(x) = h(y)○ If x is stored and y isn’t, query(y) gives a false positives

Quotienting is an alternative to Bloom filters [Knuth. Searching and Sorting Vol. 3, ‘97]

p

27

Page 28: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

b(x) = location in the hash tablet(x) = tag stored in the hash table

q rb(x)

t(x)

TagBucket index

Storing fingerprints compactly

p

28

Page 29: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

b(x) = location in the hash tablet(x) = tag stored in the hash table

Collisions in the hash table?

b(x)

t(x)

b(y)

t(y)

Storing fingerprints compactly

q r

TagBucket index

p

29

Page 30: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

b(x) = location in the hash tablet(x) = tag stored in the hash table

Collisions in the hash table?● Linear probing● Robin Hood hashing

b(x)

t(x)

t(y)

b(y)

t(y)

Storing fingerprints compactly

q r

TagBucket index

p

30

Page 31: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

b(x) = location in the hash tablet(x) = tag stored in the hash table

Collisions in the hash table?● Linear probing● Robin Hood hashing

b(x)

t(x)

t(y)

b(y)

t(y)

Storing fingerprints compactly

q r

TagBucket index

p

t(y) belongs to slots 4 or 5?

31

Page 32: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● QF uses two metadata bits to resolve collisions and identify home bucket

● The metadata bits group tags by their home bucket

Resolving collisions in the QF [Bender ‘12, Pandey ‘17]

1 1

t(u) t(v) t(w) t(x) t(y)

32

Page 33: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● QF uses two metadata bits to resolve collisions and identify home bucket

● The metadata bits group tags by their home bucket

insert v

Resolving collisions in the QF [Bender ‘12, Pandey ‘17]

1 1

t(u) t(v) t(v) t(w) t(x) t(y)

33

Page 34: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● QF uses two metadata bits to resolve collisions and identify home bucket

● The metadata bits group tags by their home bucket

The metadata bits enable us to identify the slots holding the contents of each bucket.

Resolving collisions in the QF [Bender ‘12, Pandey ‘17]

insert v

1 1

t(u) t(v) t(v) t(w) t(x) t(y)

34More

Page 35: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● Good cache locality● Efficient scaling out-of-RAM● Deletions● Enumerability/Mergeability● Resizing● Maintains count estimates● Uses variable-sized encoding for counts [Counting quotient filter]

○ Asymptotically optimal space: O(∑ |C(x)|)

Quotienting enables many features in the QF

35

Page 36: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Quotient filters use less space than Bloom filters for all practical configurations

Quotient filter Bloom filter Optimal

Space

CPU cost

Data locality

The quotient filter has theoretical advantages over the Bloom filter

36

Page 37: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Bloom filter: ~1.44 log(1/ε) bits/element.Quotient filter: ~2.125 + log(1/ε) bits/element.

Quotient filters use less space than Bloom filters for all practical configurations

False-positive rate < 1/64 (or 0.15).

37

Accuracy

Page 38: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● Insert performance is similar to the state-of-the-art non-counting filters● Query performance is significantly fast at low load-factors and slightly slower

at higher load-factors

Inserts Lookups

Quotient filters perform better (or similar) to other non-counting filters

38

Page 39: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Quotient filter’s impact in computer science

39

Computational biology1. Squeakr2. deBGR3. Mantis4. SPAdes assembler5. Khmer software6. MQF7. VariantStore

Databases/Systems1. Counting on GPUs2. Concurrent filters3. Anomaly detection4. BetrFS file system

Industry1. VMware2. Nutanix3. Apocrypha4. Hyrise5. A data security

startup

Theoretically well-founded data structures can have a big impact on multiple subfields across academia and industry

QFDatabases

SequenceSearchIndex

StreamAnalysis

Key-valueStores

QF on GPUs

Deduplication

Graph represen-

tationAssembler(SPAdes)

Page 40: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Learned “Shrink it”. Now “Organize it”Data structures & Algorithms

Buffered Count-Min

SketchESA ‘18

Order Min HashISMB ‘19

BεtrFS file systemFAST ‘15, TOS 15,

FAST ‘16, TOS 16, SPAA ‘19

Bε-tree

Squeakr, deBGR, Mantis, Rainbowfish, MST-Mantis

ISMB ‘17, WABI ‘17, BIOINFORMATICS ‘17,

RECOMB ‘18, Cell Systems ‘18, RECOMB ‘19,

JCB ‘20

LSM-Mantis, VaraintStore bioRxiv ‘20, bioRxiv ‘21

Distributed k-mer countingIPDPS ‘21

Stream processing

LERTsarXiv ‘19, SIGMOD ‘20

File systemsComputational biology

(Counting) Quotient FilterSIGMOD ‘17,

arXiv ‘17

40

Organize it

Page 41: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Open problem in stream processing

● A high-speed stream of key-value pairs arriving over time

● Goal: report every key as soon as it appears T times without

missing any

● Firehose benchmark (Sandia National Lab) simulates the stream https://firehose.sandia.gov/

41

Page 42: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Why should we care about this problem

Defense systems for cyber security monitor high-speed streams for malicious traffic

Malicious traffic forms a small portion of the stream

Automated systems take defensive actions for every reported event

Catch all malicious events

Small reporting threshold

Minimize false positives

42

Page 43: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Timely event detection problem

● Stream of elements arrive over time

S1

Time

S2 St

43

Page 44: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● Stream of elements arrive over time● An event occurs at time t if St occurs exactly T times in

(s1,s2….st)

S1

Time

S2 St

t

Timely event detection problem

44

Page 45: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● Stream of elements arrive over time● An event occurs at time t if St occurs exactly T times in

(s1,s2….st)

S1

Time

S2 St

t

Event!

Suppose T= 4

Timely event detection problem

45

Page 46: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● Stream of elements arrive over time● An event occurs at time t if St occurs exactly T times in

(s1,s2….st)● In timely event-detection problem (TED), we want to report

all events shortly after they occur.

S1

Time

S2 St

t

Event!

Suppose T= 4

Report

Timely event detection problem

46

Page 47: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Features we need in the solution

● Stream is large (e.g., terabytes) and high-speed (millions/sec)

High throughput ingestion

47

Page 48: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Features we need in the solution

● Stream is large (e.g., terabytes) and high-speed (millions/sec)

● Events are high-consequence real-life events

High throughput ingestion

No false-negatives; few false-positives

Timely reporting (real-time)

Sampling

48

Page 49: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Features we need in the solution

● Stream is large (e.g., terabytes) and high-speed (millions/sec)

● Events are high-consequence real-life events

● Malicious traffic forms a small portion of the stream

High throughput ingestion

No false-negatives; few false-positives

Timely reporting (real-time)

Very small reporting thresholds

Sampling

49

Page 50: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

One-pass streaming has errors

● Heavy hitter problem: φ ● Exact one-pass solution solution requires

RAM

50

Page 51: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

One-pass streaming has errors

● φ< (φ−ε)N [Alon et al. 96, Berinde et al. 10, Bhattacharyya et al. 16, Bose et al. 03, Braverman et al.

16, Charikar et al. 02, Cormode et al. 05, Demaine et al. 02, Dimitropoulos et al. 08, Larsen et al. 16, Manku et al. 02.]

● ε

RAM

Real time with false-positives!

Maintain count estimates in RAM[Misra & Gries ‘82]

51

Page 52: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

One-pass streaming has errors

● φ< (φ−ε)N [Alon et al. 96, Berinde et al. 10, Bhattacharyya et al. 16, Bose et al. 03, Braverman et al.

16, Charikar et al. 02, Cormode et al. 05, Demaine et al. 02, Dimitropoulos et al. 08, Larsen et al. 16, Manku et al. 02.]

● ε

RAM

Real time with false-positives!

Maintain count estimates in RAM[Misra & Gries ‘82]

For Sandia, φN is a small constant (e.g., 24), So Ω(1/ε) is very very large!!

Can’t solve in RAM for very small φ

52

Page 53: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

One-pass solution has:

● Stream is large (e.g., terabytes) and high-speed (millions/sec)

● Events are high-consequence real-life events

● Malicious traffic forms a small portion of the stream

High throughput ingestion

No false-negatives; few false-positives

Timely reporting (real-time)

Very small reporting thresholds

53

Page 54: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Two-pass streaming isn’t real-time

● A second pass over the stream can get rid of errors● Store the stream on SSD and access it later

RAM

Scales to very small φ but offline!

Second pass

SSD

54

Page 55: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Two-pass solution has:

● Stream is large (e.g., terabytes) and high-speed (millions/sec)

● Events are high-consequence real-life events

● Malicious traffic forms a small portion of the stream

High throughput ingestion

No false-negatives; few false-positives

Timely reporting (real-time)

Very small reporting thresholds

55

Page 56: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

If data is stored: why not access it?

RAM

SSD

Why wait for second pass?

56

Page 57: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Idea: combine Streaming and EM

Use an efficient external-memory counting data structure to scale Misra-Gries algorithm to SSDs

57

Streaming model

External memory algorithms

Page 58: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● How computations work:○ Data is transferred in blocks between RAM and disk.

○ The number of block transfers dominate the running time.

● Goal: Minimize number of block transfers○ Performance bounds are parameterized by block size B, memory size M,

data size N.

RAM DISK

M

B

B

External memory model [Aggarwal+Vitter ‘08]

58

Page 59: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

L

0

1

RAM

FLASH

log(N/M)

N

Counting QFM

Cascade filter: write-optimized quotient filter[Bender et al. ‘12, Pandey et al. ‘17]

Mr1

MrL

● The Cascade filter efficiently scales out-of-RAM● It accelerates insertions at some cost to queries

59More

Page 60: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Cascade filter operations

Insert Query

60

Page 61: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Cascade filter operations

Insert Query

< 1 I/O per observation

61

Page 62: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Cascade filter operations

Insert Query

< 1 I/O per observation

> 1 I/O per observation

62

Page 63: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Cascade filter doesn’t have real-time reporting

Insert Query

< 1 I/O per observation

> 1 I/O per observation

But every insert is also a query in real-time reporting!

63

Page 64: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Cascade filter doesn’t have real-time reporting

Insert Query

< 1 I/O per observation

> 1 I/O per observation

But every insert is also a query in real-time reporting!

Traditional cascade filter doesn’t solve the problem!

64

Page 65: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

We define the time stretch of a report to be

1st occurrence Tth occurrence Reporting time

Timeline LDLifetime

Delay

Time stretch = 1 + 𝛼 = 1 + Delay

Lifetime

Idea: reporting with bounded delay

65More

Page 66: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

We define the time stretch of a report to be

1st occurrence Tth occurrence Reporting time

Timeline LDLifetime

Delay

Time stretch = 1 + 𝛼 = 1 + Delay

Lifetime

Idea: reporting with bounded delay

Main idea: the longer the lifetime of an item, the more leeway we have in reporting it

66More

Page 67: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Leveled External-Memory Reporting Table (LERT) [Pandey ‘20]

● Given a stream of size N and φ > the amortized cost of solving real-time event detection is

● For a constant 𝛼, can support arbitrarily small thresholds φ with amortized cost

Takeaway: Online reporting comes at the cost of throughput but almost online reporting is essentially free! 67

Page 68: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Leveled External-Memory Reporting Table (LERT) [Pandey ‘20]

● Given a stream of size N and φ > the amortized cost of solving real-time event detection is

● For a constant 𝛼, can support arbitrarily small thresholds φ with amortized cost

Takeaway: Online reporting comes at the cost of throughput but almost online reporting is essentially free!

Can achieve timely reporting at effectively the optimal insert cost; no query cost

68

Page 69: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Evaluation

● Empirical timeliness

● High-throughput ingestion

69

Page 70: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Evaluation: empirical time stretch

Average time stretch is 43% smaller than theoretical upper bound.70

Page 71: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Evaluation: scalability

The insertion throughput increases as we add more threads.We can achieve > 13M insertions/sec.

71

Page 72: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

LERT: supports scalable and real-time reporting

● Stream is large (e.g., terabytes) and high-speed (millions/sec)

● Events are high-consequence real-life events

● Malicious traffic forms a small portion of the stream

High throughput ingestion

No false-negatives; few false-positives

Timely reporting (real-time)

Very small reporting thresholds

72

Page 73: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Future work overview

73

Data structures & Algorithms

Scalable Data Systems

Data Science

Page 74: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Future work: Data Structures & Algorithms

Goal: Overcome decades-old data structure trade-offs using modern

hardware and new algorithmic paradigms

74

Page 75: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Trade-off 1: Insertion throughput degrades with load factor

Insertion throughput vs load factor of state-of-the-art filters

Many update-intensive applications (e.g., network caches, data analytics, etc.) maintain filters at high load factors

75

Page 76: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

16X drop4X

drop

Performance suffers due to high-overhead of collision resolution

76

Trade-off 1: Insertion throughput degrades with load factor

Page 77: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Combining techniques + new hardware

Combining hashing techniques (Robin Hood + 2-choice hashing)Using ultra-wide vector operations (AVX512-BW)

77

Page 78: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Combining techniques + new hardware

Combining hashing techniques (Robin Hood + 2-choice hashing)Using ultra-wide vector operations (AVX512-BW)

78

Constant high performance from

empty to full

More

Page 79: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Future work: Data Systems

Goal: Build a population-scale index on variation data to enable downstream apps

gain quick insights into variants

79

Page 80: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Country-scale sequencing efforts produce huge amounts of sequencing data

AAACGCCCGTACTT

AAACGCCCGTACTTA

CGTACTTA

AAACGCCCGTT

CCCGTACTTA

AAACGCCCGTACTTA

AAACGCCCGTACTT

AAAGTACTTA

AAACGCCCTA

CCCGTACTTA

Assembled data

Individuals

Raw sequencing data

Assembly…..ATGGAGATAGGATGAGATAGATGATAGA….…..ATGGAGATAGGATGAGATAGATGATAGA….…..ATGGAGATAGGATGAGATAGATGATAGA….…..ATGGAGATAGGATGAGATAGATGATAGA….…..ATGGAGATAGGATG

AGATAGATGATAGA….…..ATGGAGATAGGATGAGATAGATGATAGA….…..ATGGAGATAGGATGAGATAGATGATAGA….…..ATGGAGATAGGATGAGATAGATGATAGA….

Variant calling

Genomic variantsVariant call

Format (VCF)

Sequencing

● 1000 Genomes project [https://www.internationalgenome.org/]

● The Cancer Genome Atlas (TCGA) [https://portal.gdc.cancer.gov/]

● Genotype-Tissue Expression (GTEx) [https://gtexportal.org/home/]80

Page 81: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Variation data analysis can improve downstream applications● Population-level disease analysis

● Genome-wide association studies

● Personalized medicine

● Cancer remission-rate prediction

● Colocalization analysis

● PCR primer design

● Genome assembly

AAACGCCCGTACTTA

AAACGCCCGTACTTA

AAACGCCCGTACTTA

AAACGCCCGTACTTA

AAACGCCCGTACTTA

AAACGCCCGTACTTA

AAACGCCCGTACTTA

AAACGCCCGTACTTA

AAACGCCCGTACTTA

AAACGCCCGTACTTA

Sequencing & assembly

IndividualsPopulation Genomes

Return all positions with variants in a gene

List all people, with sequence S in a gene

Count the number of variants in a gene

List all people, with > N variants in a geneFor person P, return

the closest variant from position X

81

Page 82: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Indexing in multiple coordinates is challengingReference-only indexes map positions only in the reference coordinate system

......

Pan-genome analysis involves queries based on sample coordinate systems

Num Samples

Maintaining thousands of mappings increases computational complexity and memory footprintLimits scalability to population-scale data

82

Page 83: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Indexing in multiple coordinates is challengingReference-only indexes map positions only in the reference coordinate system

......

Pan-genome analysis involves queries based on sample coordinate systems

Num Samples

Maintaining thousands of mappings increases computational complexity and memory footprintLimits scalability to population-scale data

Existing systems don’t support multiple coordinate systems. The ones that do, don’t scale beyond a few

thousand samples.

83

Page 84: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

An inverted index on the pan-genome graph

● Partition the variation graph based on coordinate ranges

● Store partitions on disk

● Succinct index for reference coordinate system

● Local-graph exploration to map position from reference to sample coordinate

Position index

Variation graph

Queries often require loading 1-2 partitions

84

Page 85: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Future work: Data Science for genomics

Goal: Classification of metagenomic reads and identification of novel species using

graph neural networks (GNN)

85

Page 86: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Metagenomic classification pipeline

86

Binning

Profiling

[Ye et al. 2019]

Page 87: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Existing techniques offer low recall

87

Low recall

Classification is done based only on the read contents

Page 88: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Use overlap relationship between reads

88

Metagenomic reads

Overlap graphAssign taxonomic

labels to reads

Minimap2Node features:

Tetra Nucleotide freq,

GC bias, andTaxonomic embedding

Semi-supervised Learning GNN

Taxonomic binning of reads

● Generate overlap graph: reads→nodes & overlap →edges● Node features →Tetra nucleotide freq of reads● Reference-based mapping as ground truth labels

Page 89: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Overlap graph + graph neural network (GNN)

89

Higher recall

Can achieve high recall using graph learning

Page 90: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

https://prashantpandey.github.io

● Scalability of data management systems will be the biggest challenge in future

● Changing hardware give riseto new algorithmic paradigms

Conclusion

We need to redesign existing data structures to take full advantage of modern hardware and rebuild data systems to efficiently support future data science.

Data Science at Scale

ML Genomics Cyber Sec. NLP

Data Systems

Scale down Scale to disk Scale out

Modern hardwareVector inst. GPU NVM SSD

Data structures & Algorithms

90

Page 91: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

91

Page 92: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

92

Page 93: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Backup slides

93

Page 94: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Implementation:2 Meta-bits per slot.

h(x) --> h0(x) || h1(x)

2q

occupieds

runends

2qAbstract Representation

Quotient filter design

Page 95: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Implementation:2 Meta-bits per slot.

h(x) --> h0(x) || h1(x)

2q

occupieds

runends

2qAbstract Representation

Quotient filter design

Page 96: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Implementation:2 Meta-bits per slot.

h(x) --> h0(x) || h1(x)

2q

occupieds

runends

2qAbstract Representation

Quotient filter design

Page 97: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Implementation:2 Meta-bits per slot.

h(x) --> h0(x) || h1(x)

2q

occupieds

runends

2qAbstract Representation

Quotient filter design

Page 98: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Implementation:2 Meta-bits per slot.

h(x) --> h0(x) || h1(x)

2q

occupieds

runends

2qAbstract Representation

Quotient filter design

Page 99: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Implementation:2 Meta-bits per slot.

h(x) --> h0(x) || h1(x)

2q

occupieds

runends

2qAbstract Representation

Quotient filter design

Back

Page 100: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

● Quotient filters store h(x) exactly

● To store x exactly, use an invertible hash function

● For n elements and p-bit hash function:

Space usage: ~p-log2n bits/element

Quotient filters can also be exact

100

Page 101: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Cascade filter: write-optimized quotient filter[Bender et al. ‘12, Pandey et al. ‘17]

Mr1

MrL

Efficient merging

101

● The Cascade filter efficiently scales out-of-RAM● It accelerates insertions at some cost to queries

Page 102: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

L

0

1

RAM

FLASH

log(N/M)

Items are initially inserted in the RAM level

N

Quotient filterM

Mr1

MrL

Efficient merging

Cascade filter: flushing[Bender et al. ‘12, Pandey et al. ‘17]

102

Page 103: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Mr1

MrL

Efficient merging

Cascade filter: flushing[Bender et al. ‘12, Pandey et al. ‘17]

103

When RAM is full, items are flushed to the smallest level on disk i with space to insert items in level 0 to i-1

Page 104: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Mr1

MrL

Efficient merging

Cascade filter: flushing[Bender et al. ‘12, Pandey et al. ‘17]

104

When RAM is full, items are flushed to the smallest level on disk i with space to insert items in level 0 to i-1

Page 105: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Mr1

MrL

Efficient merging

Cascade filter: flushing[Bender et al. ‘12, Pandey et al. ‘17]

105

When RAM is full, items are flushed to the smallest level on disk i with space to insert items in level 0 to i-1

Page 106: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Mr1

MrL

Efficient merging

Cascade filter: flushing[Bender et al. ‘12, Pandey et al. ‘17]

106

When RAM is full, items are flushed to the smallest level on disk i with space to insert items in level 0 to i-1

Page 107: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Mr1

MrL

Efficient merging

Cascade filter: flushing[Bender et al. ‘12, Pandey et al. ‘17]

107

When RAM is full, items are flushed to the smallest level on disk i with space to insert items in level 0 to i-1

Page 108: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

L

0

1

RAM

FLASH

log(N/M)

N

Query (x)M

Mr1

MrL

Cascade filter: query[Bender et al. ‘12, Pandey et al. ‘17]

108

A query operation requires a lookup in each non-empty levelBack

Page 109: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Time-stretch LERT

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Divide each level into 1+ 1/𝛼, equal-sized bins.

Mr1

MrL

109

Page 110: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Time-stretch LERT

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

When a bin is full, items move to the adjacent bin

Mr1

MrL

110

Page 111: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Time-stretch LERT

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

When a bin is full, items move to the adjacent bin

Mr1

MrL

111

Page 112: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Time-stretch LERT

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Last bin flushed to first bin of the next level

Mr1

MrL

112

Page 113: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Time-stretch LERT

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Last bin flushed to first bin of the next level

While flushing consolidate counts; report if hits threshold

Mr1

MrL

113

Page 114: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Time-stretch LERT

L

0

1

RAM

FLASH

log(N/M)

N

Quotient filterM

Last bin flushed to first bin of the next level

While flushing consolidate counts; report if hits threshold

Mr1

MrL

Main idea: item is not put on a deeper level until it’s “aged sufficiently”

114

Page 115: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Time-stretch LERT I/O complexity

Optimal insert cost for Write-optimized data

structure

115

Page 116: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Time-stretch LERT I/O complexity

Extra cost because we only move one bin during a flush. Constant loss for

constant 𝛼

Optimal insert cost for Write-optimized data

structure

116Back

Page 117: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Trade-off 2: “One-size-fits-all” approach leaves performance on table

LIGRA [Shun & Blelloch ‘13] ASPEN [Dhulipala et al.

‘19]

LIGRA ASPEN

add_edge

get_neighbors

Static Dynamic

Neighbor access requires at least two cache missesFor dynamic, all operations have a log factor

117

Page 118: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Trade-off 2: “One-size-fits-all” approach leaves performance on table

Compressed Sparse Rows [Shun & Blelloch ‘13]

C-tree [Dhulipala et al.

‘19]

LIGRA ASPEN

add_edge

get_neighbors

Static Dynamic

Neighbor access requires at least two cache missesFor dynamic, all operations have a log factor

118

Static → Fast computations; no updatesDynamic → Slower computations; updates

Page 119: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Real world graphs are often skewed

High variance in the degree distribution

● Dynamic partitioning of vertices based on the degree

● Separate structures for each partition to minimize cache misses

119

Page 120: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Dynamic partitioning + hierarchical structure

High variance in the degree distribution

● Dynamic partitioning of vertices based on the degree

● Separate structures for each partition to minimize cache misses

120

Page 121: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Dynamic partitioning + hierarchical structure

High variance in the degree distribution

● Dynamic partitioning of vertices based on the degree

● Separate structures for each partition to minimize cache misses

121

Terrace: Fast updates Terrace:

Faster computations

Back

Page 122: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Scalable data systems → Scalable data science

122

BiologyAstroPhysics

ChemistryCyber monitoringInternet of Things

Environmental science.....

Massive data Data Science

Machine LearningRaw sequence search

Graph analyticsCyber monitoring

Weather predictionsPersonalized medicine

.

.

.

.

.

Data systems

My goal as a researcher is to build scalable data systems to accelerate and scale data science applications

Page 123: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across

Our contribution

Combine streaming and EM algorithms to solve real-time event detection problem

Streaming model

External memory algorithms

123