adaptive query processing on raw data

Post on 09-Jul-2015

99 Views

Category:

Data & Analytics

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Presentation of RAW, a prototype query engine which enables querying heterogeneous data sources transparently using Just-In-Time access paths. Presentation given at the 40th International Conference on Very Large Databases (VLDB 2014)

TRANSCRIPT

Adaptive Query Processing on RAW Data

Manos Karpathiotakis, Miguel Branco, Ioannis Alagiannis, Anastasia Ailamaki

Many “data problems” lack solutions

• Petabytes of data

• No a priori knowledge about data

• Multiple file formats

2

Many “data problems” lack solutions

• Petabytes of data

• No a priori knowledge about data

• Multiple file formats

3Need declarative, interactive access to arbitrary datasets

CSV JSONXML

DBMSTr

ansf

orm

QueryRAW: A query engine adapting to data

Raw files as native storage format

RAW: Generate plug-in per file

4

Traditional DBMS: Data adapts to query engine

CSV JSONXML

QueryRAW: A query engine adapting to data

Raw files as native storage format

RAW: Generate plug-in per file

5

Traditional DBMS: Data adapts to query engine

CSV JSONXML

QueryRAW: A query engine adapting to data

Raw files as native storage format

RAW: Generate plug-in per file

6

High-performance querying… while keeping data formats, files, and scripts

Traditional DBMS: Data adapts to query engine

How RAW adapts to data

CSVROOT

Vectorized Query Execution

CSVROOT

join

scanplaceholder

scanplaceholder

filter

How RAW adapts to data

scanplaceh.

SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND …

8

Code Generate the Access Paths

Vectorized Query Execution

CSVROOT

join

filter

How RAW adapts to data

scancsv

scanroot

SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan

csv

9

Code Generate the Access Paths

Build Position and Data Caches

Vectorized Query Execution

CSVROOT

join

filter

How RAW adapts to data

scancsv

scanroot

SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan

csv

10

Code Generate the Access Paths

Build Position and Data Caches

Vectorized Query Execution

CSVROOT

join

filter

How RAW adapts to data

scancsv

scanroot

SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan

csv

11

Adapt to format, file instance and queryjust-in-time

Adapting to schema & query

[CSV input]

∀col:

if col needed:

if col isInt

readInt();

if col isFloat

readFloat();

if ...

else:

skipField();

GENERAL-PURPOSE

readInt();

readInt();

skipField();

readFloat();

skipRestLine();

JUST-IN-TIME

Remove overhead of generic operators

12

Adapting to format

• Unroll Columns

• Free navigation in files

• Embedded indexes/existing APIs

13

Adapting to format

• Unroll Columns

• Free navigation in files

• Embedded indexes/existing APIs

∀col:if col needed:

if col isInt...

readInt();skipField();readFloat();skipRest();

14

Adapting to format

• Unroll Columns

• Free navigation in files

• Embedded indexes/existing APIs

∀col:if col needed:

if col isInt...

readInt();skipField();readFloat();skipRest();

- fieldLength:10- tupleLength:100- Need fields 2 & 5

of 2nd row

moveTo(110);readInt();moveTo(140);readFloat();

15

Adapting to format

• Unroll Columns

• Free navigation in files

• Embedded indexes/existing APIs

∀col:if col needed:

if col isInt...

readInt();skipField();readFloat();skipRest();

- fieldLength:10- tupleLength:100- Need fields 2 & 5

of 2nd row

moveTo(110);readInt();moveTo(140);readFloat();

- Bitmaps, R-Trees etc.- readNextField() vs. readField(filename,id)

16

Ad-Hoc Operators for Raw Data Fine-grained, raw-data-aware decisions

Ad-Hoc Operators for Raw Data

Col9

Scan CSVColumns 1,9

Filter

Tuple Construction

Col1

Col1

JIT – OPTION 1

Col1 Col9

CSV file:SELECT col9 WHERE col1 < [X]

Fine-grained, raw-data-aware decisions

18

JIT – OPTION 2

Col1

Ad-Hoc Operators for Raw Data

Col9

Scan CSVColumns 1,9

Filter

Tuple Construction

Col1

Col1

JIT – OPTION 1

Col1 Col9

CSV file:SELECT col9 WHERE col1 < [X]

Scan CSVColumn 1

Fine-grained, raw-data-aware decisions

19

JIT – OPTION 2

Filter

Col1

Col1

543

12

Ad-Hoc Operators for Raw Data

Col9

Scan CSVColumns 1,9

Filter

Tuple Construction

Col1

Col1

JIT – OPTION 1

Col1 Col9

CSV file:SELECT col9 WHERE col1 < [X]

Scan CSVColumn 1

Fine-grained, raw-data-aware decisions

20

JIT – OPTION 2

Filter

Tuple Construction

Col1 Col9

Col1

Col9

Col1

543

12

Ad-Hoc Operators for Raw Data

Col9

Scan CSVColumns 1,9

Filter

Tuple Construction

Col1

Col1

JIT – OPTION 1

Col1 Col9

CSV file:SELECT col9 WHERE col1 < [X]

Scan CSVColumn 1

Fine-grained, raw-data-aware decisions

Scan CSVColumn 9

21

Processes 3/5 raw col9 fields

Electron

eventID INT

eta FLOAT

pt FLOAT

Jet

eventID INT

eta FLOAT

pt FLOATEvent

eventID INT

runNumber INT

Muon

eventID INT

eta FLOAT

pt FLOAT

ROOT - C++ RAWclass Event {

class Muon {float pt, eta;…

}class Electron {

float pt, eta;…

}class Jet {

float pt, eta;…

}int runNumber;vector<Muon> muons;vector<Electron> electrons;vector<Jet> jets; }

Finding the Higgs Boson: Data

22

Finding the Higgs Boson: Queries“Identify events of interest → Filter out background events

→ Plot aggregated results in a histogram”

SELECT eventFROM root:/data1/ATLAS/*.root ,

csv:/data1/ATLAS/events.csvWHERE( csv.id = event.id AND

event.EF_e24vhi_medium1 OR event.EF_e60_medium1 OR event.EF_2e12Tvh_loose1 OR event.EF_mu24i_tight OR event.EF_mu36_tight OR event.EF_2mu13) ANDevent.muon.mu_ptcone20 < 0.1 * event.muon.mu_pt ANDevent.muon.mu_pt > 20000. ANDABS(event.muon.mu_eta) < 2.4 AND

…..

1000+ lines of C++

for (unsigned int imuon = 0 ; imuon<((*curr_entries)[jentry].mu_pt)->size(); imuon++) {if (((*curr_entries)[jentry].

mu_ptcone20)->at(imuon) < 0.1 * ((*curr_entries)[jentry].mu_pt)->at(imuon) && ((*curr_entries)[jentry].mu_pt)->at(imuon) > 20000. &&fabs(((*curr_entries)[jentry].mu_eta)->at(imuon)) < 2.4 &&…

}...

ROOT - C++ RAW

23

0.1

1

10

100

1000

10000

Query 1(Cold)

Query 2 Query 3 Query 4 Query 5 Query 6

Exe

cuti

on

Tim

e (

sec)

RAW ROOT

RAW vs. the ROOT framework[Xeon CPU E7-28867 @ 2.13GHz1TB HDD - 7200RPM,192GB RAM]

ROOT: 900 GB in 127 files

CSV: 1 “table” of IDs

24

0.1

1

10

100

1000

10000

Query 1(Cold)

Query 2 Query 3 Query 4 Query 5 Query 6

Exe

cuti

on

Tim

e (

sec)

RAW ROOT

RAW vs. the ROOT framework[Xeon CPU E7-28867 @ 2.13GHz1TB HDD - 7200RPM,192GB RAM]

ROOT: 900 GB in 127 files

CSV: 1 “table” of IDs

25Declarative access and up to 90x improvement

Accessing raw data is DBMS business

• Adapt to data and queries

• CERN: 90x

• Application in Banking / Healthcare / Astronomy

http://dias.epfl.ch/RAW

26

Image Sources

http://www.nasa.gov

http://home.web.cern.ch/

http://www.alacergroup.com/

http://www.humanbrainproject.eu

28

top related