class 5 column stores 2 - harvard university

Page 1: class 5 column stores 2 - Harvard University

column stores 2.0

prof. Stratos Idreos

HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/

class 5

Page 2: class 5 column stores 2 - Harvard University

CS165, Fall 2017 Stratos Idreos 2/28

what just happened? where is my data?

email, cloud, social media, …

can we design systems that let us know what is going on?

worth thinking about…

Page 3: class 5 column stores 2 - Harvard University

cool papers 2.0

The Case for RodentStore: An Adaptive, Declarative Storage System
Philippe Cudré-Mauroux, Eugene Wu, Samuel Madden
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2009

Abstraction Without Regret in Database Systems Building: a Manifesto
Christoph Koch
IEEE Data Eng. Bull. 37(1): 70-79 (2014)

dbTouch: Analytics at your Fingertips
Stratos Idreos and Erietta Liarou
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2013

Page 4: class 5 column stores 2 - Harvard University

design doc
think, design, create a 1-2 page PDF doc and ask for feedback
mandatory for M1-M3, optional afterwards

submit through Canvas

do not worry about perfection: fail fast
wrong ideas are ok if you eventually find out they are wrong :)
(holds for midterms as well)

Page 5: class 5 column stores 2 - Harvard University

Jim Gray (IBM, Tandem, DEC, Microsoft)
ACM Turing award
ACM SIGMOD Edgar F. Codd Innovations Award

level            cost     where the data is   time to get it
disk             100Kx    Pluto               2 years
memory           100x     New York            1.5 hours
on board cache   10x      this building       10 min
on chip cache    2x       this room           1 min
registers                 my head             ~0

Page 6: class 5 column stores 2 - Harvard University

the way we store data defines the possible (efficient) access methods

Page 7: class 5 column stores 2 - Harvard University

free_offset, N, offset1-length1, offset2-length2,…

free space

slotted page

scan null

update var length
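A minimal sketch of the slotted page above, assuming the header holds free_offset, N and one (offset, length) pair per record. PAGE_SIZE, MAX_SLOTS and the function names are hypothetical and chosen only for this illustration; for simplicity the header is kept outside the 4096 data bytes and records are appended from the front of the page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_SLOTS 64   /* hypothetical cap, chosen only for this sketch */

/* One slot-directory entry: where a record starts and how long it is. */
typedef struct {
    uint16_t offset;
    uint16_t length;
} Slot;

/* Header (free_offset, N, slots) followed by the raw page bytes.
 * Everything from free_offset to the end of data[] is free space. */
typedef struct {
    uint16_t free_offset;
    uint16_t n;
    Slot     slots[MAX_SLOTS];
    uint8_t  data[PAGE_SIZE];
} SlottedPage;

void page_init(SlottedPage *p) {
    p->free_offset = 0;
    p->n = 0;
}

/* Append a variable-length record; returns its slot id, or -1 if it does not fit. */
int page_insert(SlottedPage *p, const void *rec, uint16_t len) {
    if (p->n == MAX_SLOTS || (size_t)p->free_offset + len > PAGE_SIZE)
        return -1;
    memcpy(p->data + p->free_offset, rec, len);
    p->slots[p->n].offset = p->free_offset;
    p->slots[p->n].length = len;
    p->free_offset += len;
    return p->n++;
}

/* Look up a record by slot id; returns NULL for an unknown slot. */
const uint8_t *page_get(const SlottedPage *p, int slot, uint16_t *len) {
    if (slot < 0 || slot >= p->n)
        return NULL;
    *len = p->slots[slot].length;
    return p->data + p->slots[slot].offset;
}
```

Because every record is reached through its slot, a variable-length update that no longer fits in place can move the record and rewrite only the slot entry, at the price of one extra indirection on every access.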

Page 8: class 5 column stores 2 - Harvard University

row-store: ABCD (attributes stored together, tuple by tuple)
column-store: A B C D (each attribute stored separately)

Page 9: class 5 column stores 2 - Harvard University

          A    B    C
tuple 1   a1   b1   c1
tuple 2   a2   b2   c2
tuple 3   a3   b3   c3
tuple 4   a4   b4   c4
tuple 5   a5   b5   c5
tuple 6   a6   b6   c6

virtual ids / positional alignment

positional lookups/joins: A(i) = A + i * width(A)

fixed-width + dense

columns do not need to have the same width
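The positional lookup A(i) = A + i * width(A) can be sketched as plain pointer arithmetic. The Column struct and the helper names below are assumptions made for this illustration, not part of the course code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A fixed-width, dense column: n values of `width` bytes each, back to back. */
typedef struct {
    const uint8_t *base;   /* A: start of the column */
    size_t         width;  /* width(A): bytes per value */
    size_t         n;      /* number of values */
} Column;

/* A(i) = A + i * width(A): address of the value at position i. */
const void *column_at(const Column *c, size_t i) {
    assert(i < c->n);
    return c->base + i * c->width;
}

/* Typed helpers that read the value stored at position i. */
int32_t column_get_i32(const Column *c, size_t i) {
    int32_t v;
    memcpy(&v, column_at(c, i), sizeof v);
    return v;
}

int64_t column_get_i64(const Column *c, size_t i) {
    int64_t v;
    memcpy(&v, column_at(c, i), sizeof v);
    return v;
}
```

With a 4-byte column A and an 8-byte column B, position 2 in both columns still refers to the same tuple: alignment is by position, not by byte offset, which is why the columns do not need the same width.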

Page 10: class 5 column stores 2 - Harvard University

today: column-stores 2.0

Page 11: class 5 column stores 2 - Harvard University

select min(C) from R where A<10 & B<20

[figure: columns A B C D on disk; in memory: A<10 -> IDs -> B -> B<20 -> IDs -> C -> min C]

query plan = select -> fetch -> select -> fetch -> min

sequential access patterns, max 1 if
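One way to read the plan select -> fetch -> select -> fetch -> min is as a chain of tight loops over int columns. The sketch below assumes int values and caller-provided scratch buffers; the function names are hypothetical. Note the extra step after the second select: it runs over the fetched B values, so its output has to be mapped back to base positions before fetching C.

```c
#include <assert.h>
#include <stddef.h>

/* select: write the positions i with col[i] < v into out; return how many. */
size_t select_lt(const int *col, size_t n, int v, size_t *out) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] < v)
            out[j++] = i;
    return j;
}

/* fetch: gather col[pos[i]] for every position in pos. */
void fetch(const int *col, const size_t *pos, size_t n, int *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = col[pos[i]];
}

/* min over a non-empty array. */
int min_of(const int *vals, size_t n) {
    assert(n > 0);
    int m = vals[0];
    for (size_t i = 1; i < n; i++)
        if (vals[i] < m)
            m = vals[i];
    return m;
}

/* select min(C) from R where A<10 & B<20, column at a time.
 * ids1, ids2 and vals are scratch buffers with room for n entries each.
 * Returns 1 and sets *result if any row qualifies, 0 otherwise. */
int min_c_query(const int *A, const int *B, const int *C, size_t n,
                size_t *ids1, size_t *ids2, int *vals, int *result) {
    size_t n1 = select_lt(A, n, 10, ids1);      /* select A<10 -> IDs */
    fetch(B, ids1, n1, vals);                   /* fetch B at those IDs */
    size_t n2 = select_lt(vals, n1, 20, ids2);  /* select B<20 -> offsets into ids1 */
    for (size_t i = 0; i < n2; i++)             /* map offsets back to base positions */
        ids2[i] = ids1[ids2[i]];
    fetch(C, ids2, n2, vals);                   /* fetch C at the final IDs */
    if (n2 == 0)
        return 0;
    *result = min_of(vals, n2);                 /* min */
    return 1;
}
```

Every loop walks its input sequentially and contains at most one if, which is the property the slide highlights.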

Page 17: class 5 column stores 2 - Harvard University

working over fixed width & dense columns

select:
for (i = 0, j = 0; i < size; i++) if (column[i] > v) inter1[j++] = i;

fetch (here size is the number of positions in inter1):
for (i = 0, j = 0; i < size; i++) inter2[j++] = column[inter1[i]];

no function calls, no indirections, no auxiliary data, min ifs
easy to prefetch next data values

with data being memory resident these become significant cost components

Page 19: class 5 column stores 2 - Harvard University

[figure: query plan A<10 -> IDs -> B -> B<20 -> IDs -> C -> min C]

alt1) start with B
alt2) scan A & B independently and merge
alt3) store intermediates as bit vectors - not positions
…

project: basic one + more if you decide to invest in this area
midterm: basic one + 2-3 alternatives
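As an illustration of alt2 and alt3 combined, the sketch below scans A and B independently, stores each selection as a bit vector (one bit per position) and merges the two with a bitwise AND. The function names and the 64-bit word layout are assumptions made for this example, not a prescribed design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 64-bit words needed to hold n bits. */
#define WORDS(n) (((n) + 63) / 64)

/* Scan a column and set bit i when col[i] < v. bits must hold WORDS(n) words. */
void select_lt_bits(const int *col, size_t n, int v, uint64_t *bits) {
    for (size_t w = 0; w < WORDS(n); w++)
        bits[w] = 0;
    for (size_t i = 0; i < n; i++)
        bits[i / 64] |= (uint64_t)(col[i] < v) << (i % 64);
}

/* Merge two selections over n positions: a = a AND b. */
void bits_and(uint64_t *a, const uint64_t *b, size_t n) {
    for (size_t w = 0; w < WORDS(n); w++)
        a[w] &= b[w];
}

/* min of col over the positions whose bit is set.
 * Returns 1 and sets *result if any bit is set, 0 otherwise. */
int min_where(const int *col, size_t n, const uint64_t *bits, int *result) {
    int found = 0;
    for (size_t i = 0; i < n; i++) {
        if ((bits[i / 64] >> (i % 64)) & 1) {
            if (!found || col[i] < *result)
                *result = col[i];
            found = 1;
        }
    }
    return found;
}
```

The select loop here has no if at all, and the intermediate costs one bit per tuple regardless of selectivity. A position list is smaller when few tuples qualify, so which representation wins depends on the data.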

Page 20: class 5 column stores 2 - Harvard University

[figure: query plan A<10 -> IDs -> B -> B<20 -> IDs -> C -> min C]

late tuple reconstruction/materialization
only reconstruct to present results

no need to assemble tuples
minimize memory footprint
minimize data we are moving up the memory hierarchy
but requires new processing engine

Page 21: class 5 column stores 2 - Harvard University

[figure: columns A B C D move from disk to memory, then one of two options]

option1: early tuple reconstruction/materialization (ABC), then a row-store engine

option2: column-store engine working directly on the columns (A)

Page 22: class 5 column stores 2 - Harvard University

possible data flow patterns
tuple at a time
block/vector at a time
column at a time

[figure: query plan A<10 -> IDs -> B -> B<20 -> IDs -> C -> min C]

Page 23: class 5 column stores 2 - Harvard University

select min(C) from R where A<10 & B<20

[figure: the query plan (A<10 -> IDs -> B -> B<20 -> IDs -> C -> min C) over columns A B C D, drawn twice, with labels "column-" and "vector-"]

Page 24: class 5 column stores 2 - Harvard University

Marcin Zukowski, PhD
CEO/Co-founder of Vectorwise (now Actian)
now: “changing the world, one terabyte at a time”, co-founder of Snowflake

the beer analogy

Page 25: class 5 column stores 2 - Harvard University

memory hierarchy: registers, on chip cache, on board cache, memory, disk
(faster towards the CPU, cheaper towards disk)

[figure: query plan with operators op1, op2, op3 passing vectors of columns A and B between them; label: size of vector]

Page 26: class 5 column stores 2 - Harvard University

tuple at a time - good for minimizing memory footprint
bulk processing - good for minimizing functional overhead

vectorized processing - somewhere in between
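A rough sketch of the in-between point: the same query (select min(C) from R where A<10 & B<20) processed one vector at a time, so that every intermediate is bounded by the vector size instead of the column size. VECTOR_SIZE and the function name are hypothetical; the value 4 is tiny on purpose so the example crosses a vector boundary.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_SIZE 4   /* hypothetical; in practice sized so vectors fit in cache */

/* select min(C) from R where A<10 & B<20, one vector at a time.
 * Returns 1 and sets *result if any row qualifies, 0 otherwise. */
int min_c_vectorized(const int *A, const int *B, const int *C, size_t n,
                     int *result) {
    size_t ids[VECTOR_SIZE];   /* intermediates never exceed one vector */
    int found = 0;
    for (size_t start = 0; start < n; start += VECTOR_SIZE) {
        size_t len = n - start < VECTOR_SIZE ? n - start : VECTOR_SIZE;

        /* op1: select A<10 over this vector */
        size_t k = 0;
        for (size_t i = 0; i < len; i++)
            if (A[start + i] < 10)
                ids[k++] = start + i;

        /* op2: fetch B at the surviving positions and select B<20 */
        size_t m = 0;
        for (size_t i = 0; i < k; i++)
            if (B[ids[i]] < 20)
                ids[m++] = ids[i];

        /* op3: fetch C and fold it into the running min */
        for (size_t i = 0; i < m; i++) {
            if (!found || C[ids[i]] < *result)
                *result = C[ids[i]];
            found = 1;
        }
    }
    return found;
}
```

Each operator still runs as a tight loop over many values (low per-tuple overhead, as in bulk processing), while the memory footprint stays at one vector of positions (closer to tuple at a time).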

Page 27: class 5 column stores 2 - Harvard University

history/timeline

~1960s: tuple at a time
1980s: ideas about block processing
2005: vectorwise
>2010: industry adoption

[further timeline labels: tuple at a time, tuple at a time]

Page 28: class 5 column stores 2 - Harvard University

project: column-at-a-time

bonus: vectorized processing

Page 29: class 5 column stores 2 - Harvard University

update row7=(A=a,B=b,C=c,D=d)

row-store (ABCD) vs column-store (A B C D)

which is better to update and why?
how much does it cost to update a single row? (think about pages, data movement)
how to update in column-stores? (query plan + algorithms)

Page 30: class 5 column stores 2 - Harvard University

[figure: base data (columns A B C D) next to pending updates (A B C D); an update goes to the pending updates, a query reads both, and the two are merged periodically]
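One possible reading of the base data / pending updates picture, sketched for a single int column: updates are buffered, queries consult both the base data and the buffer, and a merge step folds the buffer into the base data. The struct, the buffer size and the function names are assumptions for this illustration, and only in-place value updates are covered (no inserts or deletes).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16   /* hypothetical buffer size for this sketch */

/* One buffered update: "set position pos to value val". */
typedef struct { size_t pos; int val; } Pending;

typedef struct {
    int    *base;                  /* base data: fixed-width, dense */
    size_t  n;
    Pending pending[MAX_PENDING];  /* pending updates, in arrival order */
    size_t  n_pending;
} UpdatableColumn;

/* Fold all pending updates into the base data (the "periodically" step). */
void column_merge(UpdatableColumn *c) {
    for (size_t i = 0; i < c->n_pending; i++)
        c->base[c->pending[i].pos] = c->pending[i].val;
    c->n_pending = 0;
}

/* update: buffer the change instead of touching the base data;
 * merge first if the buffer is full. */
void column_update(UpdatableColumn *c, size_t pos, int val) {
    assert(pos < c->n);
    if (c->n_pending == MAX_PENDING)
        column_merge(c);
    c->pending[c->n_pending++] = (Pending){pos, val};
}

/* query: read the base data, then let the newest pending update win. */
int column_read(const UpdatableColumn *c, size_t pos) {
    assert(pos < c->n);
    int v = c->base[pos];
    for (size_t i = 0; i < c->n_pending; i++)
        if (c->pending[i].pos == pos)
            v = c->pending[i].val;
    return v;
}
```

Updating a whole row such as row7=(A=a,B=b,C=c,D=d) would issue one column_update per column, which is exactly the cost the questions on the previous slide ask about.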

Page 31: class 5 column stores 2 - Harvard University

fractured mirrors

[figure: a query goes to the optimizer, which picks between a columns copy (A B C D) and a rows copy (ABCD)]

A case for fractured mirrors
Ravishankar Ramamurthy, David J. DeWitt, Qi Su
Very Large Databases Journal, 12(2): 89-101, 2003

Page 32: class 5 column stores 2 - Harvard University

Notes to remember

column-stores great for analytics

row-stores great for transactions

still basic concepts are the same

hybrids possible

keep access patterns sequential and simple (min ifs)

Page 33: class 5 column stores 2 - Harvard University

reading

Read: The Design and Implementation of Modern Column-store Database Systems (Sections: all except 4.6 & 4.8)
by D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden

Read: IEEE Data Engineering Bulletin, 35(1), March 2012
Special Issue on Column-stores (9 short overview papers)

Page 34: class 5 column stores 2 - Harvard University

research papers

Read: Database Architecture Optimized for the New Bottleneck: Memory Access
Peter Boncz, Stefan Manegold, Martin Kersten
In Proc. of the Very Large Databases Conference (VLDB), 1999

Browse: MonetDB/X100: Hyper-Pipelining Query Execution
Peter A. Boncz, Marcin Zukowski, Niels Nes
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2005

Browse: Materialization Strategies in a Column-Oriented DBMS
Daniel Abadi, Daniel Myers, David DeWitt, Samuel Madden
In Proc. of the Inter. Conference on Data Engineering (ICDE), 2007

Browse: Self-organizing tuple reconstruction in column-stores
Stratos Idreos, Martin Kersten, Stefan Manegold
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2009

Page 35: class 5 column stores 2 - Harvard University

DATA SYSTEMS

prof. Stratos Idreos

class 5

column-stores 2.0