class 6 more about column-store plans and...
TRANSCRIPT
more about column-store plansand compression
prof. Stratos IdreosHTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/
class 6
CS165, Fall 2016 Stratos Idreos /30
query compilation
2
an “ancient” yet new topic/research challenge
query->SQL->interpet to a generic query plan->c functions
query->SQL->generate and link tailored c code
CS165, Fall 2016 Stratos Idreos /30
query compilation
2
an “ancient” yet new topic/research challenge
query->SQL->interpet to a generic query plan->c functions
query->SQL->generate and link tailored c code
why this can be better/faster
CS165, Fall 2016 Stratos Idreos /30
query compilation
2
an “ancient” yet new topic/research challenge
query->SQL->interpet to a generic query plan->c functions
query->SQL->generate and link tailored c code
why this can be better/faster
why it was not ideal in the past
CS165, Fall 2016 Stratos Idreos /303
Efficiently Compiling Efficient Query Plans for Modern HardwareThomas Neumann In Proc. of the Very Large Databases Conference (VLDB), 2011
H2O: A Hands-free Adaptive StoreIoannis Alagiannis, Stratos Idreos, and Anastassia Ailamaki In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2014
Generating code for holistic query evaluationKonstantinos Krikellas, Stratis Viglas, Marcelo Cintra: In Proc. of the Inter. Conference on Data Engineering (ICDE), 2010
Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, Alfons Kemper In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2016
CS165, Fall 2016 Stratos Idreos /30
research night coming up
4
during one of the labs next week (will announce in piazza)
manos & stratos
can we have bitmap indexing that works great for both reads and writes?
UpBit: Scalable In-Memory Updatable Bitmap IndexingM. Athanassoulis, Z. Yan, and S. Idreos In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2016
CS165, Fall 2016 Stratos Idreos /305
working over fixed width & dense columns
for (i=0;i<size;i++) if column[i]>v
res[j++]=i
no function calls, no indirections, no auxiliary data, min ifs easy to prefetch next data values
for (i=0;i<size;i++) inter2[j++]=column[inter1[i]]
select
fetch
CS165, Fall 2016 Stratos Idreos /306
select min(C) from R where A<10 & B<20
B<20 minCA<10A B C D IDs B CIDs
A B C D B<20 minCA<10 IDs B CIDs
column-
vector-
CS165, Fall 2016 Stratos Idreos /307
update row7=(A=a,B=b,C=c,D=d)
A B C D A B C D
vs
cost: 1 page cost: N pages, N=# of columns
CS165, Fall 2016 Stratos Idreos /30
A
8
A B C D
B C D
base data pending updates
updatequery
periodically
CS165, Fall 2016 Stratos Idreos /309
a1 a2 a3 a4 a5 a6
b1 b2 b3 b4 b5 b6
c1 c2 c3 c4 c5 c6
tuple 1tuple 2tuple 3tuple 4tuple 5tuple 6
A B C
…
relational table
inserts, deletes affect whole table
CS165, Fall 2016 Stratos Idreos /3010
A
pending inserts pending deletesupdate= delete followed by insert
what information do we need to remember
CS165, Fall 2016 Stratos Idreos /3011
registers
on chip cache
on board cache
memory
disk
CPU
data
compute
time
speed cpu
mem
compression=
data
&
computation
but
CS165, Fall 2016 Stratos Idreos /3012
value1 value2 value3 value1 value1 value4 value2 value3 value5
…
8 bytes width
value1 value2 value3 value4 value5
001 010 011 001 001 100 010 011 101 …
8 bytes width
3 bits width
how many bits do we need for the codes (width)
CS165, Fall 2016 Stratos Idreos /3013
A B C D A B C D
which one gives better compression
CS165, Fall 2016 Stratos Idreos /3014
value1 value2 value3 value4 value5
001 010 011 001 001 100 010 011 101 …
8 bytes width
3 bits width think about
billions of entries…
can we do any better
CS165, Fall 2016 Stratos Idreos /3015
A
can we do something like huffman coding any side-effects
A dictionary
(check: Business Analytics in (a) Blink from readings)
CS165, Fall 2016 Stratos Idreos /3016
value1 value2 value3 value4 value5
001 010 011 001 001 100 010 011 101 …
8 bytes width
3 bits width
and how do we process datae.g., select A<10
CS165, Fall 2016 Stratos Idreos /3017
fixed-width is key… ok and how do we store variable length data?
dictionary of strings
fixed width codes that point to dictionary entries why
CS165, Fall 2016 Stratos Idreos /3018
compression is a bonus task for the project
Joins on Encoded and Partitioned Data Jae-Gil Lee, Gopi K. Attaluri, Ronald Barber, Naresh Chainani, Oliver Draese, Frederick Ho,
Stratos Idreos, Min-Soo Kim, Sam Lightstone, Guy M. Lohman, Konstantinos Morfonios, Keshava Murthy, Ippokratis Pandis, Lin Qiao, Vijayshankar Raman, Vincent Kulandai Samy,
Richard Sidle, Knut Stolze, Liping Zhang In Proc. of the Very Large Databases Conference (VLDB), 2014
IBM Blue
CS165, Fall 2016 Stratos Idreos /3019
B<20 minCA<10 IDs B CIDs
late tuple reconstruction/materialization only reconstruct to present results
no need to assemble tuples minimize memory footprint minimize data we are moving up the memory hierarchy but requires new processing engine
CS165, Fall 2016 Stratos Idreos /3020
Assume a column-store database with a table R(A,B,C,D,E). All attributes are integers. Our workload has two classes of queries:
1) select max(B),max(C),max(D),max(E) from R where A>v1 2) select B+C+D+E from R where A>v1
Should we use late or early tuple reconstruction plans? For each query, draw the 2 possible plans and the respective operators, explain which one is best and give the total cost.
early vs late tuple reconstruction
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3021
sel A IDs B max
sel A IDs B
result
max(B), max(C) max(D), max(E)
C D E
late TR
hybrid
IDs C max IDs D maxresult
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3022
sA IDs B
sA IDs B result+(B,C,D,E)C D E
late TR
r1 +(r1,r2) r3IDs C r2 IDs D r4 +(r3,r4) res
hybrid
CS165, Fall 2016 Stratos Idreos /3023
defaultlate tuple reconstruction (rather safe choice)
open topic for optimizationwhen and how to do selective early reconstruction
issues to considertransformation overhead materialization overhead extra passes over the data it may be almost for free (sometimes in hash joins) dynamic code generation to fit data layouts
DSM vs. NSM: CPU performance tradeoffs in block-oriented query processingMarcin Zukowski, Niels Nes, Peter A. Boncz International Workshop on Data Management on New Hardware (DaMoN) 2008
CS165, Fall 2016 Stratos Idreos /3024
essential column-stores featuresvirtual ids late tuple reconstruction (if ever) vectorized execution compression fixed-width columns
0"
5"
10"
15"
20"
25"
30"
35"
40"
45"
Column"Store" Row"Store"
Run$
me'(sec)'
Performance'of'Column3Oriented'Op$miza$ons'
–Late"Materializa:on"
–Compression"
–Join"Op:miza:on"
–Tuple@at@a@:me"
Baseline"
Column-stores vs. row-stores: how different are they really? D. Abadi, S. Madden, and N. Hachem
ACM SIGMOD Conference on Management of Data, 2008
CS165, Fall 2016 Stratos Idreos /3025
disk memoryA B C D
A
ABCrow-store
engineearly tuple
reconstruction/materialization
option1
option2
column-store
engine
CS165, Fall 2016 Stratos Idreos /3026
but why now…weren’t all those design options obvious in the past as well?
moving data from disk
moving data from memory
computation 1) big memories 2) cpu vs memory speed
CS165, Fall 2016 Stratos Idreos /30
main-memory systems
27
optimized for the memory wall
with or without persistent data
CS165, Fall 2016 Stratos Idreos /3028
other system categoriesnoSQL, newSQL, key-value stores, matlab, etc..
column-stores = bad name modern systems
CS165, Fall 2016 Stratos Idreos /3029
research papers
Integrating compression and execution in column-oriented database systemsDaniel J. Abadi, Samuel Madden, Miguel Ferreira In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2006
Updating a cracked databaseStratos Idreos, Martin Kersten, Stefan Manegold In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2007
Positional update handling in column storesSándor Héman, Marcin Zukowski, Niels J. Nes, Lefteris Sidirourgos, Peter A. Boncz In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2010
Column-stores vs. row-stores: how different are they really? D. Abadi, S. Madden, and N. Hachem In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2008
DATA SYSTEMSprof. Stratos Idreos
class 6
more about column-store plans and compression