ig data systems - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf ·...

28
BIG D ATA SYSTEMS CS 564 Fall 2015 ACKs: Magda Balazinska

Upload: others

Post on 21-Jul-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

BIG DATA SYSTEMS

CS  564-­‐‑ Fall  2015

ACKs:  Magda  Balazinska

Page 2: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

BIG DATA

Definition  from  industry:  • high  volume  • high  variety  • high  velocity

2CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 3: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

VOLUME

• Databases  parallelize  easily;  techniques  available  from  the  80’s  (GAMMA  project)– data  partitioning  – parallel  query  processing  

• SQL  is  embarrassingly  parallel  

3CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 4: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

VARIETY

• complex  workloads:– Machine  Learning  tasks:  e.g.  click  prediction,  topic  modeling,  SVM,  k-­‐‑means  

• various  types  of  data:– text  data– semi-­‐‑structured  data– graph  data– multimedia  (video,  photos)

4CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 5: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

VELOCITY

• data  is  generated  very  fast  and  needs  to  be  processed  very  fast– real  time  analytics– data  streaming  (each  data  item  can  be  processed  only  once!)

5CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 6: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

ANOTHER V:  VERACITY

The  data  collected  is  often  uncertain• inconsistent  data• incomplete  data• ambiguous  data  

Example:  sensor  data

6CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 7: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

DATA LANDSCAPE

7CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 8: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

SOME EXAMPLES

• Greenplum:  founded  in  2003  acquired  by  EMC  in  2010.  A  parallel  shared-­‐‑nothing  DBMS  

• Vertica:  founded  in  2005  and  acquired  by  HP  in  2011.  A  parallel  column-­‐‑store  shared-­‐‑nothing  DBMS

• AsterData:  founded  in  2005  acquired  by  Teradata  in  2011.  A  parallel,  shared-­‐‑nothing,  MapReduce-­‐‑based  data  processing  system

• Netezza:  founded  in  2000  and  acquired  by  IBM  in  2010.  A  parallel  shared-­‐‑nothing  DBMS

8CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 9: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

WE WILL SEE 2  APPROACHES

• Parallel  databases,  started  at  the  80s– OLTP  (transaction  processing)– OLAP  (decision  support  queries)

• MapReduce– first  developed  by  Google,  published  in  2004– only  for  decision  support  queries  – ecosystem  around  it:  Hadoop,  PigLatin,  Hive,  …

9CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 10: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

PARALLEL DBMS

• The  goal  is  to  improve  performance  by  executing  multiple  operations  in  parallel  (scale-­‐‑out)

• Terminology  to  measure  performance:  – Speed-­‐‑up:  using  more  processors,  how  much  faster  does  the  task  run  (if  problem  size  is  fixed)?

– Scale-­‐‑up:  using  more  processors,  does  performance  remain  the  same  as  we  increase  the  problem  size?

10CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 11: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

SCALE-­‐‑UP VS SCALE-­‐‑OUT

Scale-­‐‑up• using  more  powerful  machines,  more  processors/RAM  per  machine

Scale-­‐‑out• using  a  larger  number  of  servers

11CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 12: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

ARCHITECTURES

• Shared  memory– nodes  share  RAM  +  disk– easy  to  program,  expensive  to  scale

• Shared  disk– nodes  access  the  same  disk,  hard  to  scale

• Shared  nothing– nodes  have  their  own  RAM+disk– connected  through  a  fast  network

12CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 13: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

PARALLEL QUERY EVALUATION

• Inter-­‐‑query  parallelism:– each  query  runs  on  one  processor  

• Inter-­‐‑operator  parallelism:  – each  query  runs  on  multiple  processors– an  operator  runs  on  one  processor  

• Intra-­‐‑operator  parallelism:  – An  operator  runs  on  multiple  processors

13CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 14: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

PARALLEL DATA STORAGE

Horizontal  data  partitioning• block partitioned• hash partitioned• rangepartitioned

Uniform  vs  skewed  partitioning

14CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 15: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

PARALLEL QUERY EVALUATION

• Parallel  Selection

• Parallel  Join  – hash  join– broadcast  join

15CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 16: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

MAPREDUCE

• Google  [Dean  2004]  • Open  source  implementation:  Hadoop• MapReduce:  – high-­‐‑level  programming  model  and  implementation  for  large-­‐‑scale  parallel  data  processing

– designed  to  simplify  task  of  writing  parallel  programs

16CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 17: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

MAPREDUCE

• Hides  messy  details  in  MapReduce runtime  library– automatic  parallelization– load  balancing– network  and  disk  transfer  optimizations– handling  of  failures– robustness

17CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 18: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

MAPREDUCE PIPELINE

• read  the  partitioned  data  (HDFS,  GFS)  • Map:  extract  something  you  care  about  from  each  record

• Shuffle  and  Sort  (done  by  the  system)• Reduce:  aggregate,  summarize,  filter,  transform  • write  the  results

18CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 19: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

MAPREDUCE DATAFLOW

19CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

source:  Hadoop  – The  Definitive  Guide,  by  Tom  White

Page 20: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

DATA MODEL

• A  file  =  a  bag  of  (key,  value)  pairs  

• A  MapReduce program:– Input:  a  bag  of  (input  key,  value)  pairs– Output:  a  bag  of  (output  key,  value)  pairs

20CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 21: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

THE MAP FUNCTION

User  provides  the  MAP function:• Input:  (input  key,  value)  • Output:  bag  of  (intermediate  key,  value)  

The  system  applies  the  map  function  in  parallel  to  all  (input  key,  value)  pairs  in  the  input  file  

21CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 22: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

THE REDUCE FUNCTION

User  provides  the  REDUCE function:• Input:  (intermediate  key,  bag  of  values)• Output:  bag  of  (output  key,  values)  

The  system  groups  all  pairs  with  the  same  intermediate  key,  and  passes  the  bag  of  values  to  the  REDUCE  function

22CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 23: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

EXAMPLE:  WORD COUNT

• Count  the  number  of  occurrences  of  each  word  in  a  large  collection  of  documents

• Each  Document  – key  =  document  id  (did)– value  =  set  of  words  (word)  

23CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 24: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

MAPREDUCE JOBS

• A  MapReduce job  consists  of  one  single  “query”– e.g.  count  the  words  in  all  docs  

• More  complex  queries  may  consist  of  multiple  jobs

24CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 25: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

MAPREDUCE ECOSYSTEM

Lots  of  extensions  to  address  limitations:• Capabilities  to  write  DAGs  of  MapReduce jobs• Declarative  languages  • Most  companies  use  both  types  of  engines  (MR  and  DBMS),  with  increased  integration

• Potential  replacement  to  MapReduce:  Spark

25CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 26: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

MAPREDUCE ECOSYSTEM

PIG  Latin  (Yahoo!)• New  language,  like  Relational  Algebra• open  source  Hive  (Facebook)• SQL-­‐‑like  language• open  sourceSQL  /  Tenzing (Google)• SQL  on  MR  • Proprietary  – morphed  into  BigQuery

26CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 27: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

PARALLEL DBMS  VS MAPREDUCE

Parallel  DBMS:• Relational  data  model  and  schema• Declarative  query  language:  SQL• Can  easily  combine  operators  into  complex  queries  • Query  optimization,  indexing,  and  physical  tuning• Streams  data  from  one  operator  to  the  next  without  blocking  

27CS  564  [Fall  2015]  -­‐‑ Paris  Koutris

Page 28: IG DATA SYSTEMS - pages.cs.wisc.edupages.cs.wisc.edu/~paris/cs564-f15/lectures/lecture-18.pdf · VARIETY • complex&workloads: – Machine&Learning&tasks:&e.g.&click&prediction,&topic&

PARALLEL DBMS  VS MAPREDUCE

MapReduce:• data  model  is  a  file  with  key-­‐‑value  pairs• no  need  to  “load  data”  before  processing• easy  to  write  user-­‐‑defined  operators• can  easily  add  nodes  to  the  cluster• intra-­‐‑query  fault-­‐‑tolerance  thanks  to  results  on  disk  • Arguably  more  scalable,  but  also  needs  more  nodes

28CS  564  [Fall  2015]  -­‐‑ Paris  Koutris