Cloudera Impala

Portland Big Data User Group, July 2014. Alex Moundalexis, @technmsg

Upload: alex-moundalexis

Post on 25-Dec-2014


DESCRIPTION

Cloudera Impala provides a fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale. As presented to Portland Big Data User Group on July 23rd 2014. http://www.meetup.com/Hadoop-Portland/events/194930422/

TRANSCRIPT

Page 1: Cloudera Impala

Cloudera Impala
Portland Big Data User Group, July 2014
Alex Moundalexis, @technmsg

Page 2: Cloudera Impala

Thirty Seconds About Alex

• Solutions Architect
  • aka consultant
  • government
  • infrastructure
• former coder of Perl
• former administrator
• fan of Portland

Page 3: Cloudera Impala

What Does Cloudera Do?

• product
  • distribution of Hadoop components, Apache licensed
  • enterprise tooling
• support
• training
• services (aka consulting)
• community

Page 4: Cloudera Impala

Disclaimer

• Cloudera builds software
  • most donated to Apache
  • some closed-source
• Cloudera “products” I reference are open source
  • Apache Licensed
  • source code is on GitHub
    • https://github.com/cloudera

Page 5: Cloudera Impala

What This Talk Isn’t About

• deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
  • depends heavily on data and workload
• coding
  • unless you count XML or CSV or SQL
• algorithms

Page 6: Cloudera Impala

Image: Public Domain, IFCAR

Page 7: Cloudera Impala

Image: CC BY-SA, Lilian De Cassai

Page 8: Cloudera Impala

cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/
noun

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

Page 9: Cloudera Impala

The Apache Hadoop Ecosystem

Quick and dirty, for context.

Page 10: Cloudera Impala

Why “Ecosystem?”

• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow

Page 11: Cloudera Impala

HDFS

• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html

Page 12: Cloudera Impala

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]

Page 13: Cloudera Impala

MapReduce (MR)

• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html

Page 14: Cloudera Impala

Under the Covers

Page 15: Cloudera Impala

You specify map() and reduce() functions.

The framework does the rest.
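To make the split of responsibilities concrete, here is a toy word count in plain Python. It is only a sketch of the programming model, not Hadoop's Java API: `map_fn`, `reduce_fn`, and `run_job` are names made up for this illustration, and `run_job` stands in for everything the real framework does (distribution, shuffling, fault tolerance) on a single machine.

```python
from collections import defaultdict


def map_fn(line):
    """User code: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User code: combine every value emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)  # the "shuffle": gather values per key
    return dict(reduce_fn(key, values) for key, values in groups.items())


result = run_job(["impala is fast", "hadoop is big"])
print(result)
```

Only the first two functions are what a MapReduce author writes; the grouping step in the middle is the part "the framework does" for you.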

Page 16: Cloudera Impala

Apache Hive

• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
  • a “SQL-like” language
• Eases analysis using MapReduce

Page 17: Cloudera Impala

Apache Hive Metastore

• Maps HDFS files to DB-like resources
  • Databases
  • Tables
  • Column/field names, data types
  • Roles/users
  • InputFormat/OutputFormat

Page 18: Cloudera Impala

But wait…

WHY DO WE NEED THIS?

Page 19: Cloudera Impala


Page 20: Cloudera Impala

Super Shady SQL Supplement

I am not a SQL wizard by any means…

Page 21: Cloudera Impala

A Simple Relational Database

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

>

Page 22: Cloudera Impala

Interacting with Relational Data

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT * FROM people;

Page 24: Cloudera Impala

Requesting Specific Fields

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT name, state FROM people;

Page 26: Cloudera Impala

Requesting Specific Rows

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT name, state FROM people WHERE year < 2012;
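If you want to try the three queries from these slides without a cluster, the sketch below runs them against an in-memory SQLite database from Python. SQLite is only a convenient stand-in here (it is not Impala or Hive), and the column types in the `CREATE TABLE` statement are an assumption; the slides show the data but not a schema.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Column types are assumed for illustration; the slides only show the values.
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Only the requested fields.
name_state = conn.execute("SELECT name, state FROM people").fetchall()

# Only the rows that satisfy the WHERE clause.
before_2012 = conn.execute(
    "SELECT name, state FROM people WHERE year < 2012"
).fetchall()

for row in before_2012:
    print(row)
```

The last query returns only the Joey and Paris rows, since those are the two with a year earlier than 2012. Row order is not guaranteed without an `ORDER BY`.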

Page 28: Cloudera Impala

Two Simple Tables

pets:

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

people:

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

>

Page 29: Cloudera Impala

Joining Two Tables

pets:

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

people:

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner

Page 32: Cloudera Impala

Joining Two Tables

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland

Page 33: Cloudera Impala

Varying Implementation of JOIN

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
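Here is the same LEFT JOIN as a runnable sketch, again using Python's built-in SQLite as a stand-in rather than Impala. One assumption is labeled in the code: the blank pet names for Sean and Paris are stored as NULL. How an engine or client displays that missing value is exactly the "?" on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types are assumed; the slides only show the values.
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")

conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        ("Alex", "Cactus", "Marvin"),
        ("Joey", "Cat", "Brain"),
        ("Sean", "None", None),      # assumption: blank pet name stored as NULL
        ("Paris", "Unknown", None),  # assumption: blank pet name stored as NULL
    ],
)

# The query from the slide, unchanged.
rows = conn.execute(
    "SELECT people.name AS owner, people.state AS state, pets.name AS pet "
    "FROM people LEFT JOIN pets ON people.name = pets.owner"
).fetchall()

for owner, state, pet in rows:
    print(owner, state, pet)
```

In Python's SQLite driver the missing pet names come back as `None`. Other systems may render NULL as an empty cell, the literal text NULL, or something else, so check what your engine and client do before relying on it.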

Page 34: Cloudera Impala

Cloudera Impala

Familiar interface, but more powerful.

Page 35: Cloudera Impala

Cloudera Impala

• Interactive query on Hadoop
  • think seconds, not minutes
• Nearly ANSI-92 standard SQL
  • compatible with HiveQL
• Native MPP query engine
  • built for low-latency queries

Page 36: Cloudera Impala

Cloudera Impala – Design Choices

• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce

Page 37: Cloudera Impala

Cloudera Impala – Architecture

• Impala Daemon
  • runs on every node
  • handles client requests
  • handles query planning & execution
• State Store Daemon
  • provides name service
  • metadata distribution
  • used for finding data

Page 38: Cloudera Impala

Impala Query Execution

[Diagram: a SQL app connects via ODBC. Shared services: Hive Metastore, HDFS NN, Statestore. Each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Arrow label: SQL request.]

1) Request arrives via ODBC/JDBC/HUE/Shell

Page 39: Cloudera Impala

Impala Query Execution

[Diagram: a SQL app connects via ODBC. Shared services: Hive Metastore, HDFS NN, Statestore. Each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase.]

2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data

Page 40: Cloudera Impala

Impala Query Execution

[Diagram: a SQL app connects via ODBC. Shared services: Hive Metastore, HDFS NN, Statestore. Each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Arrow label: Query results.]

4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
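The five steps follow a scatter/gather pattern, which the toy Python sketch below imitates on one machine. This is not Impala's code or API: `NODE_DATA`, `plan`, `execute_fragment`, and `coordinate` are all hypothetical names, the placement of rows on nodes is invented, and a plain loop stands in for parallel execution and streaming. It answers the equivalent of a `GROUP BY state` count over the people table from earlier slides.

```python
from collections import Counter

# Hypothetical data placement: each "node" holds a slice of the people table
# as (name, state) rows. Real placement is decided by HDFS, not by hand.
NODE_DATA = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}


def plan():
    """Planner: one fragment per node that holds relevant data (step 2)."""
    return [{"node": node} for node in NODE_DATA]


def execute_fragment(fragment):
    """Executor: count rows per state using only that node's local data."""
    local_rows = NODE_DATA[fragment["node"]]
    return Counter(state for _name, state in local_rows)


def coordinate():
    """Coordinator: start fragments near the data, merge partial results."""
    total = Counter()
    for fragment in plan():                   # step 3: run where the data lives
        total += execute_fragment(fragment)   # step 4: merge intermediate results
    return dict(total)                        # step 5: hand results to the client


result = coordinate()
print(result)
```

The point of the design is that each executor touches only its local rows and ships back a small partial result, so the coordinator merges counts rather than moving raw data around.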

Page 41: Cloudera Impala

Cloudera Impala – Results

• Allows for fast iteration/discovery
• How much faster?
  • 3-4x faster on I/O bound workloads
  • up to 45x faster on multi-MR queries
  • up to 90x faster on in-memory cache

Page 42: Cloudera Impala

Demo

Hold onto something, folks.

Page 43: Cloudera Impala

What’s Next?

• Download Hadoop!
  • CDH available at www.cloudera.com
  • Already done that? Contribute…
• Cloudera provides pre-loaded VMs
  • http://tiny.cloudera.com/quickstartvm
• Clone our repos!
  • https://github.com/cloudera

Page 44: Cloudera Impala

Special thanks:

PORTLAND

Page 45: Cloudera Impala

Questions?

Preferably related to the talk… or not.

Page 46: Cloudera Impala

Thank You!
Alex Moundalexis, @technmsg

We’re hiring, kids! Well, not kids.