cloudera impala

Post on 25-Dec-2014

DESCRIPTION

Cloudera Impala brings fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale. As presented to the Portland Big Data User Group on July 23rd, 2014. http://www.meetup.com/Hadoop-Portland/events/194930422/

TRANSCRIPT

Cloudera Impala
Portland Big Data User Group, July 2014
Alex Moundalexis @technmsg

Thirty Seconds About Alex

• Solutions Architect
• aka consultant
• government
• infrastructure
• former coder of Perl
• former administrator
• fan of Portland

What Does Cloudera Do?

• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community

Disclaimer

• Cloudera builds software
• most donated to Apache
• some closed-source
• Cloudera “products” I reference are open source
• Apache Licensed
• source code is on GitHub
• https://github.com/cloudera

What This Talk Isn’t About

• deploying
• Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
• depends heavily on data and workload
• coding
• unless you count XML or CSV or SQL
• algorithms

[Image credit: Public Domain, IFCAR]

[Image credit: CC BY-SA, Lilian De Cassai]

cloud·e·ra im·pal·a

/kloudˈi(ə)rə imˈpalə/ noun

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

Quick and dirty, for context.

The Apache Hadoop Ecosystem

Why “Ecosystem?”

• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow

HDFS

• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html

Lots of Commodity Machines

[Image: Yahoo! Hadoop cluster, OSCON ’07]

MapReduce (MR)

• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html

Under the Covers

You specify map() and reduce() functions. The framework does the rest.
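To make the division of labor concrete, here is a minimal single-process sketch of the idea in Python. It is not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and the toy `run_job` stands in for everything the real framework handles (distribution, shuffling, fault tolerance).

```python
from collections import defaultdict


def map_fn(line):
    """User code: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User code: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # the "shuffle": collect values per key
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    data = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_job(data, map_fn, reduce_fn))
```

Only `map_fn` and `reduce_fn` change from job to job; the grouping step in between is the part a real cluster spreads across many machines.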

Apache Hive

• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce

Apache Hive Metastore

• Maps HDFS files to DB-like resources
• Databases
• Tables
• Column/field names, data types
• Roles/users
• InputFormat/OutputFormat

WHY DO WE NEED THIS? But wait…

I am not a SQL wizard by any means…

Super Shady SQL Supplement

A Simple Relational Database

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

Interacting with Relational Data

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT * FROM people;

Requesting Specific Fields

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT name, state FROM people;

Requesting Specific Rows

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT name, state FROM people WHERE year < 2012;
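The queries shown so far can be tried without a cluster. The sketch below loads the slides' people table into an in-memory SQLite database through Python's standard `sqlite3` module; SQLite is only a stand-in here, not Impala, and the helper names `make_people_db` and `run` are made up for illustration.

```python
import sqlite3

# Rows copied from the slides' "people" table.
PEOPLE = [
    ("Alex", "Maryland", "Cloudera", 2013),
    ("Joey", "Maryland", "Cloudera", 2011),
    ("Sean", "Texas", "Cloudera", 2013),
    ("Paris", "Maryland", "AOL", 2011),
]


def make_people_db():
    """Build an in-memory SQLite database holding the people table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", PEOPLE)
    return conn


def run(conn, sql):
    """Execute one query and return every row as a list of tuples."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = make_people_db()
    for sql in (
        "SELECT * FROM people;",
        "SELECT name, state FROM people;",
        "SELECT name, state FROM people WHERE year < 2012;",
    ):
        print(sql)
        for row in run(conn, sql):
            print("   ", row)
```

The column list narrows which fields come back, and the WHERE clause narrows which rows come back; with this data only the two 2011 rows satisfy `year < 2012`.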

Two Simple Tables

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

Joining Two Tables

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner;

Joining Two Tables

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner;

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland

Varying Implementation of JOIN

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner;

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
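The join can be reproduced the same way. This sketch again uses Python's `sqlite3` as a stand-in for a real engine. Two things are assumptions rather than facts from the slides: the blank pet names for Sean and Paris are stored as NULL, and the `render` helper reflects one possible reading of the "?" on the last slide, namely that tools differ in how they display a missing value.

```python
import sqlite3

JOIN_SQL = """
SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner;
"""


def make_db():
    """Load both slide tables into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [
            ("Alex", "Maryland", "Cloudera", 2013),
            ("Joey", "Maryland", "Cloudera", 2011),
            ("Sean", "Texas", "Cloudera", 2013),
            ("Paris", "Maryland", "AOL", 2011),
        ],
    )
    # Assumption: the empty pet-name cells on the slide are NULL (None here).
    conn.executemany(
        "INSERT INTO pets VALUES (?, ?, ?)",
        [
            ("Alex", "Cactus", "Marvin"),
            ("Joey", "Cat", "Brain"),
            ("Sean", "None", None),
            ("Paris", "Unknown", None),
        ],
    )
    return conn


def join_rows(conn):
    """Run the slide's LEFT JOIN and return (owner, state, pet) tuples."""
    return conn.execute(JOIN_SQL).fetchall()


def render(rows, null_text=""):
    """Replace missing values with the text a given client might display."""
    return [tuple(null_text if v is None else v for v in row) for row in rows]


if __name__ == "__main__":
    rows = join_rows(make_db())
    print(render(rows))        # missing pet shown as an empty cell
    print(render(rows, "?"))   # missing pet shown as "?"
```

A LEFT JOIN keeps every row from people even when no pet name is available, so Sean and Paris still appear in the output with a missing pet.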

Familiar interface, but more powerful.

Cloudera Impala

• Interactive query on Hadoop
• think seconds, not minutes
• Nearly ANSI-92 standard SQL
• compatible with HiveQL
• Native MPP query engine
• built for low-latency queries

Cloudera Impala – Design Choices

• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce

Cloudera Impala – Architecture

• Impala Daemon
• runs on every node
• handles client requests
• handles query planning & execution
• State Store Daemon
• provides name service
• metadata distribution
• used for finding data

Impala Query Execution

[Diagram: a SQL App connects via ODBC; three nodes each show Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase; Hive Metastore, HDFS NN, and Statestore are shown alongside. Arrows are labeled "SQL request" and "Query results".]

1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
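The five steps follow a scatter/gather pattern, which the toy Python sketch below illustrates. Nothing in it is Impala code: the node names, the fragment dictionaries, and the functions `plan`, `execute_fragment`, and `coordinate` are all hypothetical, and the loop runs sequentially where real daemons would run in parallel and stream results.

```python
from collections import Counter

# Hypothetical cluster: node name -> rows stored locally as (name, state).
NODES = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}


def plan(nodes):
    """Step 2: turn the request into one plan fragment per node holding data."""
    return [
        {"node": name, "op": "count_by_state"}
        for name, rows in nodes.items()
        if rows
    ]


def execute_fragment(fragment, nodes):
    """Step 3: run a fragment against the rows local to its node."""
    return Counter(state for _name, state in nodes[fragment["node"]])


def coordinate(nodes):
    """Steps 4-5: merge the partial results and hand back the final answer."""
    total = Counter()
    for fragment in plan(nodes):
        total.update(execute_fragment(fragment, nodes))  # adds partial counts
    return dict(total)


if __name__ == "__main__":
    # Roughly: SELECT state, COUNT(*) FROM people GROUP BY state;
    print(coordinate(NODES))
```

The point of the pattern is that each fragment touches only data on its own node, so the coordinator moves small partial results across the network instead of raw rows.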

Cloudera Impala – Results

• Allows for fast iteration/discovery
• How much faster?
• 3-4x faster on I/O bound workloads
• up to 45x faster on multi-MR queries
• up to 90x faster on in-memory cache

Hold onto something, folks.

Demo

What’s Next?

• Download Hadoop!
• CDH available at www.cloudera.com
• Already done that? Contribute…
• Cloudera provides pre-loaded VMs
• http://tiny.cloudera.com/quickstartvm
• Clone our repos!
• https://github.com/cloudera

Special thanks: PORTLAND

Preferably related to the talk… or not.

Questions?

Thank You!
Alex Moundalexis @technmsg
We’re hiring, kids! Well, not kids.
