impala: a modern sql engine for hadoop

20
Headline Goes Here Speaker Name or Subhead Goes Here DO NOT USE PUBLICLY PRIOR TO 10/23/12 Cloudera Impala: A Modern SQL Engine for Hadoop JusIn Erickson | Senior Product Manager, Cloudera May 2013

Upload: doancong

Post on 03-Jan-2017

223 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Impala: A Modern SQL Engine for Hadoop

Headline  Goes  Here  Speaker  Name  or  Subhead  Goes  Here  

DO  NOT  USE  PUBLICLY  PRIOR  TO  10/23/12  

Cloudera  Impala:  A  Modern  SQL  Engine  for  Hadoop  JusIn  Erickson  |    Senior  Product  Manager,  Cloudera  May  2013  

Page 2: Impala: A Modern SQL Engine for Hadoop

Agenda  

• Why  Impala?  • Architectural  Overview  • AlternaIve  Approaches  • Project  Status  

©2013 Cloudera, Inc. All Rights Reserved. 2

Page 3: Impala: A Modern SQL Engine for Hadoop

Why  Hadoop?  •  Scalability  

•  Simply  scales  just  by  adding  nodes  •  Local  processing  to  avoid  network  boTlenecks  

•  Flexibility  •  All  kinds  of  data  (blobs,  documents,  records,  etc)  •  In  all  forms  (structured,  semi-­‐structured,  unstructured)  •  Store  anything  then  later  analyze/process  what  you  need  •  Analyze/process  the  data  how  you  need  to  

•  Efficiency  •  Cost  efficient  soZware  on  commodity  hardware  •  Unified  storage,  metadata,  security  (no  duplicaIon  or  synchronizaIon)  

©2013 Cloudera, Inc. All Rights Reserved. 3

Page 4: Impala: A Modern SQL Engine for Hadoop

What’s  Impala?  •  Interac<ve  SQL  

•  Typically  5-­‐65x  faster  than  Hive  (observed  up  to  100x  faster)  •  Responses  in  seconds  instead  of  minutes  (someImes  sub-­‐second)  

•  Approx.  ANSI-­‐92  standard  SQL  queries  with  HiveQL  •  CompaIble  SQL  interface  for  exisIng  Hadoop/CDH  applicaIons  •  Based  on  industry  standard  SQL  

•  Na<vely  on  Hadoop/HBase  storage  and  metadata  •  Flexibility,  scale,  and  cost  advantages  of  Hadoop  •  No  duplicaIon/synchronizaIon  of  data  and  metadata  •  Local  processing  to  avoid  network  boTlenecks  

•  Separate  run<me  from  MapReduce  •  MapReduce  is  designed  and  great  for  batch  •  Impala  is  purpose-­‐built  for  low-­‐latency  SQL  queries  on  Hadoop  

©2013 Cloudera, Inc. All Rights Reserved. 4

Page 5: Impala: A Modern SQL Engine for Hadoop

So  what?  •  Interac<ve  BI/analy<cs  

•  BI  tools  impracIcal  on  Hadoop  before  Impala  •  Move  from  10s  of  Hadoop  users  per  cluster  to  100s  of  SQL  users  •  More  and  faster  value  from  “big  data”  

•  Exploratory  analy<cs  •  Ask  and  explore  new  quesIons  on  raw,  granular  data  •  No  upfront  ETL  process  to  access  data  

•  Queryable  archive  with  full  fidelity  •  Keep  historical  data  acIve  in  Hadoop  instead  of  inaccessible  tape  

•  Data  processing  with  <ght  SLAs  •  Sub-­‐minute  SLAs  now  possible  

©2013 Cloudera, Inc. All Rights Reserved. 5

Page 6: Impala: A Modern SQL Engine for Hadoop

Impala  Architecture  •  Two  binaries:  impalad  and  statestored  

•  Impala  daemon  (impalad)  •  one  Impala  daemon  on  each  node  with  data  •  handles  external  client  requests  and  all  internal  requests  related  to  query  execuIon  

•  State  store  daemon  (statestored)  •  provides  name  service  and  metadata  distribuIon  •  not  part  of  query  execuIon  path  

©2013 Cloudera, Inc. All Rights Reserved. 6

Page 7: Impala: A Modern SQL Engine for Hadoop

Impala  Architecture:  Query  ExecuIon  Phases  

• Client  SQL  arrives  via  ODBC/JDBC/Hue  GUI/Shell  • Planner  turns  request  into  collecIons  of  plan  fragments  • Coordinator  iniIates  execuIon  on  impalad's  local  to  data  • During  execu<on:  

•  intermediate  results  are  streamed  between  executors  •  query  results  are  streamed  back  to  client  •  subject  to  limitaIons  imposed  to  blocking  operators  (top-­‐n,  aggregaIon)  

©2013 Cloudera, Inc. All Rights Reserved. 7

Page 8: Impala: A Modern SQL Engine for Hadoop

Impala  Architecture:  Planner  •  Example:  query  with  join  and  aggregaIon  

SELECT  state,  SUM(revenue)  FROM  HdfsTbl  h  JOIN  HbaseTbl  b  ON  (...)  GROUP  BY  1  ORDER  BY  2  desc  LIMIT  10  

Hbase  Scan  

Hash  Join  

Hdfs  Scan   Exch  

TopN  

Agg  

Exch  at  coordinator   at  DataNodes   at  region  servers  

Agg  TopN  

Agg  Hash  Join  

Hdfs  Scan  

Hbase  Scan  

©2013 Cloudera, Inc. All Rights Reserved. 8

Page 9: Impala: A Modern SQL Engine for Hadoop

Impala  Architecture:  Query  ExecuIon  •  Request  arrives  via  ODBC/JDBC/Hue  GUI/Shell  

Query  Planner  Query  Coordinator  Query  Executor  HDFS  DN   HBase  

SQL  App  ODBC  

Hive  Metastore   HDFS  NN   Statestore  

Query  Planner  Query  Coordinator  Query  Executor  

HDFS  DN   HBase  

Query  Planner  Query  Coordinator  Query  Executor  HDFS  DN   HBase  

SQL  request  

©2013 Cloudera, Inc. All Rights Reserved. 9

Page 10: Impala: A Modern SQL Engine for Hadoop

Impala  Architecture:  Query  ExecuIon  •  Planner  turns  request  into  collecIons  of  plan  fragments  •  Coordinator  iniIates  execuIon  on  impalad's  local  to  data  

Query  Planner  Query  Coordinator  Query  Executor  HDFS  DN   HBase  

SQL  App  ODBC  

Query  Planner  Query  Coordinator  Query  Executor  

HDFS  DN   HBase  

Query  Planner  Query  Coordinator  Query  Executor  HDFS  DN   HBase  

Hive  Metastore   HDFS  NN   Statestore  

©2013 Cloudera, Inc. All Rights Reserved. 10

Page 11: Impala: A Modern SQL Engine for Hadoop

Impala  Architecture:  Query  ExecuIon  •  Intermediate  results  are  streamed  between  impalad’s  •  Query  results  are  streamed  back  to  client  

Query  Planner  Query  Coordinator  Query  Executor  HDFS  DN   HBase  

SQL  App  ODBC  

Hive  Metastore   HDFS  NN   Statestore  

Query  Planner  Query  Coordinator  Query  Executor  

HDFS  DN   HBase  

Query  Planner  Query  Coordinator  Query  Executor  HDFS  DN   HBase  

query  results  

©2013 Cloudera, Inc. All Rights Reserved. 11

Page 12: Impala: A Modern SQL Engine for Hadoop

Impala  and  Hive  •  Everything  Client-­‐Facing  is  Shared  with  Hive:  

•  Metadata  (table  definiIons)  •  ODBC/JDBC  drivers  •  Hue  GUI  •  SQL  syntax  (HiveQL)  •  Flexible  file  formats  •  Machine  pool  

•  Internal  Improvements:  •  Purpose-­‐built  query  engine  direct  on  HDFS  and  HBase  •  No  JVM  startup  and  no  MapReduce  •  In-­‐memory  data  transfers  •  Modern  tech  including  special  hardware  instrucIons,  runIme  code  generaIon,  etc  •  NaIve  distributed  relaIonal  query  engine  

©2013 Cloudera, Inc. All Rights Reserved. 12

Page 13: Impala: A Modern SQL Engine for Hadoop

What  about  an  EDW/RDBMS?  •  “Right  tool  for  the  right  job”  

•  EDW/RDBMS  great  for:  •  OLTP’s  complex  transacIons  •  Highly  planned  and  opImized  known  workloads  •  Opera4onal  reports  and  drill  into  repeated  known  queries  

•  Impala’s  great  for:  •  Exploratory  analy4cs  with  new  previously  unknown  queries  •  Queries  on  big  and  growing  data  sets  

•  EDW/RDBMS  can’t:  •  Dump  in  raw  data  then  later  define  schema  and  query  what  you  want  •  Evolve  schemas  without  an  expensive  schema  upgrade  planning  process  •  Simply  scale  just  by  adding  nodes  •  Store  at  a  $/TB  conducive  to  big  data  

©2013 Cloudera, Inc. All Rights Reserved. 13

Page 14: Impala: A Modern SQL Engine for Hadoop

AlternaIve  Hadoop  Query  Approaches  

©2013 Cloudera, Inc. All Rights Reserved.

Query  Node  

Query  Node  

Query  Node  

NN  

DN   DN   DN  

Query  Engine  

Query  Node  

DN  MR  

Hive  OR  

HDFS  MR  

Batch  MapReduce   Siloed  DBMS  Remote  Query  

DN  

MR  

High-­‐latency  MR    Separate  nodes  for  SQL/MR    Duplicate  metadata,  security,  SQL,  MR,  etc.    

Network  boTleneck    Separate  nodes  for  SQL/MR    Duplicate  metadata,  security,  SQL,  MR,  etc.      

Query  subset  of  siloed  data    TradiIonal  RDBMS  rigidity    Duplicate  storage,  metadata,  security,  SQL,  etc.  

14  

Page 15: Impala: A Modern SQL Engine for Hadoop

Comparing  Impala  to  Google  Dremel  •  What  is  Google  Dremel:  

•  columnar  storage  for  data  with  nested  structures  •  distributed  scalable  aggregaIon  on  top  of  that  

•  Columnar  storage  in  Hadoop:  Parquet  •  joint  project  between  Cloudera  and  TwiTer  •  new  columnar  format,  derived  from  Doug  Cunng's  Trevni  •  stores  data  in  appropriate  naIve/binary  types  •  can  also  store  nested  structures  similar  to  Dremel's  ColumnIO  

•  Distributed  aggregaIon:  Impala  

•  Impala  +  Parquet:  superset  of  the  published  version  of  Dremel  adding:  •  Joins  •  MulIple  file  format  support  

©2013 Cloudera, Inc. All Rights Reserved. 15

Page 16: Impala: A Modern SQL Engine for Hadoop

Impala  GA  (Beginning  of  Q2  2013)  •  ~ANSI-­‐92  SQL  

•  CREATE,  ALTER,  SELECT,  INSERT,  JOIN,  subqueries,  etc.  •  NaIve  Hadoop  data  formats:  

•  Avro,  SequenceFile,  RCFile  with  Snappy,  GZIP,  BZIP,  or  uncompressed  •  Text  (uncompressed  or  LZO-­‐compressed)  •  Parquet  columnar  format  with  Snappy  or  uncompressed  

•  Full  CDH  4  64-­‐bit  packages:  •  RHEL  6.2/5.7,  Ubuntu,  Debian,  SLES  

•  ConnecIvity:  •  JDBC,  ODBC,  Hue  GUI,  command-­‐line  shell  

•  Performance:  •  Bigger  and  faster  joins  (parIIoned  joins)  •  Fully  distributed  aggregaIons  •  Fully  distributed  top-­‐n  queries  •  More  opImized  SQL  funcIons  

•  ProducIon-­‐readiness:  •  Kerberos  authenIcaIon  •  MR/Impala  resource  isolaIon  

©2013 Cloudera, Inc. All Rights Reserved. 16

Page 17: Impala: A Modern SQL Engine for Hadoop

Impala  Post-­‐GA  Roadmap  •  Security  

•  AuthorizaIon  •  AudiIng  •  LDAP/AcIve  Directory  username/password  authenIcaIon  

•  AddiIonal  SQL:  •  UDFs  •  SQL  authorizaIon  and  DDL  •  ORDER  BY  without  LIMIT  •  window  funcIons  •  support  for  structured  data  types  

•  Improved  HBase  support:  •  CREATE/INSERT/UPDATE/DELETE  •  composite  keys,  complex  types  in  columns  •  index  nested-­‐loop  joins  

•  ConInued  performance  gains:  •  straggler  handling  •  join  order  opImizaIon  •  improved  cache  management  •  data  collocaIon  for  improved  join  performance  

•  BeTer  metadata  handling:  •  automaIc  metadata  distribuIon  through  statestore  

•  ConInued  resource  control  enhancements  with  MapReduce  

©2013 Cloudera, Inc. All Rights Reserved. 17

Page 18: Impala: A Modern SQL Engine for Hadoop

It’s  Not  Just  About  SQL  on  Hadoop  

18

The  PlaVorm  for  Big  Data  

Storage  

Integra<on  

Resource  Management  

Metad

ata  

Batch  Processing  

MAPREDUCE,  HIVE  &  PIG  

…InteracIve  

SQL  IMPALA  

Machine  

Learning  MAHOUT,  DATAFU  

HDFS   HBase  

TEXT,  RCFILE,  PARQUET,  AVRO…   RECORDS  

Engines  

Management    |    Support  

•  Single  plaworm  for  processing  &  analyIcs  

•  Scales  to  ‘000s  of  servers  

•  No  upfront  schema  

•  10%  the  cost  per  TB  

•  Open  source  plaworm  ©2013 Cloudera, Inc. All Rights

Reserved.

Page 19: Impala: A Modern SQL Engine for Hadoop

Try  It  Out!  

•  Impala  1.0  GA  released  4/30/2013  • 100%  Apache-­‐licensed  open  source  • QuesIons/comments?  

•  Email:  impala-­‐[email protected]  •  Join:  hTp://groups.cloudera.org  

©2013 Cloudera, Inc. All Rights Reserved. 19

Page 20: Impala: A Modern SQL Engine for Hadoop