Introduction to Cloudera Impala
DESCRIPTION
Cloudera Impala provides a fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale. As presented to Charm City Linux on March 25th, 2014. http://www.meetup.com/CharmCityLinux/events/168288632/
TRANSCRIPT
Cloudera Impala
Charm City Linux, March 2014
Alex Moundalexis [email protected] @technmsg
Thirty Seconds About Alex
• Solutions Architect • aka consultant • government • infrastructure
• former coder of Perl • former administrator • likes shiny objects
What Does Cloudera Do?
• product • distribution of Hadoop components, Apache licensed • enterprise tooling
• support • training • services (aka consulting) • community
Disclaimer
• Cloudera builds software • most donated to Apache • some closed-source
• Cloudera “products” I reference are open source • Apache Licensed • source code is on GitHub
• https://github.com/cloudera
What This Talk Isn’t About
• deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning • depends heavily on data and workload
• coding • unless you count XML or CSV or SQL
• algorithms
Quick and dirty, for context.
The Apache Hadoop Ecosystem
Why “Ecosystem?”
• In the beginning, just Hadoop • HDFS • MapReduce
• Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
HDFS
• Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System
• http://research.google.com/archive/gfs.html
Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [ OSCON ’07 ]
MapReduce (MR)
• Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
Under the Covers
You specify map() and reduce() functions. The framework does the rest.
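As a rough illustration of that division of labor, here is a toy, single-process Python sketch of a word count. The names map_fn, reduce_fn, and run_job are made up for this example and are not part of any Hadoop API. run_job only stands in for the framework, which in real Hadoop also handles distribution, scheduling, and fault tolerance.

```python
from collections import defaultdict


def map_fn(line):
    """Map step (user-supplied): emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step (user-supplied): collapse all values for one key into a total."""
    return word, sum(counts)


def run_job(lines):
    """Hypothetical stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:                     # map phase
        for key, value in map_fn(line):
            groups[key].append(value)      # "shuffle": group values by key
    # reduce phase: one reduce_fn call per distinct key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


word_counts = run_job(["the quick brown fox", "the lazy dog", "the end"])
print(word_counts)
```

Only map_fn and reduce_fn would change for a different job; the grouping logic in run_job stays the same, which is the point of the paradigm.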
Apache Hive
• Abstraction of Hadoop’s Java API • HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
Apache Hive Metastore
• Maps HDFS files to DB-like resources • Databases • Tables • Column/field names, data types • Roles/users • InputFormat/OutputFormat
WHY DO WE NEED THIS? But wait…
I am not a SQL wizard by any means…
Super Shady SQL Supplement
A Simple Relational Database
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
Interacting with Relational Data
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
> SELECT * FROM people;
Requesting Specific Fields
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
> SELECT name, state FROM people;
Requesting Specific Rows
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
> SELECT name, state FROM people WHERE year < 2012;
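To try the three queries above without a cluster, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. SQLite is an assumption made purely for convenience and is not Impala. The column types and the ORDER BY in the last query are also additions for this example, the latter only so the output order is predictable.

```python
import sqlite3

# In-memory SQLite database standing in for a real SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Only the requested fields.
name_state = conn.execute("SELECT name, state FROM people").fetchall()

# Only the rows matching the WHERE clause (ORDER BY added for stable output).
before_2012 = conn.execute(
    "SELECT name, state FROM people WHERE year < 2012 ORDER BY name"
).fetchall()

print(before_2012)
```

With the four rows from the slides, the filtered query keeps Joey and Paris, the two people whose year is 2011.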
Two Simple Tables
owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
Joining Two Tables
owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
> SELECT people.name AS owner, people.state AS state, pets.name AS pet FROM people LEFT JOIN pets ON people.name = pets.owner;
owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland
Varying Implementation of JOIN
> SELECT people.name AS owner, people.state AS state, pets.name AS pet FROM people LEFT JOIN pets ON people.name = pets.owner;
owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
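One way to see what ends up in those pet cells is to run the join against a stand-in database. The sketch below again uses Python's built-in sqlite3 module, which is an assumption for convenience and not Impala. It also assumes the empty name cells in the pets table are stored as NULL, and adds an ORDER BY so the output order is predictable.

```python
import sqlite3

# In-memory SQLite database standing in for a real SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        ("Alex", "Cactus", "Marvin"),
        ("Joey", "Cat", "Brain"),
        ("Sean", "None", None),      # empty name cell stored as NULL (assumption)
        ("Paris", "Unknown", None),  # empty name cell stored as NULL (assumption)
    ],
)

joined = conn.execute(
    """
    SELECT people.name AS owner, people.state AS state, pets.name AS pet
    FROM people LEFT JOIN pets ON people.name = pets.owner
    ORDER BY people.name
    """
).fetchall()

for row in joined:
    print(row)
```

In this setup the missing pet names come back as NULL, which Python shows as None. Other engines and clients may render the same value differently, which is one reading of the question marks on the slide.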
Familiar interface, but more powerful.
Cloudera Impala
Cloudera Impala
• Interactive query on Hadoop • think seconds, not minutes
• Nearly ANSI-92 standard SQL • compatible with HiveQL
• Native MPP query engine • built for low-latency queries
Cloudera Impala – Design Choices
• Native daemons, written in C/C++ • No JVM, no MapReduce • Saturate disks on reads • Uses in-memory HDFS caching
• Re-uses Hive metastore • Not as fault-tolerant as MapReduce
Cloudera Impala – Architecture
• Impala Daemon • runs on every node • handles client requests • handles query planning & execution
• State Store Daemon • provides name service • metadata distribution • used for finding data
Impala Query Execution
[Diagram: SQL App (ODBC); Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
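The flow in steps 2 through 5 can be pictured with a toy Python model. This is only a sketch under loud assumptions: the node names, the data placement in NODE_DATA, and the functions plan, execute_fragment, and coordinate are all invented for illustration and do not reflect Impala's actual internals or APIs. It also runs sequentially in one process, where the real daemons work in parallel across machines.

```python
from collections import Counter

# Hypothetical data placement: each "node" holds some (name, state) rows locally.
NODE_DATA = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}


def plan(column_index):
    """Step 2 (toy): split the request into one fragment per node holding data."""
    return [(node, column_index) for node in NODE_DATA]


def execute_fragment(node, column_index):
    """Steps 3-4 (toy): run a fragment on node-local rows, return a partial count."""
    return Counter(row[column_index] for row in NODE_DATA[node])


def coordinate(column_index):
    """Step 5 (toy): merge partial results and return the total to the client."""
    total = Counter()
    for node, fragment_column in plan(column_index):
        total += execute_fragment(node, fragment_column)
    return dict(total)


# Roughly: SELECT state, COUNT(*) FROM people GROUP BY state
result = coordinate(1)
print(result)
```

The design point the sketch tries to capture is that each fragment touches only the rows stored on its own node, so only small partial results need to move between nodes and back to the client.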
Cloudera Impala – Results
• Allows for fast iteration/discovery • How much faster?
• 3-4x faster on I/O bound workloads • up to 45x faster on multi-MR queries • up to 90x faster on in-memory cache
Hold onto something, folks.
Demo
What’s Next?
• Download Hadoop! • CDH available at www.cloudera.com • Already done that? Contribute…
• Cloudera provides pre-loaded VMs • http://tiny.cloudera.com/quickstartvm
• Clone our repos! • https://github.com/cloudera
PARIS Special thanks:
Preferably related to the talk… or not.
Questions?
Thank You!
Alex Moundalexis [email protected] @technmsg
We’re hiring, kids! Well, not kids.