lower tco - running bi projects with impala
TRANSCRIPT
Thirty Seconds About Alex
• Solutions Architect • aka consultant • government • infrastructure
• former coder of Perl • former administrator • fan of Portland
What Does Cloudera Do?
• product • distribution of Hadoop components, Apache licensed • enterprise tooling
• support • training • services (aka consulting) • community
Disclaimer
• Cloudera builds software • most donated to Apache • some closed-source
• Cloudera “products” I reference are open source • Apache Licensed • source code is on GitHub
• https://github.com/cloudera
What This Talk Isn’t About
• deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning • depends heavily on data and workload
• coding • unless you count XML or CSV or SQL
• algorithms
cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/ noun
a modern, open source, MPP SQL query engine for Apache Hadoop. “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
Why “Ecosystem?”
• In the beginning, just Hadoop • HDFS • MapReduce
• Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
HDFS
• Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System
• http://research.google.com/archive/gfs.html
Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [ OSCON ’07 ]
MapReduce (MR)
• Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
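The paradigm is easier to see in miniature. What follows is a single-process sketch of the map, shuffle, and reduce phases using word counting; the function names and sample lines are made up for illustration, and nothing here uses Hadoop's actual API or runs distributed.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # In a real cluster each phase runs in parallel across many machines;
    # here everything runs in order in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["hadoop stores data", "impala queries data"])
print(counts)
```

Because each map call sees only its own line and each reduce call sees only its own key, the work can be spread across machines, which is why the model suits batch jobs on a cluster.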
Apache Hive
• Abstraction of Hadoop’s Java API • HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
Apache Hive Metastore
• Maps HDFS files to DB-like resources • Databases • Tables • Column/field names, data types • Roles/users • InputFormat/OutputFormat
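As a rough mental model only, the metastore can be pictured as a lookup from a database and table name to the files and schema behind it. The sketch below is a toy: the class names, the `/data/people` path, and the format label are hypothetical and do not reflect the real metastore's schema or API.

```python
from dataclasses import dataclass

@dataclass
class TableEntry:
    database: str
    table: str
    columns: list       # (column name, data type) pairs
    location: str       # directory in HDFS holding the table's files (hypothetical path)
    input_format: str   # how the files should be read (illustrative label)

class ToyMetastore:
    """Toy catalog: maps (database, table) to the metadata describing it."""

    def __init__(self):
        self._tables = {}

    def register(self, entry):
        self._tables[(entry.database, entry.table)] = entry

    def lookup(self, database, table):
        return self._tables[(database, table)]

metastore = ToyMetastore()
metastore.register(TableEntry(
    database="default",
    table="people",
    columns=[("name", "string"), ("state", "string"),
             ("employer", "string"), ("year", "int")],
    location="/data/people",
    input_format="text",
))

entry = metastore.lookup("default", "people")
print(entry.location, [name for name, _ in entry.columns])
```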
A Simple Relational Database
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
Interacting with Relational Data
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
> SELECT * FROM people;
Requesting Specific Fields
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
> SELECT name, state FROM people;
Requesting Specific Rows
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
> SELECT name, state FROM people WHERE year < 2012;
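The three queries above can be tried without a cluster. This is a minimal sketch that assumes Python's built-in `sqlite3` module as a stand-in SQL engine; it is not Impala, and the column types in the `CREATE TABLE` statement are a guess, since the slides show only the data.

```python
import sqlite3

# In-memory SQLite database standing in for a SQL engine (not Impala).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Only two of the four columns.
fields = conn.execute("SELECT name, state FROM people").fetchall()

# Only the rows that satisfy the WHERE clause.
early = conn.execute(
    "SELECT name, state FROM people WHERE year < 2012"
).fetchall()

for label, rows in (("all", all_rows), ("fields", fields), ("early", early)):
    print(label, rows)
```

The column list narrows the result by field and the `WHERE` clause narrows it by row, so the last query returns only the two people whose year is before 2012.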
Two Simple Tables
pets:
owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown
people:
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
Joining Two Tables
> SELECT people.name AS owner, people.state AS state, pets.name AS pet FROM people LEFT JOIN pets ON people.name = pets.owner
owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland
Varying Implementation of JOIN
> SELECT people.name AS owner, people.state AS state, pets.name AS pet FROM people LEFT JOIN pets ON people.name = pets.owner
owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
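To see what one engine puts in the `?` cells, the join can be run against a stand-in. This sketch again assumes Python's built-in `sqlite3` module rather than Impala, and it assumes the blank pet names for Sean and Paris are stored as NULL; the slides leave those cells empty, so that storage choice is a guess.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, not Impala
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        ("Alex", "Cactus", "Marvin"),
        ("Joey", "Cat", "Brain"),
        ("Sean", "None", None),      # assumption: blank name stored as NULL
        ("Paris", "Unknown", None),  # assumption: blank name stored as NULL
    ],
)

query = """
SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner
"""

# Key the result by owner so the output does not depend on row order.
rows = {owner: (state, pet) for owner, state, pet in conn.execute(query)}
for owner, (state, pet) in rows.items():
    print(owner, state, pet)
```

In this stand-in the missing pets come back as SQL NULL, which Python shows as `None`. Other engines and clients may display the same cells as blanks or as the literal text `NULL`, which is one reason to check the output of the engine actually in use.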
Cloudera Impala
• Interactive query on Hadoop • think seconds, not minutes
• Nearly ANSI-92 standard SQL • compatible with HiveQL
• Native MPP query engine • built for low-latency queries
Cloudera Impala – Design Choices
• Native daemons, written in C/C++ • No JVM, no MapReduce • Saturate disks on reads • Uses in-memory HDFS caching
• Re-uses Hive metastore • Not as fault-tolerant as MapReduce
Cloudera Impala – Architecture
• Impala Daemon • runs on every node • handles client requests • handles query planning & execution
• State Store Daemon • provides name service • metadata distribution • used for finding data
Impala Query Execution
[Diagram: a SQL app sends a SQL request over ODBC. Each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore. Query results flow back to the app.]
1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
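The numbered steps follow a scatter-gather pattern, which can be sketched in a few lines. Everything below is a toy under stated assumptions: the node names, the data layout, and the function names are hypothetical, the code runs in one process, and it is not how impalad is implemented.

```python
from collections import Counter

# Hypothetical data layout: node name -> rows stored locally on that node.
NODES = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}

def plan(column):
    """Step 2: turn the request into one plan fragment per node holding data."""
    return [{"node": node, "column": column} for node in NODES]

def execute_fragment(fragment):
    """Step 3: run a fragment against the rows local to its node."""
    rows = NODES[fragment["node"]]
    return Counter(row[fragment["column"]] for row in rows)

def coordinate(column):
    """Steps 4-5: merge the partial results and return the total to the client."""
    total = Counter()
    for fragment in plan(column):
        total.update(execute_fragment(fragment))
    return dict(total)

# Step 1: the "request" here is a count of people per state (column index 1).
result = coordinate(1)
print(result)
```

Each fragment touches only the rows on its own node, so only the small partial counts need to move between nodes before the coordinator merges them.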
Cloudera Impala – Results
• Allows for fast iteration/discovery • How much faster?
• 3-4x faster on I/O bound workloads • up to 45x faster on multi-MR queries • up to 90x faster on in-memory cache
What’s Next?
• Download Hadoop! • CDH available at www.cloudera.com • Already done that? Contribute…
• Cloudera provides pre-loaded VMs • http://tiny.cloudera.com/quickstartvm
• Clone our repos! • https://github.com/cloudera