Cloudera Impala
LV Big Data Monthly Meetup #1
November 5th 2014
Maxime Dumas
Systems Engineer
Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada
What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
What This Talk Isn’t About
• deploying
• Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
• depends heavily on data and workload
• coding
• unless you count XML or CSV or SQL
• algorithms
What is Cloudera Impala?
Image: Public Domain, IFCAR
cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/
noun
a modern, open source, MPP SQL query engine for Apache Hadoop.
“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
Impala adoption
Vendor Support
Component (and Founder)      Cloudera  MapR  Amazon  IBM  Pivotal  Hortonworks
Impala (Cloudera)            ✔         ✔     ✔       X    X        X
Hue (Cloudera)               ✔         ✔     X       X    X        ✔
Sentry (Cloudera)            ✔         ✔     X       ✔    ✔        X
Flume (Cloudera)             ✔         ✔     X       ✔    ✔        ✔
Parquet (Cloudera/Twitter)   ✔         ✔     X       ✔    ✔        X
Sqoop (Cloudera)             ✔         ✔     ✔       ✔    ✔        ✔
Ambari (Hortonworks)         X         X     X       X    ✔        ✔
Knox (Hortonworks)           X         X     X       X    X        ✔
Tez (Hortonworks)            X         X     X       X    X        ✔
Drill (MapR)                 X         ✔     X       X    X        X
Quick and dirty, for context.
The Apache Hadoop Ecosystem
©2014 Cloudera, Inc. All rights reserved.
Why Hadoop?
• Scalability
  • Simply scales just by adding nodes
  • Local processing to avoid network bottlenecks
• Efficiency
  • Cost efficiency (<$1k/TB) on commodity hardware
  • Unified storage, metadata, security (no duplication or synchronization)
• Flexibility
  • All kinds of data (blobs, documents, records, etc.)
  • In all forms (structured, semi-structured, unstructured)
  • Store anything, then later analyze what you need
Why “Ecosystem”?
• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow
HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [OSCON ’07]
MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
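To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using the classic word-count example. It is a toy illustration in plain Python, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real job would run the phases in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread over many nodes, which is why the model works well with distributed computing.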
Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
Apache Hive Metastore
• Maps HDFS files to DB-like resources
• Databases
• Tables
• Column/field names, data types
• Roles/users
• InputFormat/OutputFormat
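As a rough mental model, the metastore can be pictured as a catalog that records, for each table, where its files live in HDFS and how to read them. The sketch below is a toy in-memory stand-in, not the Hive Metastore API; the class names (`Column`, `Table`, `ToyMetastore`), the `web.clicks` table, its HDFS path, and the format strings are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_type: str


@dataclass
class Table:
    database: str
    name: str
    location: str        # HDFS directory holding the table's files
    input_format: str    # how files are read
    output_format: str   # how files are written
    columns: list = field(default_factory=list)


class ToyMetastore:
    """Maps (database, table) names to file locations and schemas."""

    def __init__(self):
        self._tables = {}

    def register(self, table):
        self._tables[(table.database, table.name)] = table

    def lookup(self, database, name):
        return self._tables[(database, name)]


if __name__ == "__main__":
    store = ToyMetastore()
    store.register(Table(
        database="web",
        name="clicks",
        location="/data/web/clicks",      # hypothetical path
        input_format="TextInputFormat",   # illustrative label only
        output_format="TextOutputFormat",
        columns=[Column("user_id", "bigint"), Column("url", "string")],
    ))
    table = store.lookup("web", "clicks")
    print(table.location, [c.name for c in table.columns])
```

A query engine that asks for `web.clicks` gets back a directory and a schema, which is all it needs to start scanning files.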
Architecture
CLOUDERA’S ENTERPRISE DATA HUB
[Architecture diagram; labels as they appear on the slide]
• Batch processing: MapReduce, Spark
• Analytic SQL: Impala
• Search: Solr
• Machine learning
• Stream processing: Spark
• 3rd party apps
• Partners, Mahout
• Workload management: YARN
• Storage for any type of data (unified, elastic, resilient, secure)
  • Filesystem: HDFS
  • Online NoSQL: HBase
• Data management: Cloudera Navigator
• System management: Cloudera Manager
• Sentry
WHY DO WE NEED THIS?
But wait…
Familiar interface, but more powerful.
Cloudera Impala
Cloudera Impala
• Interactive query on Hadoop
• think seconds, not minutes
• ANSI-92 standard SQL
• compatible with HiveQL
• Native MPP query engine
• built for low-latency queries
• HDFS and HBase storage
Cloudera Impala – Design Choices
• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce
Benefits of Impala
Unlocks BI/analytics on Hadoop
• Interactive SQL in seconds
• Highly concurrent to handle 100s of users
Native Hadoop flexibility
• No data migration, conversion, or duplication required
• Query existing Hadoop data
• Run multiple frameworks on the same data at the same time
• Supports Parquet for best-of-breed columnar performance
Native MPP query engine designed into Hadoop:
• Unified Hadoop storage
• Unified Hadoop metadata (uses Hive and HCatalog)
• Unified Hadoop security
• Fine-grained role-based access controls with Sentry
Apache-licensed open source
Proven in Production
Cloudera Impala – Architecture
• Impala Daemon
  • runs on every node
  • handles client requests
  • handles query planning & execution
• State Store Daemon
  • provides name service
  • metadata distribution
  • used for finding data
Impala Query Execution
[Diagram: a SQL App connects via ODBC; shared services are the Hive Metastore, HDFS NN, and Statestore; each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase.]
1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
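The five steps can be sketched as a scatter/gather loop. This is a single-process toy, not Impala's planner or coordinator: the `BLOCKS` layout, the node names, and the functions `plan`, `execute_fragment`, and `coordinate` are all invented for illustration, and the "query" is hard-coded as a sum of amounts grouped by state.

```python
from collections import defaultdict

# Hypothetical data placement: which node holds which (state, amount) rows.
BLOCKS = {
    "node1": [("NV", 3), ("CA", 5)],
    "node2": [("NV", 4)],
    "node3": [("AZ", 2), ("CA", 1)],
}


def plan(nodes_with_data):
    """Step 2: turn the request into one plan fragment per data-holding node."""
    return [{"node": node, "op": "partial_sum"} for node in nodes_with_data]


def execute_fragment(fragment):
    """Step 3: run the fragment against rows local to its node."""
    partial = defaultdict(int)
    for state, amount in BLOCKS[fragment["node"]]:
        partial[state] += amount
    return dict(partial)


def coordinate():
    """Steps 1, 4, 5: accept the request, merge partial results, return them."""
    fragments = plan(sorted(BLOCKS))
    totals = defaultdict(int)
    for fragment in fragments:
        # Step 4: intermediate results flow back to the coordinator.
        for state, subtotal in execute_fragment(fragment).items():
            totals[state] += subtotal
    return dict(totals)  # Step 5: final result for the client


if __name__ == "__main__":
    print(coordinate())
```

The point of the design is that each fragment touches only local data, so only small partial aggregates cross the network.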
Cloudera Impala – Results
• Allows for fast iteration/discovery
• How much faster?
• 3-4x faster on I/O bound workloads
• up to 45x faster on multi-MR queries
• up to 90x faster on in-memory cache
Latest SQL Performance
Single User vs 10 User Response Time / Impala Times Faster (lower bars = better)
[Bar chart; y-axis: time in seconds, 0 to 350]
Engine        Single User    10 Users
Impala        5              11
Spark SQL     25 (5.0x)      120 (10.6x)
Presto        37 (7.4x)      302 (27.4x)
Hive-on-Tez   77 (15.4x)     202 (18.3x)
Independent validation by IBM Research SQL-on-Hadoop VLDB paper: “Impala’s database architecture provides significant performance gains”
Previous Milestones
[Timeline chart; vertical axis: Analytic Database Capabilities]
• Spring 2013: Impala 1.0 (GA)
• Summer 2013: Impala 1.1 (Security)
• Fall 2013: Impala 1.2 (Usability)
• Spring 2014: Impala 1.3 (Resource Management)
• Summer 2014: Impala 1.4 (Extensibility)
• Fall 2014: Impala 2.0 (SQL)
Cloudera Impala 2.0
Window Functions
“Aggregate function applied to a partition of the result set” (SQL 2003)
Ex:
sum(population) OVER (PARTITION BY city)
rank() OVER (PARTITION BY state ORDER BY population)
We’ve implemented most of the spec
• PARTITION BY, ORDER BY
• WINDOW
• PRECEDING, FOLLOWING
• ROWS
• Any number of analytic functions in one query
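To see what the two example expressions return, here is a small runnable sketch. It uses Python's built-in `sqlite3` module as a stand-in engine, not Impala, and assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). The `cities` table and its population figures are made up, and the sum is partitioned by state rather than city so that each partition has more than one row.

```python
import sqlite3


def window_demo():
    """Run sum() and rank() as window functions over a tiny made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cities (state TEXT, city TEXT, population INTEGER)")
    conn.executemany(
        "INSERT INTO cities VALUES (?, ?, ?)",
        [
            ("NV", "Las Vegas", 600),   # invented numbers, for illustration only
            ("NV", "Reno", 230),
            ("AZ", "Phoenix", 1500),
            ("AZ", "Tucson", 520),
        ],
    )
    rows = conn.execute(
        """
        SELECT state, city, population,
               sum(population) OVER (PARTITION BY state) AS state_total,
               rank() OVER (PARTITION BY state ORDER BY population DESC) AS rank_in_state
        FROM cities
        ORDER BY state, rank_in_state
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in window_demo():
        print(row)
```

Unlike GROUP BY, the window versions keep every input row and attach the per-partition total and rank to each one.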
Cloudera Impala 2.0
Subqueries
A query that is part of another query. Ex:
select col from t1
where col in
(select c2 from t2)
Support:
• Correlated and uncorrelated subqueries.
• IN, NOT IN, EXISTS, NOT EXISTS
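The difference between the uncorrelated and correlated forms can be tried out with a short sketch. It again uses Python's built-in `sqlite3` as a stand-in engine rather than Impala; the tables `t1` and `t2` follow the slide's example, while their contents and the function name `subquery_demo` are made up.

```python
import sqlite3


def subquery_demo():
    """Return results of an uncorrelated IN and a correlated NOT EXISTS query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (col INTEGER)")
    conn.execute("CREATE TABLE t2 (c2 INTEGER)")
    conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO t2 VALUES (?)", [(2,), (3,), (4,)])

    # Uncorrelated: the inner query does not reference the outer row.
    uncorrelated = conn.execute(
        "select col from t1 where col in (select c2 from t2) order by col"
    ).fetchall()

    # Correlated: the inner query refers to t1.col from the outer row.
    correlated = conn.execute(
        "select col from t1 "
        "where not exists (select 1 from t2 where t2.c2 = t1.col) "
        "order by col"
    ).fetchall()

    conn.close()
    return uncorrelated, correlated


if __name__ == "__main__":
    print(subquery_demo())
```

The first query keeps the `t1` values that also appear in `t2`; the second keeps the ones that do not.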
Cloudera Impala 2.0
Spill to disk joins & aggregations
• Previously, if a query ran out of memory, Impala would abort it
• This meant some big joins (fact table – fact table) could never run.
• All operators that accumulate memory can now spill to disk if necessary.
• Order by (Impala 1.4)
• Join/Agg (Impala 2.0)
• Analytic Functions (Impala 2.0)
• Transparent to existing workloads
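The general idea of spilling can be sketched with a toy aggregation. This is not Impala's implementation: the function `spilling_sum`, its parameters, and the "memory budget" counted in number of groups are assumptions made for illustration. Rows for groups that do not fit in memory are written to hash-partitioned temporary files and aggregated afterwards, one partition at a time.

```python
import pickle
import tempfile
from collections import defaultdict


def spilling_sum(rows, max_groups_in_memory=2, num_partitions=4):
    """Sum values per key, spilling rows to disk once memory is 'full'."""
    in_memory = defaultdict(int)
    spill_files = [tempfile.TemporaryFile() for _ in range(num_partitions)]
    spilled = False

    for key, value in rows:
        if key in in_memory or len(in_memory) < max_groups_in_memory:
            in_memory[key] += value
        else:
            # Over budget: write the row to the partition chosen by its hash.
            pickle.dump((key, value), spill_files[hash(key) % num_partitions])
            spilled = True

    result = dict(in_memory)
    for spill_file in spill_files:
        spill_file.seek(0)
        partition = defaultdict(int)
        while True:
            try:
                key, value = pickle.load(spill_file)
            except EOFError:
                break
            partition[key] += value
        # A key lives either in memory or in exactly one partition.
        result.update(partition)
        spill_file.close()
    return result, spilled


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("c", 5), ("d", 6)]
    print(spilling_sum(data))
```

The caller gets the same answer whether or not anything was spilled, which is the sense in which spilling is transparent to existing workloads. The sketch assumes each spilled partition fits in memory by itself.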
Cloudera Impala 2.1 +
• Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015)
• MERGE statement – enables merging updates into existing tables
• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS
• SQL SET operators – MINUS, INTERSECT
• Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
• UDTFs (user-defined table functions) – for more advanced user functions and extensibility
• Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala
• Parquet enhancements – continued performance gains including index pages
• Amazon S3 integration
Hold onto something, folks.
Quick Demo
Try It Out
Apache-licensed open source
• Download: cloudera.com/downloads
• Email: [email protected]
• Join: groups.cloudera.org
Cloudera Live
Free, Interactive Tutorials at cloudera.com/live
Special thanks: LAS VEGAS BIG DATA
Preferably related to the talk… or not.
Questions?