Cloudera Impala - San Diego Big Data Meetup, August 13th 2014
Cloudera Impala presentation to San Diego Big Data Meetup (http://www.meetup.com/sdbigdata/events/189420582/)
Cloudera Impala
SD Big Data Monthly Meetup #2
August 13th 2014
Maxime Dumas, Systems Engineer
Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada
What Does Cloudera Do?
• product
  • distribution of Hadoop components, Apache licensed
  • enterprise tooling
• support
• training
• services (aka consulting)
• community
What This Talk Isn’t About
• deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
  • depends heavily on data and workload
• coding
  • unless you count XML or CSV or SQL
• algorithms
Public Domain IFCAR
What is Cloudera Impala?
cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/ noun
a modern, open source, MPP SQL query engine for Apache Hadoop. “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
Quick and dirty, for context.
The Apache Hadoop Ecosystem
Why “Ecosystem?”
• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow
HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html
Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [ OSCON ’07 ]
MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html
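To make the MapReduce paradigm above concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real job would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["Hadoop HDFS", "hadoop MapReduce"])` counts "hadoop" twice and the other words once. The whole input is processed before any result is available, which is the batch behavior the slide refers to.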
Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
  • a “SQL-like” language
• Eases analysis using MapReduce
Apache Hive Metastore
• Maps HDFS files to DB-like resources
  • Databases
  • Tables
  • Column/field names, data types
  • Roles/users
  • InputFormat/OutputFormat
Sqoop
©2011 Cloudera, Inc. All Rights Reserved.
• SQL to Hadoop
• Tool to import/export any JDBC-supported database into Hadoop
• Transfer data between Hadoop and external databases or EDW
• High performance connectors for some RDBMS
• Oracle, Teradata, Netezza
• Developed at Cloudera
Familiar interface, but more powerful.
Cloudera Impala
Cloudera Impala
Interactive SQL for Hadoop
§ Responses in seconds
§ Nearly ANSI-92 standard SQL with Hive SQL
Native MPP Query Engine
§ Purpose-built for low-latency queries
§ Separate runtime from MapReduce
§ Designed as part of the Hadoop ecosystem
Open Source
§ Apache-licensed
Benefits of Impala
More & Faster Value from “Big Data”
§ Interactive BI/Analytics experience via SQL
§ No delays from data migration
Flexibility
§ Query across existing data
§ Select best-fit file formats (Parquet, Avro, etc.)
§ Run multiple frameworks on the same data at the same time
Cost Efficiency
§ Reduce movement, duplicate storage & compute
§ 10% to 1% of the cost of an analytic DBMS
Full Fidelity Analysis
§ No loss from aggregations or fixed schemas
Impala Use Cases
Cost-effective, ad hoc query environment that offloads the data warehouse for:
Interactive BI/analytics on more data
Asking new questions – exploration, ML
Data processing with tight SLAs
Query-able archive w/full fidelity
Our Design Strategy
One pool of (open) data
One metadata model
One security framework
One set of system resources
An Integrated Part of the Hadoop System
[Diagram: the Hadoop platform stack]
Engines:
• Batch Processing: MapReduce, Hive & Pig
• Interactive SQL: Cloudera Impala
• Interactive Search: Cloudera Search
• Machine Learning: Mahout, ClouderaML, Oryx
• Math & Statistics: SAS, R
• In-Memory Processing & Streaming: Spark
Shared layers:
• Storage: HDFS (text, RCFile, Parquet, Avro, etc.), HBase (records)
• Integration
• Resource Management
• Metadata
• Security
Impala Key Features
Fast
§ In-memory data transfers
§ Partitioned joins
§ Fully distributed aggregations
Flexible
§ Query data in HDFS & HBase
§ Supports multiple file formats & compression algorithms
§ Java & Native UDFs, UDAFs
Secure
§ Integrated with Hadoop security
§ Kerberos authentication
§ Authorization (Sentry)
Easy to Implement
§ Leverages Hive’s ODBC/JDBC connectors, metastore & SQL syntax
§ Open source
Easy to Use
§ Interact with data via SQL
§ Certified with leading BI tools
Simple to Manage
§ Deploy, configure & monitor with Cloudera Manager
§ Integrated with Hadoop resource management
What’s Coming?*
In the Near Term:
SQL 2003-Compliant Analytic Window Functions
Additional Authentication Mechanisms
User Defined Table Functions
Intra-node Parallelized Aggregations & Joins
Nested Data
Enhanced YARN-Integrated Resource Manager
Dynamic Partition Pruning
*On the roadmap… no guarantees
Impala Plays Well with Others
BI Partners: Building on the Enterprise Standard
POWERED BY IMPALA
Not All SQL On Hadoop Is Created Equal
Batch MapReduce: Make MapReduce faster. Slow, still batch.
Remote Query: Pull data from HDFS over the network to the DW compute layer. Slow, expensive.
Siloed DBMS: Load data into a proprietary database file. Rigid, siloed data, slow ETL.
Impala: Native MPP query engine that’s integrated into Hadoop. Fast, flexible, cost-effective.
More Detail On Alternative Approaches
Batch MapReduce
§ Batch-oriented
§ High latency
Remote Query
§ Network bottleneck
§ 2x the hardware
§ Duplicate metadata, security, SQL, etc.
Siloed DBMS
§ RDBMS rigidity
§ Query subset of data
§ Duplicate storage, metadata, security, SQL, etc.
[Diagrams: Remote Query (DBMS compute over Hadoop HDFS storage); Siloed DBMS (a proprietary DBMS stack beside the standard, shared Hadoop stack whose engines include MapReduce, Hive, Pig, Impala, etc.)]
Other Sexy New Big Data MPP Tools
Presto: Purpose-Built MPP Engine; Similar Architecture to Impala; Few Performance Comparisons, but Impala Anecdotally 5x-10x Faster
Shark: Hive-Compatible Data Warehouse for Spark; Great Performance until Required to go to Disk, at Which Point Impala Better; With HDFS Caching Impala will Perform on Par from a Memory Perspective
Drill: Open Source version of Dremel; Another MPP Engine; Multiple Data Formats and Sources
Phoenix – Sort Of: SQL Skin over HBase (and Only HBase); Subset of SQL Standard
What About an EDW/RDBMS?
“Right Tool for the Right Job”
EDW/RDBMS Great For:
• OLTP’s complex transactions
• Highly planned and optimized known workloads
• Operational reports and repeated known queries
Impala Great For:
• Exploratory analytics with previously-unknown queries
• Queries on big and growing data sets
EDW/RDBMS Can’t:
• Dump in raw data then later define schema and query what you want
• Evolve schemas without an expensive schema upgrade planning process
• Simply scale just by adding industry-standard servers
• Store at < $1k/TB instead of $10-150k/TB
Impala Technical Details
The Impala Advantage
Where does the Performance Come From?
No MapReduce; No JVM; All Native
In-Memory Data Transfers
Saturate Disks on Reads
Optimized File Format (e.g. Parquet)
In-Memory HDFS Caching
Cost-Based Join Order Optimization – Frees User from Having to Guess the Correct Join Order
Impala and Hive
Shares Everything Client-Facing
§ Metadata (table definitions)
§ ODBC/JDBC drivers
§ SQL syntax (Hive SQL)
§ Flexible file formats
§ Machine pool
§ Hue GUI
But Built for Different Purposes
§ Hive: runs on MapReduce and ideal for batch processing
§ Impala: native MPP query engine ideal for interactive SQL
[Diagram: Hive (SQL syntax + MapReduce compute framework, batch processing) and Impala (SQL syntax + compute framework, interactive SQL) share storage (HDFS: text, RCFile, Parquet, Avro, etc.; HBase: records), integration, resource management, and metadata]
Impala Query Execution
[Diagram: a SQL app connects via ODBC to one of three impalad nodes; each node has a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase; the shared services are the Hive Metastore, HDFS NN, and Statestore]
1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
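The numbered steps above can be illustrated with a toy scatter-gather aggregation in Python, roughly what a query like `SELECT state, SUM(amount) ... GROUP BY state` might do. This is a sketch under stated assumptions, not Impala's implementation: the node names, the `LOCAL_DATA` layout, and the functions `plan`, `execute_fragment`, and `coordinate` are all hypothetical, and everything runs sequentially in one process.

```python
from collections import Counter

# Hypothetical cluster layout: node name -> (state, amount) rows on that node's local disks.
LOCAL_DATA = {
    "node1": [("ca", 10), ("nv", 5)],
    "node2": [("ca", 7), ("az", 3)],
    "node3": [("az", 4), ("nv", 1)],
}


def plan(nodes):
    """Step 2: produce one partial-aggregation fragment per node that holds data."""
    return [{"node": node, "op": "partial_sum"} for node in nodes]


def execute_fragment(fragment, local_data):
    """Steps 3-4: a node sums only its own local rows and returns the partial result."""
    partial = Counter()
    for state, amount in local_data[fragment["node"]]:
        partial[state] += amount
    return partial


def coordinate(local_data):
    """Steps 3-5: run each fragment where its data lives, then merge the partials."""
    total = Counter()
    for fragment in plan(sorted(local_data)):
        total.update(execute_fragment(fragment, local_data))
    return dict(total)
```

Each node returns only its small partial sums, so the raw rows never leave the node that stores them. That is the point of step 3: work is sent to the data rather than the data to the work.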
Parquet File Format
Open source, columnar Hadoop file format developed by Cloudera & Twitter
Limits the IO to only the data that is needed
Supports storing each column in a separate file
Saves space: columnar layout compresses better
Enables better scans: load only the columns that are needed
Supports index pages for fast lookup
Extensible value encodings
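A small Python sketch may help show why a columnar layout limits IO and compresses well. It does not reproduce Parquet's on-disk format: the table contents and the functions `to_columnar`, `scan`, and `run_length_encode` are invented for illustration, and run-length encoding stands in here as one simple example of a value encoding.

```python
# Hypothetical table with three columns; the values are made up for illustration.
ROWS = [
    {"user_id": 1, "state": "CA", "amount": 10.0},
    {"user_id": 2, "state": "NV", "amount": 5.5},
    {"user_id": 3, "state": "CA", "amount": 7.25},
]


def to_columnar(rows):
    """Pivot row-oriented records into one contiguous list per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def scan(columns, wanted):
    """Read only the requested columns; the others are never touched."""
    return {name: columns[name] for name in wanted}


def run_length_encode(values):
    """A simple value encoding: adjacent repeats collapse to [value, count]."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded
```

A query that only needs `amount` calls `scan(to_columnar(ROWS), ["amount"])` and skips the other two columns entirely. Because values of one column sit next to each other, repeats are adjacent, which is what lets an encoding like `run_length_encode` shrink them.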
Impala Performance Results
Impala Performance Results
• Impala’s Milestone in Jan 2014:
  • Comparable commercial MPP DBMS speed
  • Natively on Hadoop
• Three Result Sets:
  • Impala vs Hive 0.12 (Impala 6-70x faster)
  • Impala vs “DBMS-Y” (Impala average of 2x faster)
  • Impala scalability (Impala achieves linear scale)
• Background
  • 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
  • Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
  • Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks)
  • Methodical testing (multiple runs, reviewed fairness for competition, etc.)
• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
Enough slides… DEMO TIME!
So What is Cloudera Impala?
What’s Next?
• Download Hadoop!
  • CDH available at www.cloudera.com
  • Try it online: Cloudera Live
• Cloudera provides pre-loaded VMs
  • http://tiny.cloudera.com/quickstartvm
• Ride Impala!
  • http://impala.io/
Special thanks:
SAN DIEGO BIG DATA
Preferably related to the talk… or not.
Questions?
Thank You!
Maxime Dumas
[email protected]
We’re hiring.