cloudera impala - las vegas big data meetup nov 5th 2014

Cloudera Impala

LV Big Data Monthly Meetup #1

November 5th 2014

Maxime Dumas, Systems Engineer

Thirty Seconds About Max

• Systems Engineer

• aka Sales Engineer

• SoCal, AZ, NV

• former coder of PHP

• teaches meditation + yoga

• from Montreal, Canada

What Does Cloudera Do?

• product

• distribution of Hadoop components, Apache licensed

• enterprise tooling

• support

• training

• services (aka consulting)

• community

What This Talk Isn’t About

• deploying

• Puppet, Chef, Ansible, homegrown scripts, intern labor

• sizing & tuning

• depends heavily on data and workload

• coding

• unless you count XML or CSV or SQL

• algorithms

What is Cloudera Impala?

Image: IFCAR (public domain)

cloud·e·ra im·pal·a

/kloudˈi(ə)rə imˈpalə/

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

Impala adoption

Vendor Support

Component (and Founder)      Cloudera  MapR  Amazon  IBM  Pivotal  Hortonworks
Impala (Cloudera)            ✔         ✔     ✔       X    X        X
Hue (Cloudera)               ✔         ✔     X       X    X        ✔
Sentry (Cloudera)            ✔         ✔     X       ✔    ✔        X
Flume (Cloudera)             ✔         ✔     X       ✔    ✔        ✔
Parquet (Cloudera/Twitter)   ✔         ✔     X       ✔    ✔        X
Sqoop (Cloudera)             ✔         ✔     ✔       ✔    ✔        ✔
Ambari (Hortonworks)         X         X     X       X    ✔        ✔
Knox (Hortonworks)           X         X     X       X    X        ✔
Tez (Hortonworks)            X         X     X       X    X        ✔
Drill (MapR)                 X         ✔     X       X    X        X

The Apache Hadoop Ecosystem

Quick and dirty, for context.


Why Hadoop?

• Scalability
  • Simply scales just by adding nodes
  • Local processing to avoid network bottlenecks
• Efficiency
  • Cost efficiency (<$1k/TB) on commodity hardware
  • Unified storage, metadata, security (no duplication or synchronization)
• Flexibility
  • All kinds of data (blobs, documents, records, etc.)
  • In all forms (structured, semi-structured, unstructured)
  • Store anything, then later analyze what you need

Why “Ecosystem?”

• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow

HDFS

• Distributed, highly fault-tolerant filesystem

• Optimized for large streaming access to data

• Based on Google File System

• http://research.google.com/archive/gfs.html

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [OSCON ’07]

MapReduce (MR)

• Programming paradigm

• Batch oriented, not realtime

• Works well with distributed computing

• Lots of Java, but other languages supported

• Based on Google’s paper

• http://research.google.com/archive/mapreduce.html
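To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using the classic word count. It is an illustration only, not Hadoop's Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real job would run the phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order, in one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["big data", "big cluster"])` counts "big" twice and the other words once. The batch nature is visible even here: nothing is returned until every line has been mapped and grouped.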

Apache Hive

• Abstraction of Hadoop’s Java API

• HiveQL “compiles” down to MR

• a “SQL-like” language

• Eases analysis using MapReduce

Apache Hive Metastore

• Maps HDFS files to DB-like resources

• Databases

• Tables

• Column/field names, data types

• Roles/users

• InputFormat/OutputFormat

Architecture


[Diagram: Cloudera’s Enterprise Data Hub. Labels: 3rd-party apps; batch processing (MapReduce, Spark); analytic SQL (Impala); search; machine learning; stream processing; workload management (YARN); filesystem; online NoSQL; Sentry; partners, Mahout; storage for any type of data: unified, elastic, resilient, secure.]

But wait… why do we need this?

Familiar interface, but more powerful.

Cloudera Impala

• Interactive query on Hadoop

• think seconds, not minutes

• ANSI-92 standard SQL

• compatible with HiveQL

• Native MPP query engine

• built for low-latency queries

• HDFS and HBase storage

Cloudera Impala – Design Choices

• Native daemons, written in C/C++

• No JVM, no MapReduce

• Saturate disks on reads

• Uses in-memory HDFS caching

• Re-uses Hive metastore

• Not as fault-tolerant as MapReduce

Benefits of Impala

Unlocks BI/analytics on Hadoop
• Interactive SQL in seconds
• Highly concurrent to handle 100s of users

Native Hadoop flexibility
• No data migration, conversion, or duplication required
• Query existing Hadoop data
• Run multiple frameworks on the same data at the same time
• Supports Parquet for best-of-breed columnar performance

Native MPP query engine designed into Hadoop
• Unified Hadoop storage
• Unified Hadoop metadata (uses Hive and HCatalog)
• Unified Hadoop security
• Fine-grained role-based access controls with Sentry

Apache-licensed open source

Proven in Production

Cloudera Impala – Architecture

• Impala Daemon
  • runs on every node
  • handles client requests
  • handles query planning & execution
• State Store Daemon
  • provides name service
  • metadata distribution
  • used for finding data

Impala Query Execution

[Diagram: a SQL app sends a SQL request to a cluster. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase. The Hive Metastore, HDFS NameNode, and Statestore are shared by all nodes. Query results flow back to the SQL app.]

1) Request arrives via ODBC/JDBC/Hue/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
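The five steps above can be sketched as a toy, single-process simulation. This is not Impala code: the `BLOCKS` layout, the fragment dictionaries, and the function names are all invented for illustration, and real fragments run in parallel inside separate impalad processes and stream rows over the network.

```python
from collections import Counter

# Hypothetical cluster: node name -> rows of (state, population) stored on that node.
BLOCKS = {
    "node1": [("NV", 600), ("AZ", 1500)],
    "node2": [("NV", 230)],
    "node3": [("AZ", 520), ("CA", 3900)],
}


def plan(wanted_states):
    """Step 2: turn the request into one scan-and-aggregate fragment per data node."""
    return [{"node": node, "states": wanted_states} for node in BLOCKS]


def execute_fragment(fragment):
    """Step 3: run a fragment against the rows local to its node (partial sums)."""
    partial = Counter()
    for state, population in BLOCKS[fragment["node"]]:
        if state in fragment["states"]:
            partial[state] += population
    return partial


def coordinate(wanted_states):
    """Steps 4-5: merge the intermediate results and return the final answer."""
    total = Counter()
    for fragment in plan(wanted_states):
        total.update(execute_fragment(fragment))  # Counter.update adds counts
    return dict(total)
```

`coordinate({"NV", "AZ"})` plays the role of a query like `SELECT state, sum(population) ... GROUP BY state` restricted to two states. Each node only reads its own rows, and only the small partial sums travel to the coordinator.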

Cloudera Impala – Results

• Allows for fast iteration/discovery

• How much faster?

• 3-4x faster on I/O bound workloads

• up to 45x faster on multi-MR queries

• up to 90x faster on in-memory cache

Latest SQL Performance

[Chart: Single User vs 10 User Response Time / Impala Times Faster (lower bars = better), comparing Impala, Spark SQL, Presto, and Hive-on-Tez.]

Independent validation by IBM Research SQL-on-Hadoop VLDB paper: “Impala’s database architecture provides significant performance gains”

Previous Milestones

• Impala 1.0 (GA): Spring 2013
• Impala 1.1 (Security): Summer 2013
• Impala 1.2 (Usability): Fall 2013
• Impala 1.3 (Resource Management): Spring 2014
• Impala 1.4 (Extensibility): Summer 2014
• Impala 2.0 (SQL): Fall 2014

Cloudera Impala 2.0

Window Functions

“Aggregate function applied to a partition of the result set” (SQL 2003)

Ex:
sum(population) OVER (PARTITION BY city)
rank() OVER (PARTITION BY state ORDER BY population)

We’ve implemented most of the spec
• PARTITION BY, ORDER BY
• WINDOW
• PRECEDING, FOLLOWING
• ROWS
• Any number of analytic functions in one query
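A runnable sketch of the two window expressions, using SQLite through Python's `sqlite3` module as a stand-in for Impala (an assumption for convenience; it needs SQLite 3.25 or later for window functions). The `places` table and its numbers are made up, and the sum here is partitioned by state rather than city so that each partition holds more than one row.

```python
import sqlite3


def window_demo():
    """Return per-city rows annotated with a state total and a rank within the state."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (state TEXT, city TEXT, population INTEGER)")
    # Hypothetical sample data, not real census figures.
    conn.executemany(
        "INSERT INTO places VALUES (?, ?, ?)",
        [
            ("NV", "Las Vegas", 600),
            ("NV", "Reno", 230),
            ("AZ", "Phoenix", 1500),
            ("AZ", "Tucson", 520),
        ],
    )
    rows = conn.execute(
        """
        SELECT state, city, population,
               sum(population) OVER (PARTITION BY state) AS state_total,
               rank() OVER (PARTITION BY state ORDER BY population DESC) AS rank_in_state
        FROM places
        ORDER BY state, rank_in_state
        """
    ).fetchall()
    conn.close()
    return rows
```

Unlike GROUP BY, the window functions keep every input row: each city still appears, with its state's total and its rank attached as extra columns.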

Cloudera Impala 2.0

Subqueries

A query that is part of another query.

Ex:
select col from t1
where col in
(select c2 from t2)

Support:

• Correlated and uncorrelated subqueries.

• IN, NOT IN, EXISTS, NOT EXISTS
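The four supported forms can be tried against the slide's `t1` and `t2` tables. This sketch again uses SQLite via Python's `sqlite3` as a stand-in for Impala (an assumption), with made-up values; the EXISTS variants are correlated because the inner query refers to `t1.col`.

```python
import sqlite3


def subquery_demo():
    """Run IN, NOT IN, EXISTS, and NOT EXISTS subqueries over two tiny tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE t1 (col INTEGER);
        CREATE TABLE t2 (c2 INTEGER);
        INSERT INTO t1 VALUES (1), (2), (3), (4);
        INSERT INTO t2 VALUES (2), (4), (6);
        """
    )

    def column(sql):
        # Collect the single selected column into a plain list.
        return [row[0] for row in conn.execute(sql)]

    results = {
        # Uncorrelated: the inner query runs independently of the outer row.
        "in": column("SELECT col FROM t1 WHERE col IN (SELECT c2 FROM t2) ORDER BY col"),
        "not_in": column("SELECT col FROM t1 WHERE col NOT IN (SELECT c2 FROM t2) ORDER BY col"),
        # Correlated: the inner query references t1.col from the outer query.
        "exists": column(
            "SELECT col FROM t1 WHERE EXISTS "
            "(SELECT 1 FROM t2 WHERE t2.c2 = t1.col) ORDER BY col"
        ),
        "not_exists": column(
            "SELECT col FROM t1 WHERE NOT EXISTS "
            "(SELECT 1 FROM t2 WHERE t2.c2 = t1.col) ORDER BY col"
        ),
    }
    conn.close()
    return results
```

With no NULLs in `t2`, IN and EXISTS select the same rows, as do NOT IN and NOT EXISTS; the two negated forms diverge once the inner column contains NULL.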

Cloudera Impala 2.0

Spill to disk joins & aggregations

• Previously, if a query ran out of memory, Impala would abort it

• This meant some big joins (fact table to fact table) could never run.

• All operators that accumulate memory can now spill to disk if necessary.

• Order by (Impala 1.4)

• Join/Agg (Impala 2.0)

• Analytic Functions (Impala 2.0)

• Transparent to existing workloads
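The general idea of a spilling aggregation can be sketched in a few lines. This is a simplified illustration of the technique, not Impala's implementation: the function name, the `max_keys_in_memory` budget (counted in distinct keys rather than bytes), and the pickle-based spill files are all assumptions made for the example.

```python
import pickle
import tempfile
from collections import defaultdict


def spilling_group_sum(rows, max_keys_in_memory, num_partitions=4):
    """Sum values per key, spilling partial sums to disk when memory is 'full'.

    Returns (totals, number_of_spills).
    """
    partials = defaultdict(int)
    spill_files = None
    spill_count = 0

    def spill():
        nonlocal spill_files, spill_count
        if spill_files is None:
            spill_files = [tempfile.TemporaryFile() for _ in range(num_partitions)]
        # Hash-partition the partial sums so each key always lands in one file.
        for key, total in partials.items():
            pickle.dump((key, total), spill_files[hash(key) % num_partitions])
        partials.clear()
        spill_count += 1

    for key, value in rows:
        partials[key] += value
        if len(partials) > max_keys_in_memory:
            spill()

    if spill_files is None:
        return dict(partials), 0  # everything fit in memory

    if partials:
        spill()  # flush what is still in memory

    # Merge one partition at a time; only that partition's keys are held in memory.
    result = {}
    for spill_file in spill_files:
        spill_file.seek(0)
        merged = defaultdict(int)
        while True:
            try:
                key, total = pickle.load(spill_file)
            except EOFError:
                break
            merged[key] += total
        result.update(merged)
        spill_file.close()
    return result, spill_count
```

The caller gets the same totals whether or not a spill happened, which mirrors the "transparent to existing workloads" point: only the speed changes, not the answer.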

Cloudera Impala 2.1 +

• Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015)
• MERGE statement – enables merging updates into existing tables
• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS
• SQL SET operators – MINUS, INTERSECT
• Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
• UDTFs (user-defined table functions) – for more advanced user functions and extensibility
• Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala
• Parquet enhancements – continued performance gains including index pages
• Amazon S3 integration

Quick Demo

Hold onto something, folks.


Apache-licensed open source

• Download: cloudera.com/downloads

• Email: impala-user@cloudera.org

• Join: groups.cloudera.org

Cloudera Live

Free, Interactive Tutorials at cloudera.com/live

Try It Out

Special thanks: Las Vegas Big Data

Questions?

Preferably related to the talk… or not.

Thank You!

Maxime Dumas

mdumas@cloudera.com

We’re hiring.
