incredible impala

58
1 Impala: Modern, Open-Source SQL Engine For Hadoop Gwen Shapira @gwenshap [email protected]

Upload: chen-gwen-shapira

Post on 11-Aug-2014

2.258 views

Category:

Data & Analytics


1 download

DESCRIPTION

Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of a Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax. This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.

TRANSCRIPT

Page 1: Incredible Impala

1

Impala:Modern, Open-Source SQL EngineFor HadoopGwen Shapira@[email protected]

Page 2: Incredible Impala

2

Agenda

• Why Hadoop?• Data Processing in Hadoop• User’s view of Impala• Impala Use Cases• Impala Architecture• Performance highlights

Page 3: Incredible Impala

3

In the beginning….

was the database

Page 4: Incredible Impala

4

For a while, the database was all we needed.

Page 5: Incredible Impala

5

Data is not what it used to beD

ata

Gro

wth

STRUCTURED DATA – 20%

1980 2012

UNSTRUCTURED DATA – 80%

Page 6: Incredible Impala

6

Hadoop was Invented to Solve:

• Large volumes of data• Data that is only valuable in bulk• High ingestion rates• Data that requires more processing• Differently structured data• Evolving data• High license costs

Page 7: Incredible Impala

7

What is Apache Hadoop?

Has the Flexibility to Store and Mine Any Type of Data

Ask questions across structured and unstructured data that were previously impossible to ask or solve

Not bound by a single schema

Excels atProcessing Complex Data

Scale-out architecture divides workloads across multiple nodes

Flexible file system eliminates ETL bottlenecks

ScalesEconomically

Can be deployed on commodity hardware

Open source platform guards against vendor lock

Hadoop Distributed File System (HDFS)

Self-Healing, High Bandwidth Clustered

Storage

MapReduce

Distributed Computing Framework

Apache Hadoop is an open source platform for data storage and processing that is…

Distributed Fault tolerant Scalable

CORE HADOOP SYSTEM COMPONENTS

Page 8: Incredible Impala

8

Processing Data in Hadoop

Page 9: Incredible Impala

9

Map Reduce

• Versatile• Flexible• Scalable

• High latency• Batch oriented• Java• Challenging paradigm

Page 10: Incredible Impala

10

Hive & Pig

• Hive – Turn SQL into MapReduce• Pig – Turn execution plans into MapReduce• Makes MapReduce easier• But not any faster

Page 11: Incredible Impala

11

Towards a Better Map Reduce

• Spark – Next generation MapReduceWith in-memory cachingLazy EvaluationFast recovery times from node failures

• Tez – Next generation MapReduce. Reduced overhead, more flexibility.Currently Alpha

Page 12: Incredible Impala

12

And now to something completely different!

Page 13: Incredible Impala

13

What is Impala?

Page 14: Incredible Impala

14

Impala Overview

Interactive SQL for Hadoop Responses in seconds Nearly ANSI-92 standard SQL with Hive SQL

Native MPP Query Engine Purpose-built for low-latency queries Separate runtime from MapReduce Designed as part of the Hadoop ecosystem

Open Source Apache-licensed

Page 15: Incredible Impala

Impala OverviewRuns directly within Hadoop

reads widely used Hadoop file formats talks to widely used Hadoop storage managers runs on same nodes that run Hadoop processes

High performance C++ instead of Java runtime code generation completely new execution engine – No MapReduce

Page 16: Incredible Impala

Beta version released since October 2012 General availability (v1.0) release out since April 2013 Latest release (v1.2.3) released on December 23rd

Impala is Production Ready

Page 17: Incredible Impala

User View of Impala: Overview

• Distributed service in cluster: one Impala daemon on each node with data

• Highly available: no single point of failure• Submit query to any daemon:

• ODBC/JDBC• Impala CLI• Hue

• Query is distributed to all nodes with relevant data• Impala uses Hive’s metadata

Page 18: Incredible Impala

User View of Impala: File Formats

• There is no ‘Impala format’. • Impala supports:

• Uncompressed/lzo-compressed text files• Sequence files and RCFile with snappy/gzip

compression• Avro data files• Parquet columnar format (more on that later)• HBase

Page 19: Incredible Impala

User View of Impala: SQL Support• Most of SQL-92• INSERT INTO … SELECT …• Only equi-joins; no non-equi joins, no cross products• Order By requires Limit (for now)• DDL support• SQL-style authorization via Apache Sentry (incubating)

• UDFs and UDAFs are supported

Page 20: Incredible Impala

20

Use Cases

Page 21: Incredible Impala

21

Impala Use Cases

Interactive BI/analytics on more data

Asking new questions – exploration, ML

Data processing with tight SLAs

Query-able archive w/full fidelity

Cost-effective, ad hoc query environment that offloads the data warehouse for:

Page 22: Incredible Impala

22

Global Financial Services Company

Saved 90% on incremental EDW spend &improved performance by 5x

Offload data warehouse for query-able archive

Store decades of data cost-effectively

Process & analyze on the same system

Improved capabilities through interactive query on more data

Page 23: Incredible Impala

24

Digital Media Company

20x performance improvement for exploration & data discovery

Easily identify new data sets for modeling

Interact with raw data directly to test hypotheses

Avoid expensive DW schema changes

Accelerate ‘time to answer’

Page 24: Incredible Impala

25

Impala Architecture

Page 25: Incredible Impala

Impala Architecture

• Impala daemon (impalad) – N instances• Query execution

• State store daemon (statestored) – 1 instance• Provides name service and metadata distribution

• Catalog daemon (catalogd) – 1 instance• Relays metadata changes to all impalad’s

Page 26: Incredible Impala

27

Impala Query Execution

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

SQL App

ODBCHive

Metastore HDFS NN Statestore

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

SQL request

1) Request arrives via ODBC/JDBC/HUE/Shell

Page 27: Incredible Impala

28

Impala Query Execution

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

SQL App

ODBCHive

Metastore HDFS NN Statestore

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

2) Planner turns request into collections of plan fragments3) Coordinator initiates execution on impalad(s) local to data

Page 28: Incredible Impala

29

Impala Query Execution

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

SQL App

ODBCHive

Metastore HDFS NN Statestore

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

4) Intermediate results are streamed between impalad(s)5) Query results are streamed back to client

Query results

Page 29: Incredible Impala

30

Query Planner

2-phase planning Left deep tree Partition plan to maximize data locality

Join order Before 1.2.3: Order of tables in query. 1.2.3 and above: Cost based if statistics exist

Plan Operators Scan, HashJoin, HashAggregation, Union, TopN, Exchange All operators are fully distributed

Page 30: Incredible Impala

31

Query Execution Example

Page 31: Incredible Impala

Simple Example

SELECT state, SUM(revenue)FROM HdfsTbl h JOIN HbaseTbl b ON (id)GROUP BY state ORDER BY 2 desc LIMIT 10

Page 32: Incredible Impala

How does a database execute a query?

• Left Deep Tree• Data flows from bottom

to top TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 33: Incredible Impala

Wait – Why is this a left-deep tree?

HashJoin

Scan: t1

Scan: t3

Scan: t2

HashJoin

Agg

HashJoin

Scan: t0

Page 34: Incredible Impala

How does a database execute a query?

• Hash Join Node fills the hash table with the RHS table data.

• So, the RHS table (Hbase scan) is scanned first.

TopN

Agg

HashJoin

HdfsScan

HbaseScan

Scan Hbase

first

Page 35: Incredible Impala

How does a database execute a query?

• Hash Join Node fills the hash table with the RHS table data. TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 36: Incredible Impala

How does a database execute a query?

• Hash Join Node fills the hash table with the RHS table data. TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 37: Incredible Impala

How does a database execute a query?

• Hash Join Node fills the hash table with the RHS table data. TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 38: Incredible Impala

How does a database execute a query?

• Start scanning LHS (Hdfs) table

• For each row from LHS, probe the hash table for matching rows

TopN

Agg

HashJoin

HdfsScan

HbaseScan

Probe hash table and a matching row is found.

Page 39: Incredible Impala

How does a database execute a query?

• Matched rows are bubbled up the execution tree TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 40: Incredible Impala

How does a database execute a query?

• Continue scanning the LHS (Hdfs) table

• For each row from LHS, probe the hash table for matching rows

• Unmatched rows are discarded

TopN

Agg

HashJoin

HdfsScan

HbaseScan

No matching row

Page 41: Incredible Impala

How does a database execute a query?

• Continue scanning the LHS (Hdfs) table

• For each row from LHS, probe the hash table for matching rows

• Unmatched rows are discarded

TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 42: Incredible Impala

How does a database execute a query?

• Continue scanning the LHS (Hdfs) table

• For each row from LHS, probe the hash table for matching rows

• Unmatched rows are discarded

TopN

Agg

HashJoin

HdfsScan

HbaseScan

Probe hash table and a matching row is found.

Page 43: Incredible Impala

How does a database execute a query?

• Matched rows are bubbled up the execution tree TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 44: Incredible Impala

How does a database execute a query?

• Continue scanning the LHS (Hdfs) table

• For each row from LHS, probe the hash table for matching rows

• Unmatched rows are discarded

TopN

Agg

HashJoin

HdfsScan

HbaseScan

No matching row

Page 45: Incredible Impala

How does a database execute a query?

• All rows have been returned from the hash join node. Agg node can start returning rows

• Rows are bubbled up the execution tree

TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 46: Incredible Impala

How does a database execute a query?

• Rows from the aggregation node bubbles up to the top-n node

TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 47: Incredible Impala

How does a database execute a query?

• Rows from the aggregation node bubbles up to the top-n node

• When all rows are returned by the agg node, top-n node can restart return rows to the end-user

TopN

Agg

HashJoin

HdfsScan

HbaseScan

Page 48: Incredible Impala

49

Key takeaways Data flows from bottom to top in the execution tree

and finally goes to the end user Larger tables go on the left Collect statistics Filter early

Page 49: Incredible Impala

Simpler Example

SELECT state, SUM(revenue)FROM HdfsTbl h JOIN HbaseTbl b ON (id)GROUP BY state

Page 50: Incredible Impala

How does an MPP database execute a query?

Tbl bScan

HashJoin

Tbl aScan

Exch

Agg

Exch

AggAgg

HashJoin

Tbl aScan

Tbl bScan

Broadcast

Re-distribute by “state”

Page 51: Incredible Impala

How does a MPP database execute a query

A join B

A join B

A join B

Local Agg

Local Agg

Local Agg

Scan and Broadcast Tbl B

Final Agg

Final Agg

Final Agg

Re-distribute by “state”

Local read Tbl A

Page 52: Incredible Impala

53

Performance

Page 53: Incredible Impala

54

Impala Performance Results

• Impala’s Latest Milestone:• Comparable commercial MPP DBMS speed• Natively on Hadoop

• Three Result Sets:• Impala vs Hive 0.12 (Impala 6-70x faster)• Impala vs “DBMS-Y” (Impala average of 2x faster)• Impala scalability (Impala achieves linear scale)

• Background• 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported

language)• Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)• Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks)• Methodical testing (multiple runs, reviewed fairness for competition, etc)

• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

Page 54: Incredible Impala

55

Impala vs Hive 0.12 (Lower bars are better)

Page 55: Incredible Impala

56

Impala vs “DBMS-Y” (Lower bars are better)

Page 56: Incredible Impala

57

Impala Scalability: 2x the Hardware(Expectation: Cut Response Times in Half)

Page 57: Incredible Impala

58

Impala Scalability: 2x the Hardware and 2x Users/Data(Expectation: Constant Response Times)

2x the Users, 2x the Hardware

2x the Data, 2x the Hardware

Page 58: Incredible Impala

59