Cloudera Impala - HUG Karlsruhe, July 04, 2013

Cloudera Impala: Real Time Query for HDFS and HBase. Alexander Alten-Lorenz, Cloudera INC. Thursday, July 4, 2013

Upload: alexander-alten-lorenz

Post on 26-Jan-2015

DESCRIPTION

Low-latency data processing with Impala. Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala uses the same metadata, SQL syntax (Hive SQL), JDBC driver, and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for both batch-oriented and real-time queries.

TRANSCRIPT

Page 1: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Cloudera Impala
Real Time Query for HDFS and HBase

Alexander Alten-Lorenz, Cloudera INC

Thursday, July 4, 13

Page 2: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Beyond Batch

What is Impala

Capability

Architecture

Demo

Page 3: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Beyond Batch

- For some things MapReduce is just too slow
- Apache Hive:
  - MapReduce execution engine
  - High-latency, low throughput
  - High runtime overhead
- Google realized this early on
  - Analysts wanted fast, interactive results

Page 4: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Dremel

- Google paper (2010)
  - “scalable, interactive ad-hoc query system for analysis of read-only nested data”
- Columnar storage format
- Distributed scalable aggregation
  - “capable of running aggregation queries over trillion-row tables in seconds”
- http://research.google.com/pubs/pub36632.html
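
The columnar storage point on this slide can be illustrated with a small Python sketch. It is not Dremel's or Impala's actual format (Dremel additionally encodes nested records, which is omitted here); the table contents and the helper name `to_columns` are made up for the example. It only shows why an aggregation over one field needs to read less data when values are grouped by column.

```python
# Toy illustration of row-oriented vs. column-oriented layout.
# All table contents and names here are made up for the example.

rows = [
    {"user": "a", "country": "DE", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "DE", "bytes": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every row (with all of its fields) is visited to sum one field.
total_from_rows = sum(record["bytes"] for record in rows)

# Column layout: only the "bytes" column is read.
total_from_columns = sum(columns["bytes"])

print(total_from_rows, total_from_columns)
```

Both sums are equal; the difference is how much unrelated data has to be scanned to compute them.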

Page 5: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Goals

- General-purpose SQL query engine for Hadoop
  - For analytical and transactional workloads
  - Support queries that take μs to hours
- Run directly with Hadoop
  - Collocated daemons
  - Same file formats
  - Same storage managers (NN, metastore)

Page 6: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Goals

- High performance
  - C++
  - Runtime code generation (LLVM)
  - Direct access to data (no MapReduce)
- Retain user experience
  - Easy for Hive users to migrate
- 100% open-source

Page 7: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Capability

- HiveQL (subset of SQL92)
  - select, project, join, union, subqueries, aggregation, insert, alter, order by (with limit)
  - DDL
- Directly queries data in HDFS & HBase
  - Text files (compressed)
  - Sequence files (snappy/gzip)
  - Avro & Parquet
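
To make the supported query shape concrete, here is a hedged sketch of a join with aggregation and an ORDER BY that carries a LIMIT, as the slide lists. It runs against Python's built-in `sqlite3` module purely as a local stand-in so the example is executable; it does not connect to Impala, and the table and column names are invented.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, country TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'DE'), (2, 'US'), (3, 'DE');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5);
""")

# Join + aggregation + ORDER BY with LIMIT (hypothetical tables).
query = """
    SELECT c.country, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.country
    ORDER BY total DESC
    LIMIT 10
"""
top_countries = conn.execute(query).fetchall()
print(top_countries)
```

The ORDER BY is paired with a LIMIT because the slide lists "order by (with limit)" as the supported form.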

Page 8: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Capability

- Familiar and unified platform
  - Uses Hive’s metastore
  - Submit queries via ODBC | Beeswax Thrift API
- Query is distributed to nodes with relevant data
- Process-to-process data exchange
- Kerberos authentication
- No fault tolerance
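
The idea that a query is distributed to nodes holding the relevant data can be sketched as follows. This is an illustrative toy, not Impala's scheduler: the function `assign_blocks`, the block IDs, and the node names are all hypothetical, and the balancing rule (fewest blocks so far) is an assumption chosen for simplicity.

```python
def assign_blocks(block_locations):
    """Assign each block to one node that stores a replica of it.

    Among the replica holders, prefer the node with the fewest blocks
    assigned so far; ties are broken by node name.
    """
    load = {}
    assignment = {}
    for block in sorted(block_locations):
        replicas = sorted(block_locations[block])
        node = min(replicas, key=lambda n: load.get(n, 0))
        assignment[block] = node
        load[node] = load.get(node, 0) + 1
    return assignment


# Hypothetical block-to-replica map, as a name service might report it.
block_locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node1", "node3"],
    "blk_3": ["node2", "node3"],
    "blk_4": ["node1", "node2"],
}
plan = assign_blocks(block_locations)
print(plan)
```

Every block is scanned by a node that already stores it, so the scan itself needs no network transfer; only intermediate results are exchanged between processes.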

Page 9: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Performance

- Greater disk throughput
  - ~100 MB/sec/disk
  - I/O-bound workloads faster by 3-4x
- Queries that require multiple MapReduce phases in Hive are significantly faster in Impala (up to 45x)
- Queries that run against in-memory cached data see a significant speedup (up to 90x)
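
As a back-of-the-envelope check on the per-disk figure above, the sketch below estimates how long a full scan would take if every disk sustained about 100 MB/sec and the data were spread evenly. The cluster size and data volume are hypothetical, and the function `scan_seconds` is only an illustration; real queries add planning, CPU, and network costs.

```python
def scan_seconds(data_gb, nodes, disks_per_node, mb_per_sec_per_disk=100):
    """Estimate seconds to scan data_gb spread evenly over all disks."""
    total_mb = data_gb * 1000  # decimal units: 1 GB = 1000 MB
    aggregate_mb_per_sec = nodes * disks_per_node * mb_per_sec_per_disk
    return total_mb / aggregate_mb_per_sec


# Hypothetical cluster: 10 nodes with 12 disks each, scanning 1 TB.
estimate = scan_seconds(data_gb=1000, nodes=10, disks_per_node=12)
print(round(estimate, 1))
```

Under these assumptions the aggregate scan rate is 12,000 MB/sec, which is why an I/O-bound query benefits directly from reading at full disk speed on every node.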

Page 10: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Architecture

- impalad
  - Runs on every node
  - Handles client requests (ODBC, Thrift)
  - Handles query planning & execution
- statestored
  - Provides name service
  - Metadata distribution
  - Used for finding data

Page 11: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Architecture

Page 12: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Architecture

Page 13: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Architecture

Page 14: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Impala: Architecture

Page 15: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Current limitations

- 1.0.1 (available since May 2013)
- No SerDes
- No user-defined functions (UDFs)
- impalad daemons read the metastore at startup
  - Metadata refresh via the command line
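
The metadata limitation can be pictured with a small sketch: a daemon that snapshots the metastore once at startup does not see tables created later until a refresh is triggered. The class `MetadataCache` and the dictionary standing in for the metastore are hypothetical; this does not reproduce impalad's actual implementation.

```python
class MetadataCache:
    """Toy stand-in for a daemon that snapshots table metadata at startup."""

    def __init__(self, metastore):
        self._metastore = metastore
        self._tables = dict(metastore)  # snapshot taken once, at startup

    def tables(self):
        """Return the table names the daemon currently knows about."""
        return sorted(self._tables)

    def refresh(self):
        """Reload the snapshot; nothing changes until this is called."""
        self._tables = dict(self._metastore)


metastore = {"logs": ["ts", "msg"]}  # hypothetical metastore contents
cache = MetadataCache(metastore)

metastore["clicks"] = ["ts", "url"]  # table created after startup
before = cache.tables()              # stale view
cache.refresh()                      # explicit refresh
after = cache.tables()               # up-to-date view
print(before, after)
```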

Page 16: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Futures

- DDL support (CREATE)
- Rudimentary cost-based optimizer (CBO)
- Metadata distribution through statestored
- Columnar storage format like Dremel’s
  - Impala + Parquet = Dremel superset

Page 17: Cloudera Impala - HUG Karlsruhe, July 04, 2013

Demo

[email protected]@cloudera.com

Twitter: @mapredit
Blog: mapredit.blogspot.com

Web: http://goo.gl/7sxdp
