cloudera impala

Cloudera Impala Portland Big Data User Group, July 2014 Alex Moundalexis @technmsg

Thirty Seconds About Alex

•  Solutions Architect
•  aka consultant
•  government
•  infrastructure

•  former coder of Perl
•  former administrator
•  fan of Portland

What Does Cloudera Do?

•  product
•  distribution of Hadoop components, Apache licensed
•  enterprise tooling

•  support
•  training
•  services (aka consulting)
•  community

Disclaimer

•  Cloudera builds software
•  most donated to Apache
•  some closed-source

•  Cloudera “products” I reference are open source
•  Apache Licensed
•  source code is on GitHub
•  https://github.com/cloudera

What This Talk Isn’t About

•  deploying
•  Puppet, Chef, Ansible, homegrown scripts, intern labor

•  sizing & tuning
•  depends heavily on data and workload

•  coding
•  unless you count XML or CSV or SQL

•  algorithms

Public Domain IFCAR

CC BY-SA Lilian De Cassai

cloud·e·ra im·pal·a

/kloudˈi(ə)rə imˈpalə/ noun

a modern, open source, MPP SQL query engine for Apache Hadoop. “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

Quick and dirty, for context.

The Apache Hadoop Ecosystem

Why “Ecosystem?”

•  In the beginning, just Hadoop
•  HDFS
•  MapReduce

•  Today, dozens of interrelated components
•  I/O
•  Processing
•  Specialty Applications
•  Configuration
•  Workflow

•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
•  http://research.google.com/archive/gfs.html

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]

MapReduce (MR)

•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
•  http://research.google.com/archive/mapreduce.html

Under the Covers

You specify map() and reduce() functions.

The framework does the rest.
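To make the division of labor concrete, here is a minimal single-process sketch of a word count. The names map_fn, reduce_fn and run_job are made up for this illustration; run_job only stands in for the framework (grouping by key, then reducing) and is not Hadoop's API.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce step: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: map, group by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(line) for line in lines):
        grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


word_counts = run_job(["impala is fast", "hadoop is big", "impala is sql"])
print(word_counts)
```

Only map_fn and reduce_fn are "yours" here; on a real cluster the grouping, scheduling and retries happen across many machines.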

Apache Hive

•  Abstraction of Hadoop’s Java API
•  HiveQL “compiles” down to MR
•  a “SQL-like” language
•  Eases analysis using MapReduce
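A rough way to picture “compiles down to MR”: a query such as SELECT employer, COUNT(*) FROM people WHERE year < 2012 GROUP BY employer can be split into a map step (filter and pick the grouping key) and a reduce step (aggregate per key). The sketch below is an assumption-laden toy, not the plan Hive actually generates; the rows and function names are invented for illustration.

```python
from collections import defaultdict

# Illustrative rows; the column names are assumptions made for this sketch.
ROWS = [
    {"name": "Alex", "employer": "Cloudera", "year": 2013},
    {"name": "Joey", "employer": "Cloudera", "year": 2011},
    {"name": "Sean", "employer": "Cloudera", "year": 2013},
    {"name": "Paris", "employer": "AOL", "year": 2011},
]


def map_row(row):
    """The WHERE filter and the choice of GROUP BY key happen in the map step."""
    if row["year"] < 2012:
        yield row["employer"], 1


def reduce_group(employer, ones):
    """COUNT(*) for one GROUP BY key happens in the reduce step."""
    return employer, sum(ones)


def run_query(rows):
    """Toy driver: map every row, group by key, reduce each group."""
    shuffled = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            shuffled[key].append(value)
    return sorted(reduce_group(key, values) for key, values in shuffled.items())


result = run_query(ROWS)
print(result)
```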

Apache Hive Metastore

•  Maps HDFS files to DB-like resources
•  Databases
•  Tables
•  Column/field names, data types
•  Roles/users
•  InputFormat/OutputFormat
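One way to think about the metastore is as a lookup from (database, table) to a file location plus a schema. The sketch below is a hypothetical in-memory model, not the Hive metastore API; TableMeta, ToyMetastore, the HDFS path and the format strings are all placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TableMeta:
    """What a metastore entry conceptually records for one table."""
    database: str
    name: str
    location: str        # where the files live in HDFS (hypothetical path below)
    columns: list        # (column name, data type) pairs
    input_format: str    # how to read the files
    output_format: str   # how to write the files


class ToyMetastore:
    """Hypothetical stand-in: maps (database, table) to its metadata."""

    def __init__(self):
        self._tables = {}

    def register(self, table):
        self._tables[(table.database, table.name)] = table

    def lookup(self, database, name):
        return self._tables[(database, name)]


metastore = ToyMetastore()
metastore.register(TableMeta(
    database="default",
    name="people",
    location="hdfs://namenode/example/warehouse/people",  # placeholder path
    columns=[("name", "string"), ("state", "string"),
             ("employer", "string"), ("year", "int")],
    input_format="text",   # placeholder label, not a real class name
    output_format="text",  # placeholder label, not a real class name
))

people = metastore.lookup("default", "people")
print(people.location, [col for col, _ in people.columns])
```

A query engine that shares this lookup can find and interpret the same files without copying them.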

But wait… WHY DO WE NEED THIS?

I am not a SQL wizard by any means…

Super Shady SQL Supplement

A Simple Relational Database

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

Interacting with Relational Data

> SELECT * FROM people;

Requesting Specific Fields

> SELECT name, state FROM people;

Requesting Specific Rows

> SELECT name, state FROM people WHERE year < 2012;
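To try these three queries without a cluster, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in SQL engine. SQLite is an assumption of convenience here, not Impala or Hive; the column types are guesses based on the table shown earlier.

```python
import sqlite3

# In-memory database holding the "people" table from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Everything: all rows, all columns.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Specific fields: all rows, two columns.
name_state = conn.execute("SELECT name, state FROM people").fetchall()

# Specific rows: only people whose year is before 2012.
before_2012 = conn.execute(
    "SELECT name, state FROM people WHERE year < 2012"
).fetchall()

for label, rows in [("all", all_rows), ("fields", name_state), ("rows", before_2012)]:
    print(label, rows)
```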

Two Simple Tables

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

Joining Two Tables

> SELECT people.name AS owner, people.state AS state, pets.name AS pet FROM people LEFT JOIN pets ON people.name = pets.owner;

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland

Varying Implementation of JOIN

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
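The join can be reproduced with Python's built-in sqlite3 module, again only as a stand-in for a real engine. One assumption is made loudly here: the blank pet names for Sean and Paris are stored as SQL NULL. How a missing value is shown (NULL, empty string, something else) is where engines and clients may differ, which is what the “?” cells hint at.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)")
conn.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        ("Alex", "Cactus", "Marvin"),
        ("Joey", "Cat", "Brain"),
        ("Sean", "None", None),      # assumption: blank pet name stored as NULL
        ("Paris", "Unknown", None),  # assumption: blank pet name stored as NULL
    ],
)

# The query from the slide: every person is kept, with a pet name if there is one.
joined = conn.execute(
    "SELECT people.name AS owner, people.state AS state, pets.name AS pet "
    "FROM people LEFT JOIN pets ON people.name = pets.owner"
).fetchall()

pet_by_owner = {owner: pet for owner, state, pet in joined}
print(pet_by_owner)
```

In this setup the missing names come back to Python as None; another engine or client may render them differently.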

Familiar interface, but more powerful.

Cloudera Impala

•  Interactive query on Hadoop
•  think seconds, not minutes

•  Nearly ANSI-92 standard SQL
•  compatible with HiveQL

•  Native MPP query engine
•  built for low-latency queries

Cloudera Impala – Design Choices

•  Native daemons, written in C/C++
•  No JVM, no MapReduce
•  Saturate disks on reads
•  Uses in-memory HDFS caching

•  Re-uses Hive metastore
•  Not as fault-tolerant as MapReduce

Cloudera Impala – Architecture

•  Impala Daemon
•  runs on every node
•  handles client requests
•  handles query planning & execution

•  State Store Daemon
•  provides name service
•  metadata distribution
•  used for finding data

Impala Query Execution

[Diagram labels: SQL App (ODBC); Hive Metastore, HDFS NN, Statestore; on each node: Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]

1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
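The flow above can be mimicked in a few lines. This is a toy, single-process simulation under heavy assumptions: BLOCKS, plan, execute_fragment and coordinate are invented names, the "nodes" are dictionary keys, and generators stand in for streaming. It is not how Impala's planner or daemons are implemented.

```python
# Hypothetical cluster layout: which node holds which (name, year) rows.
BLOCKS = {
    "node1": [("Alex", 2013), ("Joey", 2011)],
    "node2": [("Sean", 2013)],
    "node3": [("Paris", 2011)],
}


def plan(year_limit):
    """Planner: turn the request into one scan fragment per node holding data."""
    return [{"node": node, "year_limit": year_limit} for node in BLOCKS]


def execute_fragment(fragment):
    """Executor: scan only the rows local to this node, yielding matches as found."""
    for name, year in BLOCKS[fragment["node"]]:
        if year < fragment["year_limit"]:
            yield name


def coordinate(year_limit):
    """Coordinator: run each fragment where its data lives, stream results onward."""
    for fragment in plan(year_limit):
        yield from execute_fragment(fragment)


# Roughly: SELECT name FROM people WHERE year < 2012
results = sorted(coordinate(2012))
print(results)
```

The point of the sketch is the shape: work is sent to where the data is, and rows flow back as they are produced rather than after a batch job finishes.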

Cloudera Impala – Results

•  Allows for fast iteration/discovery
•  How much faster?

•  3-4x faster on I/O bound workloads
•  up to 45x faster on multi-MR queries
•  up to 90x faster on in-memory cache

Hold onto something, folks.

What’s Next?

•  Download Hadoop!
•  CDH available at www.cloudera.com
•  Already done that? Contribute…

•  Cloudera provides pre-loaded VMs
•  http://tiny.cloudera.com/quickstartvm

•  Clone our repos!
•  https://github.com/cloudera

PORTLAND Special thanks:

Questions?

Preferably related to the talk… or not.

Thank You! Alex Moundalexis @technmsg We’re hiring, kids! Well, not kids.
