Lower TCO - Running BI Projects with Impala

Cloudera Impala
Portland Big Data User Group, July 2014
Alex Moundalexis @technmsg

Thirty Seconds About Alex

•  Solutions Architect
    •  aka consultant
    •  government
    •  infrastructure

•  former coder of Perl
•  former administrator
•  fan of Portland

What Does Cloudera Do?

•  product
    •  distribution of Hadoop components, Apache licensed
    •  enterprise tooling

•  support
•  training
•  services (aka consulting)
•  community

Disclaimer

•  Cloudera builds software
    •  most donated to Apache
    •  some closed-source

•  Cloudera “products” I reference are open source
    •  Apache Licensed
    •  source code is on GitHub
        •  https://github.com/cloudera

What This Talk Isn’t About

•  deploying
    •  Puppet, Chef, Ansible, homegrown scripts, intern labor

•  sizing & tuning
    •  depends heavily on data and workload

•  coding
    •  unless you count XML or CSV or SQL

•  algorithms

Image: Public Domain, IFCAR

Image: CC BY-SA, Lilian De Cassai

cloud·e·ra im·pal·a

/kloudˈi(ə)rə imˈpalə/ noun

a modern, open source, MPP SQL query engine for Apache Hadoop. “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

Quick and dirty, for context.

The Apache Hadoop Ecosystem

Why “Ecosystem?”

•  In the beginning, just Hadoop
    •  HDFS
    •  MapReduce

•  Today, dozens of interrelated components
    •  I/O
    •  Processing
    •  Specialty Applications
    •  Configuration
    •  Workflow

HDFS

•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
    •  http://research.google.com/archive/gfs.html

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]

MapReduce (MR)

•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
    •  http://research.google.com/archive/mapreduce.html

Under the Covers

You specify map() and reduce() functions. The framework does the rest.
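To make that division of labor concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: `map_fn` and `reduce_fn` are the user-supplied pieces for a word count, and `run_mapreduce` is a hypothetical helper that stands in for the framework by grouping intermediate pairs by key before reducing.

```python
from collections import defaultdict


def map_fn(line):
    """User-supplied map(): emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-supplied reduce(): collapse all counts for one word into a total."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["impala runs on hadoop", "hive runs on hadoop too"]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the grouping step happens across machines, which is a large part of what "the rest" covers; this sketch only shows the shape of the two functions.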

Apache Hive

•  Abstraction of Hadoop’s Java API
•  HiveQL “compiles” down to MR
    •  a “SQL-like” language
•  Eases analysis using MapReduce

Apache Hive Metastore

•  Maps HDFS files to DB-like resources
    •  Databases
    •  Tables
    •  Column/field names, data types
    •  Roles/users
    •  InputFormat/OutputFormat
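As a rough illustration of what such a mapping holds, the following Python sketch models one catalog record. The `TableEntry` class, the `CATALOG` dictionary, and the `resolve` function are hypothetical names for this example, and the path and schema are illustrative; this is not the metastore's actual API or storage layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """One catalog record: what a query engine needs to read a table."""
    database: str
    name: str
    columns: tuple       # (column name, data type) pairs
    location: str        # HDFS directory holding the data files
    input_format: str    # Java class used to read the files
    output_format: str   # Java class used to write the files


# Hypothetical catalog with a single entry; path and schema are illustrative.
CATALOG = {
    ("default", "people"): TableEntry(
        database="default",
        name="people",
        columns=(("name", "string"), ("state", "string"),
                 ("employer", "string"), ("year", "int")),
        location="/user/hive/warehouse/people",
        input_format="org.apache.hadoop.mapred.TextInputFormat",
        output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    )
}


def resolve(database, table):
    """Look up where a table lives and how its columns are typed."""
    return CATALOG[(database, table)]


entry = resolve("default", "people")
print(entry.location, [col for col, _ in entry.columns])
```

The point of the sketch is that a table name alone is enough to find files, a schema, and a file format, which is what lets SQL-style tools treat HDFS directories as tables.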

But wait… WHY DO WE NEED THIS?

I am not a SQL wizard by any means…

Super Shady SQL Supplement

A Simple Relational Database

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

Interacting with Relational Data

> SELECT * FROM people;

Requesting Specific Fields

> SELECT name, state FROM people;

Requesting Specific Rows

> SELECT name, state FROM people WHERE year < 2012;
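For readers who want to try these queries without a cluster, here is a small sketch that loads the four sample rows into an in-memory SQLite database using Python's standard `sqlite3` module. SQLite is only a stand-in assumed for this example; the column types are a guess, and the same statements would be typed into a SQL shell on a real system.

```python
import sqlite3

# In-memory database used purely as a local stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Only two columns (a projection).
name_state = conn.execute("SELECT name, state FROM people").fetchall()

# Only rows matching a condition (a filter).
before_2012 = conn.execute(
    "SELECT name, state FROM people WHERE year < 2012"
).fetchall()

for label, rows in [("all", all_rows), ("fields", name_state), ("rows", before_2012)]:
    print(label, rows)
```

The last query keeps only the two people whose year is 2011, Joey and Paris; row order is not guaranteed without an ORDER BY clause.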

Two Simple Tables

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

Joining Two Tables

> SELECT people.name AS owner, people.state AS state, pets.name AS pet
  FROM people LEFT JOIN pets ON people.name = pets.owner;


owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland

Varying Implementation of JOIN

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
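The join can be reproduced locally with the sketch below, again using Python's `sqlite3` module as a stand-in engine. Two things are assumptions here: the missing pet names for Sean and Paris are stored as NULL, and the "?" cells are read as "however the client chooses to display a missing value". An ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER);
    CREATE TABLE pets (owner TEXT, species TEXT, name TEXT);

    INSERT INTO people VALUES
        ('Alex',  'Maryland', 'Cloudera', 2013),
        ('Joey',  'Maryland', 'Cloudera', 2011),
        ('Sean',  'Texas',    'Cloudera', 2013),
        ('Paris', 'Maryland', 'AOL',      2011);

    -- Assumption: a missing pet name is stored as NULL.
    INSERT INTO pets VALUES
        ('Alex',  'Cactus',  'Marvin'),
        ('Joey',  'Cat',     'Brain'),
        ('Sean',  'None',    NULL),
        ('Paris', 'Unknown', NULL);
""")

# LEFT JOIN keeps every row from people, matched or not.
joined = conn.execute("""
    SELECT people.name AS owner, people.state AS state, pets.name AS pet
    FROM people LEFT JOIN pets ON people.name = pets.owner
    ORDER BY owner
""").fetchall()

for owner, state, pet in joined:
    # Python receives NULL as None; printing "?" is this client's choice.
    print(owner, state, pet if pet is not None else "?")
```

Other clients might show the same missing value as an empty cell or the word NULL, which is one plausible reading of why the two result tables differ.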

Familiar interface, but more powerful.

Cloudera Impala

Cloudera Impala

•  Interactive query on Hadoop
    •  think seconds, not minutes

•  Nearly ANSI-92 standard SQL
    •  compatible with HiveQL

•  Native MPP query engine
    •  built for low-latency queries

Cloudera Impala – Design Choices

•  Native daemons, written in C/C++
•  No JVM, no MapReduce
•  Saturate disks on reads
•  Uses in-memory HDFS caching

•  Re-uses Hive metastore
•  Not as fault-tolerant as MapReduce

Cloudera Impala – Architecture

•  Impala Daemon
    •  runs on every node
    •  handles client requests
    •  handles query planning & execution

•  State Store Daemon
    •  provides name service
    •  metadata distribution
    •  used for finding data

Impala Query Execution

[Diagram: a SQL App connects via ODBC; Hive Metastore, HDFS NN, and Statestore sit alongside three nodes, each with a Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase]

1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
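The flow can be pictured with a toy scatter-gather sketch in Python. Everything here is hypothetical and greatly simplified: `NODE_DATA`, `plan`, and `execute` are invented names, threads stand in for separate daemons, and the exchange of intermediate results between nodes is not modeled. It shows only the idea of splitting a request into fragments, running each next to its data, and merging the results for the client.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data layout: each "node" holds a local slice of the people table.
NODE_DATA = {
    "node1": [("Alex", "Maryland", "Cloudera", 2013)],
    "node2": [("Joey", "Maryland", "Cloudera", 2011),
              ("Sean", "Texas", "Cloudera", 2013)],
    "node3": [("Paris", "Maryland", "AOL", 2011)],
}


def plan(max_year):
    """Planner: turn the request into one plan fragment per node holding data."""
    def fragment(rows):
        # Equivalent of: SELECT name, state FROM people WHERE year < max_year
        return [(name, state)
                for name, state, _employer, year in rows
                if year < max_year]
    return {node: fragment for node in NODE_DATA}


def execute(fragments):
    """Coordinator: run each fragment on its node's local rows, then merge."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fragment, NODE_DATA[node])
                   for node, fragment in fragments.items()]
        partials = [future.result() for future in futures]
    # Sorting only makes the merged output deterministic for this example.
    return sorted(row for part in partials for row in part)


result = execute(plan(max_year=2012))
print(result)
```

Each fragment touches only the rows stored on its own node, so the work scales with the number of nodes rather than funneling all data through one machine.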

Cloudera Impala – Results

•  Allows for fast iteration/discovery
•  How much faster?
    •  3-4x faster on I/O bound workloads
    •  up to 45x faster on multi-MR queries
    •  up to 90x faster on in-memory cache

Hold onto something, folks.

Demo

What’s Next?

•  Download Hadoop!
    •  CDH available at www.cloudera.com
    •  Already done that? Contribute…

•  Cloudera provides pre-loaded VMs
    •  http://tiny.cloudera.com/quickstartvm

•  Clone our repos!
    •  https://github.com/cloudera

PORTLAND Special thanks:

Preferably related to the talk… or not.

Questions?

Thank You!
Alex Moundalexis @technmsg
We’re hiring, kids! Well, not kids.
