qubole hadoop-summit-2013-europe

Cloud Friendly Hadoop & Hive Joydeep Sen Sarma Qubole

Upload: joydeep-sen-sarma

Post on 11-Jun-2015

TRANSCRIPT

Page 1: Qubole hadoop-summit-2013-europe

Cloud Friendly Hadoop & Hive

Joydeep Sen Sarma

Qubole

Page 2: Qubole hadoop-summit-2013-europe

Agenda

What is Qubole Data Service

Hadoop as a Service in Cloud

Hive as a Service in Cloud

Page 3: Qubole hadoop-summit-2013-europe

Qubole Data Service

AWS S3

AWS EC2

Page 4: Qubole hadoop-summit-2013-europe

Qubole Data Service

[Diagram labels: API; Sqoop, Oozie, Pig, Hive; Hadoop; AWS S3; AWS EC2]

Page 5: Qubole hadoop-summit-2013-europe

Qubole Data Service

[Diagram labels: API; Sqoop, Oozie, Pig, Hive; Hadoop; AWS S3; AWS EC2; S3://adco/logs; MySQL; Vertica]

Page 6: Qubole hadoop-summit-2013-europe

Qubole Data Service

Explore – Integrate – Analyze – Schedule

[Diagram labels: API; ODBC SDK; Sqoop, Oozie, Pig, Hive; Hadoop; AWS S3; AWS EC2; S3://adco/logs; MySQL; Vertica]

Page 8: Qubole hadoop-summit-2013-europe

Agenda

• What is Qubole Data Service

• Hadoop as a Service in Cloud

• Hive as a Service in Cloud

Page 9: Qubole hadoop-summit-2013-europe

Step 1 (Optional): Set Up Hadoop

Page 10: Qubole hadoop-summit-2013-europe

Step 2: Fire Away

AdCo Hadoop

Page 11: Qubole hadoop-summit-2013-europe

Step 2: Fire Away

select t.county, count(1)
from (select transform(a.zip) using 'geo.py' as county
      from SMALL_TABLE a) t
group by t.county;

AdCo Hadoop
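
The geo.py script itself is not shown in the deck. As a rough sketch of what such a script could look like: Hive's TRANSFORM streams each input row to the script's standard input as a tab-separated line and reads output rows back from standard output in the same form. The lookup table, the function names, and the UNKNOWN fallback below are assumptions made purely for illustration.

```python
# Hypothetical zip -> county table; the real geo.py and its data source
# are not part of the deck.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "94105": "San Francisco",
}


def county_for(zip_code):
    """Return the county for a zip code, or 'UNKNOWN' if it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def run(stdin, stdout):
    """Read one zip per tab-separated input row and write one county per output row."""
    for line in stdin:
        zip_code = line.rstrip("\n").split("\t")[0]
        stdout.write(county_for(zip_code) + "\n")


# When saved as geo.py for use with TRANSFORM, the file would end with:
#     import sys
#     run(sys.stdin, sys.stdout)
```

In Hive the script normally has to be shipped to the cluster first (add file geo.py) and is often invoked through the interpreter, for example using 'python geo.py'.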

Page 13: Qubole hadoop-summit-2013-europe

Step 2: Fire Away

select t.county, count(1)
from (select transform(a.zip) using 'geo.py' as county
      from SMALL_TABLE a) t
group by t.county;

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id = b.ad_id)
group by a.id, a.zip;

hadoop jar myapp.jar -Dmapred.min.split.size=32000000 -partitioner org.apache…

AdCo Hadoop

Page 15: Qubole hadoop-summit-2013-europe

Step 2: Fire Away

hadoop jar myapp.jar -Dmapred.min.split.size=32000000 -partitioner org.apache…

AdCo Hadoop

Page 17: Qubole hadoop-summit-2013-europe

Step 2: Fire Away

AdCo Hadoop

Page 18: Qubole hadoop-summit-2013-europe

Come back anytime

Page 19: Qubole hadoop-summit-2013-europe

Hadoop as a Service

1. Detect when cluster is required

– Not all Hive statements require cluster (EXPLAIN/SHOW/..)

2. Atomically create cluster

– Long-running process; concurrency control using MySQL

3. Shutdown when not in use

– Do on hour boundary (whose?)

– Not if User Sessions are active!
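
The three steps above can be sketched roughly as follows. This is not Qubole's implementation: the list of metadata-only statement prefixes beyond EXPLAIN and SHOW, the cluster_locks table, the function names, and the five-minute shutdown window are all assumptions, and sqlite3 stands in for MySQL only so that the example runs on its own. The "hour boundary" is taken here to mean the cluster's billing hour, which the slide leaves open.

```python
import sqlite3

# Statement prefixes assumed not to need a cluster. The slide names
# EXPLAIN and SHOW; the rest are illustrative guesses.
METADATA_ONLY = ("explain", "show", "describe", "use", "set")


def needs_cluster(statement):
    """Return True if the Hive statement likely needs a running cluster."""
    words = statement.strip().lower().split(None, 1)
    first_word = words[0] if words else ""
    return bool(first_word) and first_word not in METADATA_ONLY


def open_lock_store():
    """Create the (hypothetical) lock table; sqlite3 stands in for MySQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cluster_locks (account TEXT PRIMARY KEY)")
    return conn


def try_claim_creation(conn, account):
    """Atomically claim the right to create the cluster for an account.

    The primary-key insert succeeds for exactly one caller; any other
    caller gets an IntegrityError and should wait for the cluster instead.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cluster_locks (account) VALUES (?)", (account,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def should_shutdown(minutes_into_hour, active_sessions, running_jobs, window=5):
    """Shut down only near the end of a billing hour, and only when idle."""
    near_boundary = minutes_into_hour >= 60 - window
    return near_boundary and active_sessions == 0 and running_jobs == 0
```

Using a uniqueness constraint for the claim keeps the decision inside the database, so two query submissions arriving at the same moment cannot both start a cluster.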

Page 20: Qubole hadoop-summit-2013-europe

Hadoop as a Service

• Archive Job History/Logs to S3

– Transparent access to old jobs

• Auto-Config different node types

– Use ALL ephemeral drives for HDFS/MR

– Use right number of slots per machine

• Scrub, Scrub, Scrub

– Bad Nodes, Bad Clusters, AWS timeouts

Page 21: Qubole hadoop-summit-2013-europe

Scaling Up

[Diagram labels: StarCluster, AWS, Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks]

Page 22: Qubole hadoop-summit-2013-europe

Scaling Up

[Diagram labels: StarCluster, AWS, Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks]

insert overwrite table dest
select … from ads join campaigns on … group by …;

Page 25: Qubole hadoop-summit-2013-europe

Scaling Up

[Diagram labels: StarCluster, AWS, Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks]

insert overwrite table dest
select … from ads join campaigns on … group by …;

Progress

Page 26: Qubole hadoop-summit-2013-europe

Scaling Up

[Diagram labels: StarCluster, AWS, Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks]

insert overwrite table dest
select … from ads join campaigns on … group by …;

Progress

Demand

Supply
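
The deck does not spell out how demand and supply are compared. One plausible reading, sketched here under stated assumptions, is that demand is the work still pending (tasks times average task time) and supply is the slot capacity available within a target completion time. The function name, every parameter, and the target-time policy are hypothetical.

```python
import math


def nodes_to_add(pending_tasks, avg_task_secs, current_nodes,
                 slots_per_node, target_secs, max_nodes):
    """Estimate how many nodes to add so pending work fits in target_secs.

    Demand: slot-seconds of work still queued.
    Supply: slots needed to absorb that work within the target time,
    capped at max_nodes for the cluster.
    """
    if pending_tasks <= 0:
        return 0
    demand_slot_secs = pending_tasks * avg_task_secs
    needed_slots = math.ceil(demand_slot_secs / target_secs)
    needed_nodes = math.ceil(needed_slots / slots_per_node)
    wanted = min(needed_nodes, max_nodes)
    # Never return a negative number: shrinking is a separate decision.
    return max(0, wanted - current_nodes)
```

For example, 1000 pending tasks of 60 seconds each, with a 600-second target and 4 slots per node, call for 25 nodes; a 5-node cluster capped at 20 would add 15.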

Page 28: Qubole hadoop-summit-2013-europe

Scaling Up

[Diagram labels: StarCluster, AWS, Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks]

insert overwrite table dest
select … from ads join campaigns on … group by …;

Progress

Page 30: Qubole hadoop-summit-2013-europe

Scaling Down

1. On hour boundary – check if node is required:

– Can’t remove nodes with map-outputs (today)

– Don’t go below minimum cluster size

2. Remove node from Map-Reduce Cluster

3. Request HDFS Decommissioning – fast!

– Delete affected cache files instead of re-replicating

– One surviving replica and we are done.

4. Delete Instance
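
Step 1 of the list above can be sketched as a selection function. The Node fields, the surplus input (how many nodes the current load could spare, computed elsewhere), and the five-minute window before each node's hour boundary are assumptions; steps 2 to 4 would then be applied to each selected node in order.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    minutes_into_hour: int  # minutes since this instance's billing hour began
    has_map_outputs: bool   # still holds map outputs that reducers may fetch


def pick_nodes_to_remove(nodes, min_cluster_size, surplus, window=5):
    """Return names of nodes that may be released right now.

    A node qualifies only if it is close to its own hour boundary and
    holds no map outputs; the cluster never drops below min_cluster_size.
    """
    budget = min(surplus, len(nodes) - min_cluster_size)
    chosen = []
    for node in nodes:
        if len(chosen) >= budget:
            break
        near_boundary = node.minutes_into_hour >= 60 - window
        if near_boundary and not node.has_map_outputs:
            chosen.append(node.name)
    return chosen
```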

Page 31: Qubole hadoop-summit-2013-europe

Spot Instances

On average, 50-60% cheaper

Page 32: Qubole hadoop-summit-2013-europe

Spot Instance: Challenges

• Can lose Spot nodes anytime

– Disastrous for HDFS

– Hybrid Mode: Use mix of On-Demand and Spot

– Hybrid Mode: Keep one replica in On-Demand nodes

• Spot Instances may not be available

– Timeout and use On-Demand nodes as fallback
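
Both mitigations above can be sketched in a few lines. This is only an illustration: poll_spot and request_on_demand are hypothetical callables standing in for whatever provisioning layer is used, and the placement policy (first replica on an On-Demand node, the rest on Spot nodes where possible) is one simple way to satisfy "keep one replica in On-Demand nodes", not necessarily the one used by the service.

```python
import time


def acquire(wanted, poll_spot, request_on_demand, timeout_secs,
            poll_secs=5, now=time.monotonic, sleep=time.sleep):
    """Wait up to timeout_secs for Spot nodes, then fall back to On-Demand.

    poll_spot() returns how many Spot nodes have been granted so far;
    request_on_demand(n) asks for n On-Demand nodes. Both are hypothetical.
    Returns (spot_count, on_demand_count).
    """
    deadline = now() + timeout_secs
    granted = poll_spot()
    while granted < wanted and now() < deadline:
        sleep(poll_secs)
        granted = poll_spot()
    spot = min(granted, wanted)
    shortfall = wanted - spot
    if shortfall:
        request_on_demand(shortfall)
    return spot, shortfall


def place_replicas(on_demand_nodes, spot_nodes, replication=3):
    """Choose replica targets so at least one copy is on an On-Demand node."""
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    targets = [on_demand_nodes[0]]
    others = spot_nodes + on_demand_nodes[1:]
    targets.extend(others[:replication - 1])
    return targets
```

With one replica pinned to an On-Demand node, losing every Spot node at once still leaves a surviving copy of each block.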

Page 33: Qubole hadoop-summit-2013-europe

Agenda

What is Qubole Data Service

Hadoop as a Service in Cloud

Hive as a Service in Cloud

Page 34: Qubole hadoop-summit-2013-europe

Query History/Results

Page 35: Qubole hadoop-summit-2013-europe

Cheap to Test

Evaluate expressions on sample data

Page 36: Qubole hadoop-summit-2013-europe

Cheap to Test

Run Query on Sample

Page 37: Qubole hadoop-summit-2013-europe

Fastest Hive SaaS

• Works with Small Files!

– Faster Split Computation (8x)

– Prefetching S3 files (30%)

Page 38: Qubole hadoop-summit-2013-europe

Fastest Hive SaaS

• Works with Small Files!

– Faster Split Computation (8x)

– Prefetching S3 files (30%)

• Stable JVM Reuse!

– Fix re-entrancy issues

– 1.2-2x speedup

Page 39: Qubole hadoop-summit-2013-europe

Fastest Hive SaaS

• Works with Small Files!

– Faster Split Computation (8x)

– Prefetching S3 files (30%)

• Direct writes to S3

– HIVE-1620

• Stable JVM Reuse!

– Fix re-entrancy issues

– 1.2-2x speedup

Page 40: Qubole hadoop-summit-2013-europe

Fastest Hive SaaS

• Works with Small Files!

– Faster Split Computation (8x)

– Prefetching S3 files (30%)

• Direct writes to S3

– HIVE-1620

• Stable JVM Reuse!

– Fix re-entrancy issues

– 1.2-2x speedup

• Columnar Cache

– Use HDFS as cache for S3

– Up to 5x faster for JSON data

Page 41: Qubole hadoop-summit-2013-europe

Fastest Hive SaaS

• Works with Small Files!

– Faster Split Computation (8x)

– Prefetching S3 files (30%)

• Direct writes to S3

– HIVE-1620

• NEW – Multi-Tenant Hive Server

• Stable JVM Reuse!

– Fix re-entrancy issues

– 1.2-2x speedup

• Columnar Cache

– Use HDFS as cache for S3

– Up to 5x faster for JSON data

Page 42: Qubole hadoop-summit-2013-europe

Questions?

@Qubole

Free Trial: www.qubole.com