Towards Automatic Optimization of MapReduce Programs (Position Paper)


Page 1: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Towards Automatic Optimization of MapReduce Programs

(Position Paper)

Shivnath Babu

Duke University

Page 2: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Roadmap

• Call to action to improve automatic optimization techniques in MapReduce frameworks

• Challenges & promising directions

[Figure: software stack with Pig, Hive, JAQL, … on top of Hadoop and HDFS]

Page 3: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as a MapReduce job
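To make the two user-supplied pieces concrete, here is a minimal single-process sketch of a word-count job. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and `run_job` only imitates the map, group-by-key, and reduce steps that a framework would distribute across a cluster.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(_key: int, line: str) -> Iterator[tuple[str, int]]:
    """Map function: emit (word, 1) for every word in one input record."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce function: sum all counts emitted for one key."""
    return word, sum(counts)


def run_job(records: list[str]) -> dict[str, int]:
    """Single-process stand-in for a framework run: map, group by key, reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for offset, line in enumerate(records):
        for key, value in map_fn(offset, line):
            groups[key].append(value)  # group intermediate values by key
    # Visit keys in sorted order, then apply the reduce function to each group.
    return dict(reduce_fn(key, groups[key]) for key in sorted(groups))


if __name__ == "__main__":
    print(run_job(["map reduce map", "reduce job"]))
```

In a real framework the programmer writes only the two functions; everything inside `run_job` (splitting, scheduling, shuffling, sorting) is what the configuration parameters discussed on the following slides control.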


Page 5: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Lifecycle of a MapReduce Job

[Figure: job timeline (axis: Time) showing Input Splits, Map Wave 1, Map Wave 2, Reduce Wave 1, and Reduce Wave 2]

How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
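As a rough illustration of where waves come from, the arithmetic can be sketched as follows. The sketch assumes one map task per input split and that tasks run in ceil(tasks / slots) waves; the input size, split size, and slot counts are hypothetical values, not figures from the talk, and real frameworks apply additional rules when computing splits.

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2


def num_splits(input_bytes: int, split_bytes: int) -> int:
    """Approximate number of input splits (assumed: one map task per split)."""
    return math.ceil(input_bytes / split_bytes)


def num_waves(tasks: int, slots: int) -> int:
    """Waves needed when at most `slots` tasks can run concurrently."""
    return math.ceil(tasks / slots)


if __name__ == "__main__":
    # Hypothetical job: 10 GB of input, 64 MB splits, 40 map slots,
    # 30 reduce tasks on 20 reduce slots.
    splits = num_splits(10 * GB, 64 * MB)
    print("map tasks:", splits)
    print("map waves:", num_waves(splits, 40))
    print("reduce waves:", num_waves(30, 20))
```

Changing the split size or the number of reduce tasks changes the wave counts, which is one reason these settings affect running time.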

Page 6: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Job Configuration Parameters

• 190+ parameters in Hadoop

• Set manually or defaults are used

• Are defaults or rules-of-thumb good enough?
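For readers unfamiliar with how such parameters are set manually, one common route is Hadoop's generic `-D property=value` command-line option. The helper below is a small sketch that renders a dictionary of settings into such a command line. The jar name, class name, and job arguments are hypothetical, and the sketch assumes the job's driver parses generic options (for example by running through `ToolRunner`).

```python
def hadoop_command(jar, main_class, settings, job_args):
    """Build a `hadoop jar` argument list with one -D option per setting."""
    cmd = ["hadoop", "jar", jar, main_class]
    for name, value in sorted(settings.items()):
        cmd += ["-D", f"{name}={value}"]  # generic options precede job arguments
    return cmd + list(job_args)


if __name__ == "__main__":
    settings = {
        "mapred.reduce.tasks": 28,
        "io.sort.factor": 10,
        "io.sort.record.percent": 0.15,
    }
    # "my-job.jar", "com.example.MyJob", "in", and "out" are placeholders.
    print(" ".join(hadoop_command("my-job.jar", "com.example.MyJob", settings, ["in", "out"])))
```

Any parameter left out of `settings` keeps its default, which is the situation the last bullet questions.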

Page 7: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Experiments

On EC2 and local clusters

[Figure: four plots; y-axis labels: Running time (seconds) on two plots, Running time (minutes) on two plots]

Page 8: Towards Automatic Optimization of MapReduce Programs (Position Paper)

• Performance at default and rule-of-thumb settings can be poor

• Cross-parameter interactions are significant

Illustrative Result: 50GB Terasort, 17-node cluster, 64+32 concurrent map+reduce slots

mapred.reduce.tasks   io.sort.factor   io.sort.record.percent
10                    10               0.15
10                    500              0.15
28                    10               0.15
300                   10               0.15
300                   500              0.15

[Figure: running time for each setting; one setting is annotated "Based on popular rule-of-thumb"]

Page 9: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Problem Space

Current approaches:
• Predominantly manual
• Post-mortem analysis

Job configuration parameters

Declarative HiveQL/Pig operations

Multi-job workflows

Performance objectives

Cost in pay-as-you-go environment

Energy considerations

[Figure: axis labels "Complexity" and "Space of execution choices"]

Is this where we want to be?

Page 10: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Can DB Query Optimization Technology Help?

[Figure: a Query passes through an Optimizer (Enumerate, Cost, Search) that yields a "Good plan" for the Database Execution Engine, which produces Results; by analogy, a MapReduce job needs a "Good setting of parameters" for Hadoop to produce Results]

But:

– MapReduce jobs are not declarative

– No schema about the data

– Impact of concurrent jobs & scheduling?

– Space of parameters is huge

Can we:

– Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years

• Or innovate!

– Exploit design & usage properties of MapReduce frameworks
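The Enumerate, Cost, Search loop of a conventional optimizer can be sketched for parameter settings as below. This is an illustration under loud assumptions, not a method from the talk: the candidate values are hypothetical, and `toy_cost` is an arbitrary bowl-shaped function chosen only so that the search has a unique minimum. It does not model Hadoop's behavior in any way; a real optimizer would need a cost model or measurements in its place.

```python
from itertools import product

# Hypothetical candidate values for two of the parameters named in the talk.
SEARCH_SPACE = {
    "mapred.reduce.tasks": [8, 16, 32, 64],
    "io.sort.factor": [10, 50, 100],
}


def enumerate_settings(space):
    """Enumerate: yield every combination of candidate parameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def toy_cost(setting):
    """Cost: ARBITRARY stand-in function, not a model of real running time."""
    return (setting["mapred.reduce.tasks"] - 32) ** 2 + (setting["io.sort.factor"] - 50) ** 2


def search(space, cost):
    """Search: exhaustive scan for the setting with the lowest estimated cost."""
    return min(enumerate_settings(space), key=cost)


if __name__ == "__main__":
    print(search(SEARCH_SPACE, toy_cost))
```

Exhaustive enumeration is only feasible here because the space has 12 points; with 190+ parameters the space is far too large, which is the last objection in the "But" list.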

Page 11: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Spectrum of Query Optimizers

Conventional Optimizers:
• Rule-based
• Cost models + statistics about data

AT’s Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks

Insight: Predictability(RBO) >> Predictability(CBO)

Page 12: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Spectrum of Query Optimizers

Conventional Optimizers:
• Rule-based
• Cost models + statistics about data

AT’s Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks

Insight: Predictability(RBO) >> Predictability(CBO)

Learning Optimizers (learn from execution & adapt)

Tuning Optimizers (proactively try different plans)

Page 13: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Spectrum of Query Optimizers

Conventional Optimizers:
• Rule-based
• Cost models + statistics about data

Learning Optimizers (learn from execution & adapt)

Exploit usage & design properties of MapReduce frameworks:

• High ratio of repeated jobs to new jobs

• Schema can be learned (e.g., Pig scripts)

• Common sort-partition-merge skeleton

• Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results)

• Fine-grained and pluggable scheduler

Tuning Optimizers (proactively try different plans)
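One way a learning optimizer might exploit the high ratio of repeated jobs is to remember the best setting observed for each job and reuse it on the next run. The sketch below illustrates that idea only; the `RunHistory` class, the job signature string, and the runtimes in the usage example are all hypothetical and are not taken from the talk or from any Hadoop API.

```python
class RunHistory:
    """Remember, per job signature, the fastest setting observed so far."""

    def __init__(self):
        self._best = {}  # job signature -> (runtime_seconds, setting)

    def record(self, job_signature, setting, runtime_seconds):
        """Learn from one finished execution of a job."""
        best = self._best.get(job_signature)
        if best is None or runtime_seconds < best[0]:
            self._best[job_signature] = (runtime_seconds, dict(setting))

    def recommend(self, job_signature, default):
        """Repeated job: reuse the best known setting. New job: use the default."""
        best = self._best.get(job_signature)
        return dict(best[1]) if best else dict(default)


if __name__ == "__main__":
    history = RunHistory()
    default = {"mapred.reduce.tasks": 1}
    # Made-up runtimes, for illustration only.
    history.record("nightly-sort", {"mapred.reduce.tasks": 10}, 900.0)
    history.record("nightly-sort", {"mapred.reduce.tasks": 30}, 600.0)
    print(history.recommend("nightly-sort", default))
    print(history.recommend("never-seen-job", default))
```

A tuning optimizer would go one step further and deliberately schedule runs with untried settings, feeding their runtimes into the same history.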

Page 14: Towards Automatic Optimization of MapReduce Programs (Position Paper)

Summary

• Call to action to improve automatic optimization techniques in MapReduce frameworks

– Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc.

– Rich history to learn from

– MapReduce execution creates unique opportunities/challenges