Towards Automatic Optimization of MapReduce Programs (Position Paper), Shivnath Babu, Duke University
![Page 1: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/1.jpg)
Towards Automatic Optimization of MapReduce Programs
(Position Paper)
Shivnath Babu
Duke University
![Page 2: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/2.jpg)
Roadmap
• Call to action to improve automatic optimization techniques in MapReduce frameworks
• Challenges & promising directions
[Diagram labels: Pig, Hive, JAQL, …; Hadoop; HDFS]
![Page 3: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/3.jpg)
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a MapReduce job
![Page 4: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/4.jpg)
![Page 5: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/5.jpg)
Lifecycle of a MapReduce Job
[Figure labels: Input Splits; Map Wave 1; Reduce Wave 1; Map Wave 2; Reduce Wave 2; axis: Time]
How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
![Page 6: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/6.jpg)
Job Configuration Parameters
• 190+ parameters in Hadoop
• Set manually or defaults are used
• Are defaults or rules-of-thumb good enough?
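To make "set manually or defaults are used" concrete, here is a minimal Python sketch of how a job's effective settings could be derived by overlaying manual overrides on defaults. The parameter names come from later slides in this deck; the default values in `ASSUMED_DEFAULTS` and the helper `effective_config` are illustrative assumptions, not taken from the talk, and real defaults vary by Hadoop version.

```python
# Assumed defaults for illustration only; check the defaults of your Hadoop version.
ASSUMED_DEFAULTS = {
    "mapred.reduce.tasks": 1,
    "io.sort.factor": 10,
    "io.sort.record.percent": 0.05,
}


def effective_config(overrides, defaults=ASSUMED_DEFAULTS):
    """Return the settings a job would run with: defaults overlaid by overrides."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def left_at_default(overrides, defaults=ASSUMED_DEFAULTS):
    """List the parameters the user did not set, i.e. those falling back to defaults."""
    return sorted(name for name in defaults if name not in overrides)


manual = {"mapred.reduce.tasks": 28}
print(effective_config(manual))
print(left_at_default(manual))
```

With 190+ parameters, most of them would typically end up in the `left_at_default` list, which is why the quality of the defaults matters.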
![Page 7: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/7.jpg)
Experiments
On EC2 and local clusters
[Figure: four plots; y-axis labels: Running time (seconds) on two plots, Running time (minutes) on two plots]
![Page 8: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/8.jpg)
Illustrative Result: 50GB Terasort
17-node cluster, 64+32 concurrent map+reduce slots
• Performance at default and rule-of-thumb settings can be poor
• Cross-parameter interactions are significant

| mapred.reduce.tasks | io.sort.factor | io.sort.record.percent |
|---|---|---|
| 10 | 10 | 0.15 |
| 10 | 500 | 0.15 |
| 28 | 10 | 0.15 |
| 300 | 10 | 0.15 |
| 300 | 500 | 0.15 |

[Running time for each setting: values not preserved in this transcript]
[Annotation on one setting: "Based on popular rule-of-thumb"]
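The five settings listed on this slide vary two parameters while holding `io.sort.record.percent` fixed. The sketch below (Python, with helper names that are illustrative and not from the talk) collects the distinct values seen for each parameter and forms their cross-product, to show how quickly the number of combinations grows even for this tiny example.

```python
from itertools import product

PARAMS = ("mapred.reduce.tasks", "io.sort.factor", "io.sort.record.percent")

# The five settings listed on the slide, in the slide's order.
SLIDE_SETTINGS = [
    (10, 10, 0.15),
    (10, 500, 0.15),
    (28, 10, 0.15),
    (300, 10, 0.15),
    (300, 500, 0.15),
]


def value_grid(settings):
    """Collect the distinct values each parameter takes across the settings."""
    return [sorted({s[i] for s in settings}) for i in range(len(PARAMS))]


def full_grid(settings):
    """Every combination of the observed values (their cross-product)."""
    return list(product(*value_grid(settings)))


grid = full_grid(SLIDE_SETTINGS)
untested = [g for g in grid if g not in SLIDE_SETTINGS]
print(len(grid))   # 6 combinations from 3 x 2 x 1 values
print(untested)    # [(28, 500, 0.15)]
```

Three values for one parameter and two for another already give six combinations. Because the parameters interact, tuning them one at a time is not enough, and the grid grows multiplicatively with every parameter added.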
![Page 9: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/9.jpg)
Problem Space
[Figure axes: Complexity; Space of execution choices]
[Figure labels:]
• Job configuration parameters
• Declarative HiveQL/Pig operations
• Multi-job workflows
• Performance objectives
• Cost in pay-as-you-go environment
• Energy considerations
Current approaches:
• Predominantly manual
• Post-mortem analysis
Is this where we want to be?
![Page 10: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/10.jpg)
Can DB Query Optimization Technology Help?
[Diagram labels: Query; MapReduce job; Optimizer; Good plan; Good setting of parameters; Database Execution Engine; Hadoop; Results]
Optimizer:
• Enumerate
• Cost
• Search
But:
– MapReduce jobs are not declarative
– No schema about the data
– Impact of concurrent jobs & scheduling?
– Space of parameters is huge
Can we:
– Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years
• Or innovate!
– Exploit design & usage properties of MapReduce frameworks
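The enumerate / cost / search structure named on this slide can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `toy_cost` is a made-up stand-in with arbitrary constants, not a cost model for Hadoop, and the candidate values are simply borrowed from the Terasort slide.

```python
from itertools import product


def enumerate_settings(space):
    """Enumerate: yield every combination of candidate values as a dict."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def toy_cost(setting):
    """Cost: a made-up formula with arbitrary constants (NOT a real Hadoop model)."""
    reducers = setting["mapred.reduce.tasks"]
    factor = setting["io.sort.factor"]
    return abs(reducers - 32) + 1000.0 / factor


def search(space, cost):
    """Search: exhaustive scan that returns the lowest-cost setting."""
    return min(enumerate_settings(space), key=cost)


space = {"mapred.reduce.tasks": [10, 28, 300], "io.sort.factor": [10, 500]}
best = search(space, toy_cost)
print(best)
```

Exhaustive search is only workable for a tiny space like this one. The slide's point that the space of parameters is huge is exactly what would force a real optimizer to prune or sample instead.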
![Page 11: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/11.jpg)
Spectrum of Query Optimizers
Rule-based
Conventional Optimizers
Cost models + statistics about data
AT’s Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks
Insight: Predictability(RBO) >> Predictability(CBO)
![Page 12: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/12.jpg)
Spectrum of Query Optimizers
Learning Optimizers (learn from execution & adapt)
Tuning Optimizers (proactively try different plans)
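One way to read "proactively try different plans" is as a trial loop that actually runs the job under several candidate settings and keeps the fastest. The Python sketch below assumes a caller-supplied `run_job` callable that returns a running time in seconds. `tune`, `fake_runner`, and the trial budget are hypothetical names introduced here, and the fake runner's formula is a placeholder rather than a measurement.

```python
import random


def tune(candidates, run_job, budget, seed=0):
    """Run up to `budget` randomly chosen candidates; return (best setting, seconds)."""
    rng = random.Random(seed)
    trials = rng.sample(candidates, min(budget, len(candidates)))
    # The index breaks ties so that dict settings are never compared directly.
    timed = [(run_job(setting), i, setting) for i, setting in enumerate(trials)]
    seconds, _, best = min(timed)
    return best, seconds


def fake_runner(setting):
    """Placeholder for launching a real job and timing it (made-up formula)."""
    return 1000.0 / setting["io.sort.factor"]


candidates = [{"io.sort.factor": f} for f in (10, 100, 500)]
best, seconds = tune(candidates, fake_runner, budget=3)
print(best, seconds)
```

Unlike a cost-based optimizer, this loop needs no model of the system, but every trial costs a real job execution, so the budget is the main design knob.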
![Page 13: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/13.jpg)
Spectrum of Query Optimizers
Exploit usage & design properties of MapReduce frameworks:
• High ratio of repeated jobs to new jobs
• Schema can be learned (e.g., Pig scripts)
• Common sort-partition-merge skeleton
• Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results)
• Fine-grained and pluggable scheduler
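The first property above, a high ratio of repeated jobs to new jobs, suggests that an optimizer could remember what it observed on earlier runs of the same job. The following Python sketch illustrates that idea only: the `RunHistory` class, the job signature string, and the timings are all hypothetical and do not come from the talk.

```python
from collections import defaultdict


class RunHistory:
    """Remember the best observed running time per job signature and setting."""

    def __init__(self):
        self._runs = defaultdict(dict)  # signature -> {setting key: seconds}

    def record(self, signature, setting, seconds):
        """Store an observation, keeping the fastest time seen for that setting."""
        key = tuple(sorted(setting.items()))
        previous = self._runs[signature].get(key)
        if previous is None or seconds < previous:
            self._runs[signature][key] = seconds

    def best_setting(self, signature):
        """Return the fastest known setting for a repeated job, or None if unseen."""
        runs = self._runs.get(signature)
        if not runs:
            return None
        return dict(min(runs, key=runs.get))


history = RunHistory()
# Made-up timings for a hypothetical recurring job.
history.record("nightly-sort", {"mapred.reduce.tasks": 10}, 900.0)
history.record("nightly-sort", {"mapred.reduce.tasks": 28}, 600.0)
print(history.best_setting("nightly-sort"))
print(history.best_setting("never-seen-job"))
```

A new job gets `None` and would fall back to defaults or rules, while a repeated job benefits from every earlier execution.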
![Page 14: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University](https://reader035.vdocument.in/reader035/viewer/2022062408/56649eef5503460f94bfecc1/html5/thumbnails/14.jpg)
Summary
• Call to action to improve automatic optimization techniques in MapReduce frameworks
– Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc.
– Rich history to learn from
– MapReduce execution creates unique opportunities/challenges