
Self-Serve Performance Tuning for Hadoop & Spark

The Fifth Elephant 2016

Akshay Rai
Engineer, Hadoop Development Team, LinkedIn
Dr. Elephant

© 2016 LinkedIn Corporation. All Rights Reserved.

Hadoop @ LinkedIn, c. 2008

● 1 cluster

● 20 nodes

● 10 users

● 10 workflows in production

● MapReduce, Pig


Hadoop @ LinkedIn, c. 2016

● > 10 clusters

● > 10000 nodes

● > 1000 users

● Thousands of queries and flows in development

● Hundreds running in production

● MapReduce, Pig, Hive, Spark, Scalding, Gobblin, Cubert

Scaling Hadoop Infrastructure

• Add extra machines to the cluster

• Hadoop is scalable, but not optimal out of the box!

• We cannot keep adding machines forever

• Tune within the given resources and minimize the addition of new machines


Measuring performance

• Highlights hardware failures and poorly performing components

• Reveals scope for environment upgrades


Cluster Level Performance Tuning

Job Level Performance Tuning

How difficult is it to tune a Job?

• Production gatekeeper: let jobs go into production only after verifying they are tuned.

• Restriction! More questions on how to tune, and more resources spent helping people.

Here's what we tried in order to achieve job tuning!


Challenges in tuning a job

• Hadoop is designed to let users tune their jobs, BUT:

• One cannot optimize without understanding the internals of the framework

• Critical information is scattered

• Hadoop has a huge set of parameters, and tuning some may impact others


You cannot tune what you do not know & you cannot improve what you cannot measure


Training Sessions


Training - Doesn't Scale

• More people means more frequent sessions

• Hadoop experience varies across people

• Framework-specific training is needed: Pig, Hive, etc.


Expert Review


Expert Review - Also Doesn’t Work

• Again, not scalable

• Cannot ensure a job is performing optimally; there is no easy comparison

• Different people, different perspectives, no consensus

• Error-prone; one might overlook certain aspects


Scaling Hadoop Infrastructure is HARD

Scaling User Productivity is much HARDER

Birth of Dr. Elephant


What does Dr. Elephant do?

• Helps every user get the best performance from their jobs

• Analyses and compares historical executions

• Provides a platform for other performance-related tools


Architecture


Rule #1: Mapper Data Skew


Mapper Skew Problem

• Varying split sizes can cause skew in the mapper input


Solution to Mapper Skew

• Each mapper should process roughly the same amount of data

• Combine the small chunks and feed them to a single mapper, as in the sketch below
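A minimal sketch of what that looks like in practice (not from the slides): a plain MapReduce job can coalesce small files with CombineTextInputFormat; the 512 MB cap is an illustrative value, not a recommendation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineSmallSplits {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "combine-small-splits");
            // Pack many small files/blocks into one split, so no mapper is
            // stuck with a tiny (or outsized) share of the input.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap a combined split at ~512 MB of input per mapper
            // (an illustrative target, not a recommendation).
            CombineTextInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
        }
    }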


Rule #2: Mapper Memory


Mapper Memory Problem & Solution

• Requested container memory >> task's consumed memory

• Example: request a 4 GB container

• The job actually uses only 512 MB

• You wait longer to get 4 GB, and then block 4 GB of resources!

• Request a lower container memory by setting mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb), as in the sketch below
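A hedged sketch of that fix in job code; the 1 GB container and heap sizes are illustrative values.

    import org.apache.hadoop.conf.Configuration;

    public class RightSizeContainer {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // The task peaks near 512 MB, so a 1 GB container is ample;
            // we stop asking for (and blocking) 4 GB per map task.
            conf.setInt("mapreduce.map.memory.mb", 1024);
            // Keep the JVM heap below the container size, or YARN will
            // kill the container; ~80% is a common rule of thumb.
            conf.set("mapreduce.map.java.opts", "-Xmx820m");
        }
    }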


Search


MapReduce Report


Job History


How to define a rule?


How does a Rule work?

INPUT: counters & task data

LOGIC: some logic that computes a value

OUTPUT: the computed value compared against threshold levels (see the sketch below)
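A minimal sketch of that input → logic → output shape, using mapper skew as the example; the class name, method, and the 2x/4x/8x thresholds are illustrative, not the exact Dr. Elephant heuristic API.

    import java.util.List;

    public class MapperSkewRuleSketch {
        enum Severity { NONE, LOW, MODERATE, SEVERE }

        // INPUT: bytes read by each map task, taken from the job's counters.
        static Severity apply(List<Long> mapInputBytes) {
            // LOGIC: ratio of the largest mapper's input to the average.
            long max = 0, sum = 0;
            for (long bytes : mapInputBytes) {
                max = Math.max(max, bytes);
                sum += bytes;
            }
            double ratio = max / ((double) sum / mapInputBytes.size());
            // OUTPUT: compare the value against threshold levels.
            if (ratio > 8) return Severity.SEVERE;
            if (ratio > 4) return Severity.MODERATE;
            if (ratio > 2) return Severity.LOW;
            return Severity.NONE;
        }
    }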


Customising Dr. Elephant

Adding a Custom Rule

1. Create a new rule and test it.

2. Create a help page defining the rule, the parameters to tune, etc.

3. Add the details of the rule to the HeuristicConf.xml file:

    <heuristic>
      <applicationtype>Mapreduce</applicationtype>
      <heuristicname>Rule Name</heuristicname>
      <classname>path.to.rule.class</classname>
      <viewname>path.to.rule.help.page</viewname>
    </heuristic>

4. Run Dr. Elephant. It should now include the new rule.

What else can you customize?

● Rules and their threshold levels

● Easy integration with new schedulers (Azkaban, Airflow, Oozie, etc.)

● Enable/disable Fetchers, or extend with new ones

● Extend to new application types and job types


Production Gatekeeper

Automated Production Reviews | JIRA Bot

• Cluster for critical workloads

• Audit before deployment


Workflow monitoring and reports

• Monitor performance on each execution

• Compare behaviour across revisions

• Cost-to-serve analysis


Open Source, April 2016

github.com/linkedin/dr-elephant

Watchers: 60 | Stars: 262 | Forks: 109

Let’s collectively contribute!


Dr. Elephant Community

Pull Requests: 60+

Contributors: 10+

User Topics: 50+


Coming Soon


● Real-time analysis of jobs

● Analytics for failed jobs

● Visualizing workflows through DAGs

● Support for other schedulers and frameworks

References

Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark

Open Source GitHub Link: github.com/linkedin/dr-elephant

Mailing List & Gitter: dr-elephant-users, linkedin/dr-elephant

Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA (Mark Wagner)


github.com/linkedin/dr-elephant

Thank You


Akshay Rai
https://in.linkedin.com/in/akshayrai09
