![Page 1: Characterization of Hadoop Jobs using Unsupervised Learningsalsahpc.indiana.edu/CloudCom2010/slides/PDF...Hadoop at Yahoo! • Behind Every Click ! • 38,000+ Servers • Largest](https://reader035.vdocument.in/reader035/viewer/2022071211/6022c07a6b4fa03b015ca25a/html5/thumbnails/1.jpg)
Characterization of Hadoop Jobs using Unsupervised Learning
Sonali Aggarwal, Shashank Phadke, Milind Bhandarkar
[[email protected], {sphadke,milindb}@yahoo-inc.com]
Hadoop at Yahoo!
• Behind Every Click!
• 38,000+ Servers
• Largest cluster is 4000+ servers
• 1+ Million Jobs per month
• 170+ PB of Storage
• 10+ TB Compressed Data Added Per Day
• 1000+ Users
Hadoop Growth
[Figure: Hadoop Servers (thousands of servers, axis 0 to 90) and Hadoop Storage (petabytes, axis 0 to 250) by year, 2006 to 2010, across three phases: Research, Science Impact, Daily Production. Annotation "Today": 38K servers, 170 PB storage, 1M+ monthly jobs.]
Hadoop Clusters
• Hadoop Dev, QA, Benchmarking (10%)
• Sandbox, Release Validation (10%)
• Science, Ad-Hoc Usage (50%)
• Production (30%)
Benchmarking Hadoop - Part 1
• Micro benchmarks
  • Sort
  • TestDFSIO
  • SmallJobs
  • NNBench
  • ...
Benchmarking Hadoop - Part 2
• Application Kernels - GridMix - 1
• Most common “real” applications
• Machine Learning, ETL, Graph Algorithms
• Synthetic data
GridMix
• GridMix Version 2
  • GridMix Version 1 + Load Generator
  • Synthetic Scheduling
• GridMix Version 3
  • Real Scheduling + Synthetic Load Generator
GridMix V3
• Collect production job metrics + Submission statistics
• Generate workload
• Execute on benchmarking cluster
• With different versions of Hadoop
GridMix Problems
• Benchmarking cluster has fewer resources than production clusters
• Impossible to simulate large intervals
• Different production clusters have different workload characteristics
• Workload characteristics change over time
Our Solution
• Insight: Most applications are executed periodically
• Automatically determine prominent application characteristics
• Simulate only prominent applications
Our Approach
• Fetch Job metrics, and Job configurations (job.xml)
• Extract “features”
• Determine number of job clusters
• Find centroids of job clusters
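
The first step, reading job configurations, might look like the minimal sketch below. It assumes the standard Hadoop `job.xml` layout of `<property>` elements with `<name>` and `<value>` children. The sample document and the choice of properties are illustrative assumptions; the slides do not list which keys were actually extracted.

```python
import xml.etree.ElementTree as ET

# Hypothetical job.xml content; real files carry hundreds of properties.
SAMPLE_JOB_XML = """<configuration>
  <property><name>mapred.map.tasks</name><value>79</value></property>
  <property><name>mapred.reduce.tasks</name><value>28</value></property>
  <property><name>mapred.output.compress</name><value>true</value></property>
  <property><name>mapred.job.name</name><value>example</value></property>
</configuration>
"""

# Keys chosen for illustration only (assumption, not taken from the slides).
WANTED = ("mapred.map.tasks", "mapred.reduce.tasks", "mapred.output.compress")


def read_job_conf(xml_text, wanted=WANTED):
    """Return {property name: value string} for the wanted keys found in a job.xml."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in wanted:
            conf[name] = prop.findtext("value")
    return conf


job_conf = read_job_conf(SAMPLE_JOB_XML)
print(job_conf)
```

Values come back as strings, so numeric and nominal properties still need to be converted before they can serve as features.
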
Task Counters
Job Metrics
• Number of Map Tasks
• Number of Reduce Tasks
• Slots per task
• InputFormat / OutputFormat
• Type of output and intermediate data compression
Task Metrics
• HDFS Bytes Read/Written
• File Bytes Read/Written
• Combiner Records Ratio
• Shuffle Bytes per Reduce Task
• ...
• Mean and StdDev of each task metric across tasks: used as a job metric
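
Collapsing per-task counters into job-level metrics can be sketched as follows. The counter name and values are hypothetical, and the population standard deviation (`pstdev`) is an assumption, since the slides do not say which estimator was used.

```python
from statistics import mean, pstdev

# Hypothetical per-task counters for one job (HDFS bytes read by each map task).
map_tasks = [
    {"hdfs_bytes_read": 40_000_000},
    {"hdfs_bytes_read": 50_000_000},
    {"hdfs_bytes_read": 45_000_000},
]


def summarize(tasks, counter):
    """Collapse one per-task counter into (mean, standard deviation) for the job."""
    values = [t[counter] for t in tasks]
    return mean(values), pstdev(values)


avg, spread = summarize(map_tasks, "hdfs_bytes_read")
print(f"mean={avg:.0f} stddev={spread:.0f}")
```

Each task counter then contributes two numeric features to the job, one for the typical task and one for the variation between tasks.
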
Job Clustering
• Rescale features
• Find and eliminate correlated features
• Random sampling of jobs
  • sqrt(N)
• Determine number of clusters
  • Minimize “within group sum of squares”
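
Two of these preprocessing steps, eliminating correlated features and sampling jobs, might be sketched as below. This is a sketch under assumptions: Pearson correlation with a 0.95 cutoff is a hypothetical choice, and sqrt(N) is read here as the sample size drawn from N jobs.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def drop_correlated(columns, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def sample_jobs(jobs, seed=0):
    """Draw a random sample of about sqrt(N) jobs without replacement."""
    size = max(1, math.isqrt(len(jobs)))
    return random.Random(seed).sample(jobs, size)


# Hypothetical feature columns: shuffle_bytes is an exact multiple of map_output_bytes.
columns = {
    "n_maps": [10, 80, 15, 200, 40],
    "map_output_bytes": [5, 1, 9, 3, 7],
    "shuffle_bytes": [10, 2, 18, 6, 14],
}
kept = drop_correlated(columns)
sample = sample_jobs(list(range(10_000)))
print(kept, len(sample))
```

Sampling keeps the later clustering step affordable: a naive agglomerative pass over all jobs would grow far faster than the job count.
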
Feature Vectors
• Numeric Features
  • e.g. Number of tasks
  • Rescale, mean=0, stddev=1
• Nominal Features
  • e.g. Type of Compression
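
A feature vector mixing both kinds of features could be built as in the sketch below. The rescaling to mean 0 and standard deviation 1 follows the slide. The one-hot encoding of nominal features is an assumption, because the slides do not say how nominal values were turned into numbers.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a nominal column as one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    return categories, [[1.0 if v == c else 0.0 for c in categories] for v in values]


# Hypothetical jobs: number of map tasks and output compression type.
n_maps = [10, 20, 30, 40]
compression = ["none", "gzip", "lzo", "gzip"]

scaled = zscore(n_maps)
categories, indicators = one_hot(compression)
vectors = [[s] + row for s, row in zip(scaled, indicators)]
print(categories)
print(vectors[0])
```

With this encoding every job becomes a plain list of floats, so Euclidean distance between jobs is well defined.
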
Number of Clusters
K-Means Clustering
• Find initial seeds
  • Hierarchical Agglomerative Clustering
• Euclidean distance between jobs
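
The combination of agglomerative seeding and K-Means can be sketched in plain Python as below, together with the within group sum of squares used to judge a given number of clusters. The centroid-linkage merge rule and the toy job vectors are assumptions; the slides name the techniques but not the linkage or the implementation.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def hac_seeds(points, k):
    """Merge the two clusters with the closest centroids until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        pairs = ((dist2(cents[i], cents[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        _, i, j = min(pairs)
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid after the pop
    return [centroid(c) for c in clusters]


def kmeans(points, seeds, max_iter=100):
    """Lloyd's algorithm started from the given seeds; returns (centroids, labels)."""
    cents = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(cents)), key=lambda c: dist2(p, cents[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(cents)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                cents[c] = centroid(members)
    return cents, labels


def wss(points, cents, labels):
    """Within group sum of squares: squared distance of each point to its centroid."""
    return sum(dist2(p, cents[l]) for p, l in zip(points, labels))


# Hypothetical rescaled job vectors forming two obvious groups.
jobs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
for k in (1, 2, 3):
    cents, labels = kmeans(jobs, hac_seeds(jobs, k))
    print(k, round(wss(jobs, cents, labels), 3))
```

The within group sum of squares always shrinks as k grows, so in practice one looks for the k after which the decrease flattens out rather than for the absolute minimum.
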
Job Clusters
Centroids
Largest Cluster
• nMaps = 79, nReduces = 28
• HDFS Bytes read / Map = 44.82 MB
• HDFS Bytes written / Reduce = 54.85 MB
• Input Records per Map = 334K
• Output Records per Reduce = 235K
Validating GridMix v3
• Clustering real workload
• Clustering corresponding GridMix3 workload
• Compare clusters
• Results in same number and sizes of clusters
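
One way to compare the two clusterings is sketched below: check that both runs yield the same number of clusters and that the clusters, ordered by size, cover similar fractions of the workload. The 5% tolerance and the sample labels are hypothetical, and the slides do not describe how the comparison was actually carried out.

```python
from collections import Counter


def size_profile(labels):
    """Cluster sizes as fractions of the workload, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted((n / total for n in counts.values()), reverse=True)


def same_shape(real_labels, synthetic_labels, tolerance=0.05):
    """True if both workloads have equally many clusters with similar size fractions."""
    real = size_profile(real_labels)
    synth = size_profile(synthetic_labels)
    return len(real) == len(synth) and all(
        abs(r - s) <= tolerance for r, s in zip(real, synth)
    )


# Hypothetical cluster labels; cluster IDs need not match between the two runs.
real_labels = [0] * 50 + [1] * 30 + [2] * 20
gridmix_labels = [2] * 49 + [0] * 31 + [1] * 20
print(same_shape(real_labels, gridmix_labels))
```

Comparing sorted size fractions avoids having to match cluster IDs between runs, at the cost of ignoring where the centroids lie.
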
Conclusions
• GridMix v3 accurately simulates real workload
• Number of prominent types of Hadoop jobs on Yahoo! production cluster is 8
• Job Clustering can be used to generate benchmark suites
Future Work
• Automate the process
• Incorporate in GridMix version 4
• Extend to Pig job chains
• Extend to Oozie workflows
• Contribute to Open Source
Questions?