Apache Hadoop India Summit 2011 talk: "Scheduling in MapReduce using Machine Learning Techniques"
TRANSCRIPT
Scheduling in MapReduce using Machine Learning Techniques
IIIT Hyderabad
Vasudeva Varma [email protected] · Nanduri [email protected]
Cloud Computing Group, Search and Information Extraction Lab
http://search.iiit.ac.in
Agenda
• Cloud Computing Group @ IIIT Hyderabad
• Admission Control
• Task Assignment
• Conclusion
Cloud Computing Group @ IIIT Hyderabad
• Search and Information Extraction
  – Large datasets
  – Clusters of machines
  – Web crawling
  – Data-intensive applications
• MapReduce
  – Apache Hadoop
Research Areas
• Resource management for MapReduce
  – Scheduling
  – Data placement
• Power-aware resource management
• Data management in the cloud
• Virtualization
Teaching
• Cloud Computing course
  – Monsoon semester (2008 onwards)
  – Special focus on Apache Hadoop
    • MapReduce and HDFS
    • Mahout
  – Virtualization
  – NoSQL databases
  – Guest lectures from industry experts
Learning Based Admission Control and Task Assignment in MapReduce
• Learning-based approach
• Admission Control
  – Should we accept a job for execution in the cluster?
• Task Assignment
  – Which task should we choose to run on a given node?
Admission Control
Deciding if, and which, requests to accept from a set of incoming requests:
• Critical for achieving better QoS
• Important to prevent overcommitting resources
• Needed to maximize utility from the service provider's perspective
MapReduce as a Service
• Web services interface for MR jobs
• Users search for jobs through repositories
• Select one that matches their criteria
• Launch it on clusters managed by the service provider
• Service providers rent infrastructure from an IaaS provider
Utility Functions
• Three phases
• Soft and hard deadlines
• Decay parameters
• Provision for service provider penalty
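The three-phase shape described above can be sketched as follows. This is a minimal illustration, assuming full utility before the soft deadline, a linear decay between the soft and hard deadlines, and a provider penalty after the hard deadline; the actual decay form and parameter names used in the talk are not specified, so this is only an assumption.

```python
def utility(finish_time, soft_deadline, hard_deadline,
            max_utility, decay_rate, penalty):
    """Three-phase utility sketch: full utility before the soft
    deadline, decaying utility between the soft and hard deadlines,
    and a service-provider penalty once the hard deadline is missed.
    Linear decay is an assumption, not confirmed by the talk."""
    if finish_time <= soft_deadline:
        return max_utility
    if finish_time <= hard_deadline:
        # Decay phase: utility falls off with time past the soft deadline.
        elapsed = finish_time - soft_deadline
        return max_utility - decay_rate * elapsed
    # Hard deadline missed: the provider pays a penalty.
    return -penalty
```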
Our Approach
• Based on the Expected Utility Hypothesis from decision theory
• Accept the job that maximizes the expected utility
• Use a pattern classifier to classify incoming jobs
• Two classes; utility functions for prioritizing
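The Expected Utility Hypothesis step can be sketched as below: weight each outcome's utility by the classifier's predicted probability and admit the job with the highest expected value. The two-outcome form (success/failure utilities per job) is an assumption for illustration; the talk does not spell out the exact formula.

```python
def expected_utility(p_success, utility_success, utility_failure):
    # Expected Utility Hypothesis: weight each outcome's utility by
    # its probability and sum. The admission controller accepts the
    # job whose expected utility is highest.
    return p_success * utility_success + (1 - p_success) * utility_failure

def choose_job(candidates):
    """candidates: list of (job_name, p_success, u_success, u_failure)
    tuples; returns the name of the job maximizing expected utility."""
    return max(candidates,
               key=lambda c: expected_utility(c[1], c[2], c[3]))[0]
```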
Feature Vector
• Given as input to the classifier
• Contains job-specific and cluster-specific parameters
• Includes variables that might affect the admission decision

Cluster specific:
– Used map slots
– Used reduce slots
– Pending maps
– Pending reduces
– Finishing jobs
– Average map time
– Average reduce time

Job specific:
– Number of maps
– Number of reduces
– Mean map task time
– Mean reduce task time
Bayesian Classifier
• Naive Bayes assumption: conditionally independent parameters
• Works well in practice
• Uses past events to predict future outcomes
• Applies Bayes' theorem when computing probabilities
• Incremental learning, efficient w.r.t. memory usage
• Simple to implement
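The points above can be illustrated with a minimal incremental naive Bayes sketch over discrete features: training is just count updates (which is what makes it memory-efficient and incremental), and prediction applies Bayes' theorem under the conditional-independence assumption. The two-class labels and Laplace smoothing here are illustrative assumptions, not the talk's exact implementation.

```python
import math
from collections import defaultdict

class IncrementalNaiveBayes:
    """Sketch of an incrementally trained naive Bayes classifier over
    discrete features, assuming features are conditionally independent
    given the class (e.g. "good"/"bad")."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))

    def update(self, features, label):
        # Incremental learning: just bump counts, no retraining pass.
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[label][(i, value)] += 1

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_logprob = None, float("-inf")
        for label, count in self.class_counts.items():
            # Bayes' theorem in log space: log P(class) + sum of
            # log P(feature | class) under the independence assumption.
            logprob = math.log(count / total)
            for i, value in enumerate(features):
                # Laplace smoothing so unseen values don't zero out.
                num = self.feature_counts[label][(i, value)] + 1
                logprob += math.log(num / (count + 2))
            if logprob > best_logprob:
                best_label, best_logprob = label, logprob
        return best_label
```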
Evaluation
• Success/failure criterion: load management
• Simulation-based evaluation
• Baselines:
  – Myopic: immediately select the job that has the maximum utility
  – Random: randomly select one job from the candidate jobs
Algorithm Accuracy
Comparison with baseline
Algorithm       | Achieved load average
----------------|----------------------
Random          | 42.11
Myopic          | 42.09
Our algorithm   |  0.97
Meeting Deadlines
Task Assignment
• Deciding if a task can be assigned to a node
• Learning-based technique
• An extension of the admission control work presented above
Learning Scheduler
Features of Learning Scheduler
• Flexible task assignment based on the state of resources
• Considers the job profile while allocating
• Tries to avoid overloading task trackers
• Allows users to control assignment by specifying priority functions
• Incremental learning
Using Classifier
• Use a pattern classifier to classify candidate jobs
• Two classes: good and bad
• Good tasks don't overload task trackers
• Overload: a limit on the system load average, set by the admin
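The overload definition above (system load average exceeding an admin-set limit) can be sketched in a few lines. Using the one-minute load average from `os.getloadavg()` is an assumption for illustration; the scheduler's actual sampling of node state is not detailed in the talk.

```python
import os

def is_overloaded(threshold):
    """Overload as defined above: the system load average exceeds a
    limit configured by the admin. The choice of the one-minute
    average is an assumption, not stated in the talk."""
    one_minute_load, _, _ = os.getloadavg()
    return one_minute_load > threshold
```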
Feature Vector
• Job features
  – CPU, memory, network and disk usage of a job
• Node properties
  – Static: number of processors, maximum physical and virtual memory, CPU frequency
  – Dynamic: state of resources, number of running map tasks, number of running reduce tasks
Job Selection
• From the candidates labelled as good, select the one with maximum priority
• Create a task of the selected job
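The two-step selection above can be sketched as: filter to the candidates the classifier labels good, then take the one with the highest priority. The callable-based interface (`classifier`, `priority` as functions) is an illustrative assumption, not the scheduler's actual API.

```python
def select_job(candidates, classifier, priority):
    """From the candidates labelled 'good' by the classifier, pick the
    one with maximum priority; return None if no candidate qualifies.
    A task of the returned job would then be created for the node."""
    good = [job for job in candidates if classifier(job) == "good"]
    if not good:
        return None
    return max(good, key=priority)
```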
Priority (Utility) Functions
• Policy enforcement
  – FIFO: U(J) = J.age
  – Revenue oriented
• If the priority of all jobs is equal, the scheduler will always assign the task that has the maximum likelihood of being labelled good.
Job Profile
• Users submit 'hints' about job performance
• Resource consumption is estimated on a scale of 10, 10 being the highest
• This data is passed at job submission time through job parameters:
  – learnsched.jobstat.map = "1:2:3:4"
• The scheduler is made open source at http://code.google.com/p/learnsched/
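A hint string like "1:2:3:4" might be parsed as below. Mapping the four positions to CPU, memory, network and disk is an assumption based on the job features listed earlier in the talk; the actual field order in learnsched is not stated here.

```python
def parse_job_hint(hint):
    """Parse a learnsched-style hint string such as "1:2:3:4" into a
    resource-usage dict. The CPU/memory/network/disk position mapping
    is an assumption, not confirmed by the talk."""
    cpu, mem, net, disk = (int(part) for part in hint.split(":"))
    for value in (cpu, mem, net, disk):
        if not 1 <= value <= 10:
            raise ValueError("hints are on a scale of 1 to 10")
    return {"cpu": cpu, "memory": mem, "network": net, "disk": disk}
```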
Evaluation
• Evaluation workload:
  – TextWriter
  – WordCount
  – WordCount + 10 ms delay
  – URLGet
  – URLToDisk
  – CPU Activity
Learning Behaviour
Classifier Accuracy
Conclusions
• Feedback-informed classifiers can be used effectively
• Better QoS than naive approaches
• Less runtime: happy users and more revenue for the service provider
Thank you
IIIT Hyderabad
Questions/Suggestions/Comments?
Vasudeva Varma [email protected] · Nanduri [email protected]
Cloud Computing Group, Search and Information Extraction Lab
http://search.iiit.ac.in