Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques"

Vasudeva Varma and Radheshyam Nanduri, IIIT Hyderabad
Cloud Computing Group, Search and Information Extraction Lab
http://search.iiit.ac.in

Upload: yahoo-developer-network

Post on 02-Nov-2014


TRANSCRIPT

Page 1: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma

Scheduling in MapReduce using Machine Learning Techniques

IIIT Hyderabad

Vasudeva Varma
Radheshyam Nanduri

Cloud Computing Group
Search and Information Extraction Lab

http://search.iiit.ac.in

Page 2

Agenda

• Cloud Computing Group @ IIIT Hyderabad
• Admission Control
• Task Assignment
• Conclusion

Page 3

Cloud Computing Group @ IIIT Hyderabad

• Search and Information Extraction
  – Large datasets
  – Clusters of machines
  – Web crawling
  – Data-intensive applications
• MapReduce
  – Apache Hadoop

Page 4

Research Areas

• Resource management for MapReduce
  – Scheduling
  – Data placement
• Power-aware resource management
• Data management in cloud
• Virtualization

Page 5

Teaching

• Cloud Computing course
  – Monsoon semester (2008 onwards)
  – Special focus on Apache Hadoop
    • MapReduce and HDFS
    • Mahout
  – Virtualization
  – NoSQL databases
  – Guest lectures from industry experts

Page 6

Learning Based Admission Control and Task Assignment in MapReduce

• Learning based approach
• Admission Control
  – Should we accept a job for execution in the cluster?
• Task Assignment
  – Which task to choose for running on a given node?

Page 7

Admission Control

Deciding whether to accept a request, and which one, from a set of incoming requests

• Critical in achieving better QoS
• Important to prevent overcommitting
• Needed to maximize utility from the perspective of a service provider

Page 8

MapReduce as a Service

• Web services interface for MR jobs
• Users search for jobs through repositories
• Select one that matches their criteria
• Launch it on clusters managed by the service provider
• Service providers rent infrastructure from an IaaS provider

Page 9

Utility Functions

• Three phase
• Soft and hard deadlines
• Decay parameters
• Provision for service provider penalty
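The slide names the ingredients of the utility function (three phases, soft and hard deadlines, decay parameters, a provider penalty) but the transcript does not preserve its exact shape. The sketch below is one plausible reading, not the authors' formula: the linear decay, the parameter names and the function name `three_phase_utility` are all assumptions made for illustration.

```python
def three_phase_utility(finish_time, soft_deadline, hard_deadline,
                        max_utility, decay_rate, penalty_rate):
    """Hypothetical three-phase utility of a job that finishes at finish_time.

    All shapes and parameter names here are assumptions, not the talk's formula.
    """
    if finish_time <= soft_deadline:
        # Phase 1: finished by the soft deadline, full utility is earned.
        return max_utility
    if finish_time <= hard_deadline:
        # Phase 2: utility decays (linearly, by assumption) after the soft deadline.
        return max_utility - decay_rate * (finish_time - soft_deadline)
    # Phase 3: past the hard deadline a steeper penalty rate applies,
    # so the value can go negative (the provider pays a penalty).
    utility_at_hard = max_utility - decay_rate * (hard_deadline - soft_deadline)
    return utility_at_hard - penalty_rate * (finish_time - hard_deadline)
```

With a soft deadline of 10, a hard deadline of 20, a maximum utility of 100, a decay rate of 2 and a penalty rate of 10, a job finishing at time 30 would cost the provider 20 units under these assumptions.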

Page 10

Our Approach

• Based on the Expected Utility Hypothesis from decision theory
• Accept the job that maximizes the expected utility
• Use a pattern classifier to classify incoming jobs
• Two classes
• Utility functions for prioritizing
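As a rough illustration of the decision rule above, the sketch below treats expected utility as the classifier's probability of success multiplied by the job's utility. The 0.5 cut-off for the two classes, the function name `admit_job` and the callables `p_success` and `utility` are assumptions for this example; the slides do not specify them.

```python
def admit_job(candidates, p_success, utility, threshold=0.5):
    """Pick the incoming job to accept, or None if no job should be admitted.

    p_success(job) -> classifier's probability that accepting the job succeeds.
    utility(job)   -> value of the job to the service provider.
    The threshold separating the two classes is an assumed value.
    """
    # Keep only the jobs the classifier puts in the "accept" class.
    accepted = [job for job in candidates if p_success(job) >= threshold]
    if not accepted:
        return None
    # Among those, maximize expected utility = P(success) * utility.
    return max(accepted, key=lambda job: p_success(job) * utility(job))
```

For example, with three candidates whose (probability, utility) pairs are (0.9, 10), (0.6, 20) and (0.2, 100), the third is rejected by the classifier and the second wins with an expected utility of 12.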

Page 11

Feature Vector

• Given as input to the classifier
• Contains job specific and cluster specific parameters
• Includes variables that might affect the admission decision

Cluster Specific

Used map slots

Used reduce slots

Pending maps

Pending reduces

Finishing jobs

Map time average

Reduce time average

Job Specific

Number of maps

Number of reduces

Mean map task time

Mean reduce task time
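The eleven parameters listed above can be gathered into one record before being handed to the classifier. This is a minimal sketch: the field names are transliterations of the slide labels, and the class name `AdmissionFeatures`, the field types and the ordering are assumptions.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class AdmissionFeatures:
    # Cluster specific parameters
    used_map_slots: int
    used_reduce_slots: int
    pending_maps: int
    pending_reduces: int
    finishing_jobs: int
    map_time_average: float
    reduce_time_average: float
    # Job specific parameters
    num_maps: int
    num_reduces: int
    mean_map_task_time: float
    mean_reduce_task_time: float

    def as_vector(self):
        """Flatten the record into a list of floats, in declaration order."""
        return [float(value) for value in astuple(self)]
```

A scheduler would build one such record per incoming job from the current cluster state and the job's own estimates, then pass `as_vector()` to the classifier.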

Page 12

Bayesian Classifier

• Naive Bayes assumption
  – Conditionally independent parameters
• Works well in practice
• Use past events to predict future outcomes
• Application of Bayes theorem while computing probabilities
• Incremental learning – efficient w.r.t. memory usage
• Simple to implement
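To make the points above concrete, here is a small count-based Naive Bayes classifier that learns one observation at a time and stores only counters, which is what keeps memory use low. It is a sketch under assumptions, not the authors' code: features are assumed to be discretized into hashable values, Laplace smoothing is added to avoid zero probabilities, and the class labels "good" and "bad" are borrowed from the task assignment slides.

```python
import math
from collections import defaultdict


class IncrementalNaiveBayes:
    """Count-based Naive Bayes over discrete feature values (illustrative)."""

    def __init__(self, classes=("good", "bad")):
        self.classes = classes
        self.class_counts = {c: 0 for c in classes}
        # counts[c][i][v]: how often feature i had value v in class c
        self.counts = {c: defaultdict(lambda: defaultdict(int)) for c in classes}
        # values[i]: distinct values seen so far for feature i
        self.values = defaultdict(set)

    def update(self, features, label):
        """Incremental learning step: fold one labelled observation into the counts."""
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.counts[label][i][v] += 1
            self.values[i].add(v)

    def posterior(self, features):
        """Return P(class | features) for every class via Bayes theorem."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            # Laplace-smoothed class prior
            score = math.log((self.class_counts[c] + 1) / (total + len(self.classes)))
            for i, v in enumerate(features):
                n_values = len(self.values[i] | {v})
                # Naive assumption: features are independent given the class
                score += math.log((self.counts[c][i][v] + 1)
                                  / (self.class_counts[c] + n_values))
            log_scores[c] = score
        # Normalize in log space for numerical stability
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: e / norm for c, e in exp_scores.items()}
```

Before any feedback arrives the two classes are equally likely; each call to `update` shifts the posterior without retaining the observation itself.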

Page 13

Evaluation

• Success/failure criteria: load management
• Simulation
• Baseline
  – Myopic: immediately select the job that has maximum utility
  – Random: randomly select one job from the candidate jobs
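The two baselines are simple enough to state in a few lines. The sketch below follows the slide's wording; the function names and the `utility` callable are assumptions introduced for illustration.

```python
import random


def myopic_select(candidates, utility):
    """Myopic baseline: take the job with maximum utility right now,
    ignoring the current state of the cluster."""
    return max(candidates, key=utility)


def random_select(candidates, rng=random):
    """Random baseline: pick one candidate job uniformly at random."""
    return rng.choice(candidates)
```

Neither baseline looks at cluster load, which is the point of comparison with the learning based approach.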

Page 14

Algorithm Accuracy

Page 15

Comparison with baseline

Algorithm        Achieved load average
Random           42.11
Myopic           42.09
Our algorithm    0.97

Page 16

Meeting Deadlines

Page 17

Task Assignment

• Deciding if a task can be assigned to a node
• Learning based technique
• Extension of the work presented before

Page 18

Learning Scheduler

Page 19

Features of Learning Scheduler

• Flexible task assignment – based on state of resources
• Considers job profile while allocating
• Tries to avoid overloading task trackers
• Allows users to control assignment by specifying priority functions
• Incremental learning

Page 20

Using Classifier

• Use a pattern classifier to classify candidate jobs

• Two classes: good and bad
• Good tasks don't overload task trackers
• Overload: a limit set on system load average by the admin
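The overload rule above gives a direct way to label a past assignment for feedback to the classifier. A minimal sketch, assuming the load average is sampled after the task has been placed; the function name and the choice of a strict comparison are illustrative assumptions.

```python
def label_assignment(load_average_after, overload_limit):
    """Label an assignment 'bad' if it pushed the task tracker's load
    average past the admin-defined limit, otherwise 'good'."""
    return "bad" if load_average_after > overload_limit else "good"
```

With an admin limit of 4.0, an observed load average of 3.2 yields "good" and 5.1 yields "bad".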

Page 21

Feature Vector

• Job features
  – CPU, memory, network and disk usage of a job
• Node properties
  – Static: number of processors, maximum physical and virtual memory, CPU frequency
  – Dynamic: state of resources, number of running map tasks, number of running reduce tasks

Page 22

Job Selection

• From the candidates labelled as good, select the one with maximum priority
• Create a task from the selected job

Page 23

Priority (Utility) Functions

• Policy enforcement
  – FIFO: U(J) = J.age
  – Revenue oriented
• If the priority of all jobs is equal, the scheduler will always assign the task that has the maximum likelihood of being labelled good.
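One way to reconcile priority functions with the equal-priority behaviour described above is to rank the candidates labelled good by priority multiplied by the probability of being good. That product, the 0.5 class cut-off, and the names `fifo_priority` and `select_job` are assumptions for this sketch; only the FIFO rule U(J) = J.age comes from the slide.

```python
def fifo_priority(job):
    """FIFO policy from the slide: U(J) = J.age, so older jobs rank higher."""
    return job["age"]


def select_job(candidates, p_good, priority, threshold=0.5):
    """Choose the job whose task should run next on a node, or None.

    p_good(job)   -> classifier's probability that the assignment is 'good'.
    priority(job) -> the utility function U(J) set by the administrator.
    """
    # Only candidates the classifier labels as good are eligible.
    good = [job for job in candidates if p_good(job) >= threshold]
    if not good:
        return None
    # Rank by priority weighted by P(good); with equal priorities this
    # reduces to picking the job most likely to be labelled good.
    return max(good, key=lambda job: priority(job) * p_good(job))
```

Passing a constant priority such as `lambda job: 1` reproduces the equal-priority case, where the most confidently good job is chosen.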

Page 24

Job Profile

• Users submit 'hints' about job performance
• Estimate the job's resource consumption on a scale of 10, 10 being the highest
• This data is passed at job submission time through job parameters:
  – learnsched.jobstat.map - "1:2:3:4"
• This scheduler is open source at http://code.google.com/p/learnsched/
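A hint string such as "1:2:3:4" could be decoded along the following lines. The slides do not say which position maps to which resource, so the order CPU, memory, network, disk used here is an assumption taken from the order the job features are listed in, as is the accepted range of 0 to 10; `parse_job_profile` is a hypothetical helper, not part of learnsched.

```python
# Assumed field order; the transcript does not document it.
RESOURCES = ("cpu", "memory", "network", "disk")


def parse_job_profile(value):
    """Parse a colon-separated hint like '1:2:3:4' into per-resource levels."""
    parts = value.strip().strip('"').split(":")
    if len(parts) != len(RESOURCES):
        raise ValueError(f"expected {len(RESOURCES)} fields, got {len(parts)}")
    levels = [int(part) for part in parts]
    for level in levels:
        # Scale of 10 per the slide; the lower bound of 0 is an assumption.
        if not 0 <= level <= 10:
            raise ValueError(f"level {level} is outside the 0..10 scale")
    return dict(zip(RESOURCES, levels))
```

Under these assumptions, the example value from the slide decodes to a job that is light on CPU and heaviest on disk.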

Page 25

Evaluation

• Evaluation workload
  – TextWriter
  – WordCount
  – WordCount + 10ms delay
  – URLGet
  – URLToDisk
  – CPU Activity

Page 26

Learning Behaviour

Page 27

Classifier Accuracy

Page 28

Conclusions

• Feedback informed classifiers can be used effectively
• Better QoS than naive approaches
• Less runtime, happy users, more revenue for the service provider

Page 29

Thank you

IIIT Hyderabad

Questions/Suggestions/Comments?

Vasudeva Varma
Radheshyam Nanduri

Cloud Computing Group
Search and Information Extraction Lab

http://search.iiit.ac.in