Apache Hadoop India Summit 2011 talk: "Scheduling in MapReduce using Machine Learning Techniques"
TRANSCRIPT
Scheduling in MapReduce using Machine Learning Techniques
IIIT Hyderabad
Vasudeva Varma [email protected] · Nanduri [email protected]
Cloud Computing Group, Search and Information Extraction Lab
http://search.iiit.ac.in
Agenda
• Cloud Computing Group @ IIIT Hyderabad
• Admission Control
• Task Assignment
• Conclusion
Cloud Computing Group @ IIIT Hyderabad
• Search and Information Extraction
  – Large datasets
  – Clusters of machines
  – Web crawling
  – Data-intensive applications
• MapReduce
  – Apache Hadoop
Research Areas
• Resource management for MapReduce
  – Scheduling
  – Data placement
• Power-aware resource management
• Data management in the cloud
• Virtualization
Teaching
• Cloud Computing course
  – Monsoon semester (2008 onwards)
  – Special focus on Apache Hadoop
    • MapReduce and HDFS
    • Mahout
  – Virtualization
  – NoSQL databases
  – Guest lectures from industry experts
Learning Based Admission Control and Task Assignment in MapReduce
• Learning-based approach
• Admission Control
  – Should we accept a job for execution in the cluster?
• Task Assignment
  – Which task should we choose to run on a given node?
Admission Control
Deciding if, and which, requests to accept from a set of incoming requests:
• Critical for achieving better QoS
• Important to prevent overcommitting resources
• Needed to maximize utility from the service provider's perspective
MapReduce as a Service
• Web services interface for MR jobs
• Users search for jobs through repositories
• Select one that matches their criteria
• Launch it on clusters managed by the service provider
• Service providers rent infrastructure from an IaaS provider
Utility Functions
• Three phases
• Soft and hard deadlines
• Decay parameters
• Provision for service provider penalty
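The three-phase shape described above can be sketched as follows. This is a minimal illustration, assuming full utility before the soft deadline, a linear decay between the soft and hard deadlines, and a provider penalty after the hard deadline; the actual decay form and parameter names used in the talk are not specified, so this is only an assumption.

```python
def utility(finish_time, soft_deadline, hard_deadline,
            max_utility, decay_rate, penalty):
    """Three-phase utility sketch: full utility before the soft
    deadline, decaying utility between the soft and hard deadlines,
    and a service-provider penalty once the hard deadline is missed.
    Linear decay is an assumption, not confirmed by the talk."""
    if finish_time <= soft_deadline:
        return max_utility
    if finish_time <= hard_deadline:
        # Decay phase: utility falls off with time past the soft deadline.
        elapsed = finish_time - soft_deadline
        return max_utility - decay_rate * elapsed
    # Hard deadline missed: the provider pays a penalty.
    return -penalty
```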
Our Approach
• Based on the Expected Utility Hypothesis from decision theory
• Accept the job that maximizes the expected utility
• Use a pattern classifier to classify incoming jobs
• Two classes; utility functions for prioritizing
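The Expected Utility Hypothesis step can be sketched as below: weight each outcome's utility by the classifier's predicted probability and admit the job with the highest expected value. The two-outcome form (success/failure utilities per job) is an assumption for illustration; the talk does not spell out the exact formula.

```python
def expected_utility(p_success, utility_success, utility_failure):
    # Expected Utility Hypothesis: weight each outcome's utility by
    # its probability and sum. The admission controller accepts the
    # job whose expected utility is highest.
    return p_success * utility_success + (1 - p_success) * utility_failure

def choose_job(candidates):
    """candidates: list of (job_name, p_success, u_success, u_failure)
    tuples; returns the name of the job maximizing expected utility."""
    return max(candidates,
               key=lambda c: expected_utility(c[1], c[2], c[3]))[0]
```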
Feature Vector
• Given as input to the classifier
• Contains job-specific and cluster-specific parameters
• Includes variables that might affect the admission decision

Cluster specific:
– Used map slots
– Used reduce slots
– Pending maps
– Pending reduces
– Finishing jobs
– Average map time
– Average reduce time

Job specific:
– Number of maps
– Number of reduces
– Mean map task time
– Mean reduce task time
Bayesian Classifier
• Naive Bayes assumption: conditionally independent parameters
• Works well in practice
• Uses past events to predict future outcomes
• Applies Bayes' theorem when computing probabilities
• Incremental learning, efficient w.r.t. memory usage
• Simple to implement
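The points above can be illustrated with a minimal incremental naive Bayes sketch over discrete features: training is just count updates (which is what makes it memory-efficient and incremental), and prediction applies Bayes' theorem under the conditional-independence assumption. The two-class labels and Laplace smoothing here are illustrative assumptions, not the talk's exact implementation.

```python
import math
from collections import defaultdict

class IncrementalNaiveBayes:
    """Sketch of an incrementally trained naive Bayes classifier over
    discrete features, assuming features are conditionally independent
    given the class (e.g. "good"/"bad")."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))

    def update(self, features, label):
        # Incremental learning: just bump counts, no retraining pass.
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[label][(i, value)] += 1

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_logprob = None, float("-inf")
        for label, count in self.class_counts.items():
            # Bayes' theorem in log space: log P(class) + sum of
            # log P(feature | class) under the independence assumption.
            logprob = math.log(count / total)
            for i, value in enumerate(features):
                # Laplace smoothing so unseen values don't zero out.
                num = self.feature_counts[label][(i, value)] + 1
                logprob += math.log(num / (count + 2))
            if logprob > best_logprob:
                best_label, best_logprob = label, logprob
        return best_label
```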
Evaluation
• Success/failure criterion: load management
• Simulation-based evaluation
• Baselines:
  – Myopic: immediately select the job that has the maximum utility
  – Random: randomly select one job from the candidate jobs
Algorithm Accuracy
Comparison with baseline
Algorithm       | Achieved load average
----------------|----------------------
Random          | 42.11
Myopic          | 42.09
Our algorithm   |  0.97
Meeting Deadlines
Task Assignment
• Deciding if a task can be assigned to a node
• Learning-based technique
• An extension of the admission control work presented above
Learning Scheduler
Features of Learning Scheduler
• Flexible task assignment based on the state of resources
• Considers the job profile while allocating
• Tries to avoid overloading task trackers
• Allows users to control assignment by specifying priority functions
• Incremental learning
Using Classifier
• Use a pattern classifier to classify candidate jobs
• Two classes: good and bad
• Good tasks don't overload task trackers
• Overload: a limit on the system load average, set by the admin
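The overload definition above (system load average exceeding an admin-set limit) can be sketched in a few lines. Using the one-minute load average from `os.getloadavg()` is an assumption for illustration; the scheduler's actual sampling of node state is not detailed in the talk.

```python
import os

def is_overloaded(threshold):
    """Overload as defined above: the system load average exceeds a
    limit configured by the admin. The choice of the one-minute
    average is an assumption, not stated in the talk."""
    one_minute_load, _, _ = os.getloadavg()
    return one_minute_load > threshold
```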
Feature Vector
• Job features
  – CPU, memory, network and disk usage of a job
• Node properties
  – Static: number of processors, maximum physical and virtual memory, CPU frequency
  – Dynamic: state of resources, number of running map tasks, number of running reduce tasks
Job Selection
• From the candidates labelled as good, select the one with maximum priority
• Create a task of the selected job
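The two-step selection above can be sketched as: filter to the candidates the classifier labels good, then take the one with the highest priority. The callable-based interface (`classifier`, `priority` as functions) is an illustrative assumption, not the scheduler's actual API.

```python
def select_job(candidates, classifier, priority):
    """From the candidates labelled 'good' by the classifier, pick the
    one with maximum priority; return None if no candidate qualifies.
    A task of the returned job would then be created for the node."""
    good = [job for job in candidates if classifier(job) == "good"]
    if not good:
        return None
    return max(good, key=priority)
```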
Priority (Utility) Functions
• Policy enforcement
  – FIFO: U(J) = J.age
  – Revenue oriented
• If the priority of all jobs is equal, the scheduler will always assign the task that has the maximum likelihood of being labelled good.
Job Profile
• Users submit 'hints' about job performance
• Resource consumption is estimated on a scale of 10, 10 being the highest
• This data is passed at job submission time through job parameters:
  – learnsched.jobstat.map = "1:2:3:4"
• The scheduler is made open source at http://code.google.com/p/learnsched/
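A hint string like "1:2:3:4" might be parsed as below. Mapping the four positions to CPU, memory, network and disk is an assumption based on the job features listed earlier in the talk; the actual field order in learnsched is not stated here.

```python
def parse_job_hint(hint):
    """Parse a learnsched-style hint string such as "1:2:3:4" into a
    resource-usage dict. The CPU/memory/network/disk position mapping
    is an assumption, not confirmed by the talk."""
    cpu, mem, net, disk = (int(part) for part in hint.split(":"))
    for value in (cpu, mem, net, disk):
        if not 1 <= value <= 10:
            raise ValueError("hints are on a scale of 1 to 10")
    return {"cpu": cpu, "memory": mem, "network": net, "disk": disk}
```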
Evaluation
• Evaluation workload:
  – TextWriter
  – WordCount
  – WordCount + 10 ms delay
  – URLGet
  – URLToDisk
  – CPU Activity
Learning Behaviour
Classifier Accuracy
Conclusions
• Feedback-informed classifiers can be used effectively
• Better QoS than naive approaches
• Less runtime: happy users and more revenue for the service provider
Thank you
IIIT Hyderabad
Questions/Suggestions/Comments?
Vasudeva Varma [email protected] · Nanduri [email protected]
Cloud Computing Group, Search and Information Extraction Lab
http://search.iiit.ac.in