A STUDY ON MAPREDUCE

Presented by: Saili Ghavat, Shikha Soni

Source: meseec.ce.rit.edu/756-projects/fall2014/1-3.pdf

What is common between all of them?

MapReduce: Introduction

Framework for parallel computing

Prominent parallel data processing tool

Uses clustered resources

Gives programmers a high level of abstraction: they do not have to deal with parallelization, load balancing, or fault tolerance

Data analysis

Map and Reduce functions

Map: emits key-value pairs

Reduce: runs once for each unique key, combining that key's values
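As a rough illustration of the two functions, here is a minimal single-process sketch using word counting. The names map_fn, reduce_fn and run_mapreduce are hypothetical and chosen only for this example; a real framework would run them across a cluster.

```python
from collections import defaultdict


def map_fn(document):
    """Map: emit one (word, 1) key-value pair per word in the document."""
    for word in document.split():
        yield (word.lower(), 1)


def reduce_fn(key, values):
    """Reduce: called once per unique key with all of that key's values."""
    return (key, sum(values))


def run_mapreduce(documents):
    """Toy driver: group the mapped pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())
```

Calling run_mapreduce on a list of strings returns a dictionary of word counts; the grouping inside the driver stands in for the partitioning and sorting steps described in the following slides.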

Step 1: Splitting the input

The large input data is first divided into a large number of smaller portions (splits). There are at least as many splits as worker machines, so every worker has something to work on.
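The splitting step can be sketched as follows. This is only an in-memory illustration: split_input is a hypothetical helper, and a real system splits files on a distributed file system rather than Python lists.

```python
def split_input(records, num_splits):
    """Divide records into num_splits contiguous, near-equal portions."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    base, extra = divmod(len(records), num_splits)
    splits, start = [], 0
    for i in range(num_splits):
        # The first `extra` splits take one additional record each.
        size = base + (1 if i < extra else 0)
        splits.append(records[start:start + size])
        start += size
    return splits
```

With fewer records than splits, some splits come back empty, which is why the input is normally much larger than the number of workers.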

Step 2: Master and worker coordination

[Diagram: one master coordinating four workers]

Step 3: Mapping by each worker

Each worker now generates key-value pairs from the data assigned to it. The map function discards all the irrelevant data and passes on only the key-value pairs for the data that is to be filtered or sorted. Because the workers run in parallel, performance can scale close to linearly with the number of workers.
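A small sketch of the mapping step, assuming threads stand in for worker machines and that the "relevant" records are lines containing a given pattern. Both map_split and map_all are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def map_split(split, pattern="error"):
    """Keep only the relevant records, emitting (key, value) pairs."""
    return [(pattern, line) for line in split if pattern in line]


def map_all(splits, max_workers=4):
    """Run map_split over every split in parallel (threads play the workers)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input splits.
        return list(pool.map(map_split, splits))
```

Each split is processed independently of the others, which is what lets the work be spread over more machines.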

Step 4: Partitioning within the workers

[Diagram: a map worker writing its output to Partition 1, Partition 2 and Partition 3]

The partitioning function segregates the data further. It is simply the hash of the key modulo the number of reduce workers, so all pairs with the same key end up in the same partition. This makes the reduce stage easier.
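The partitioning rule can be sketched like this. CRC32 is used here only as an assumed stand-in for the hash function, because it gives the same value in every process; partition and partition_pairs are hypothetical names.

```python
import zlib


def partition(key, num_reducers):
    """Partition index = hash(key) mod number of reduce workers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_pairs(pairs, num_reducers):
    """Place each (key, value) pair into the bucket for its reduce worker."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because the index depends only on the key, every map worker sends a given key to the same reduce worker.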

Step 5: Reduce sort

Once the map workers are done, the reduce workers are notified to start working on the data produced by the map workers. Each reduce worker contacts every map worker via remote procedure calls to get the (key, value) data targeted for its partition. This data is then sorted by key, so at the end of this step all data with the same key is grouped together.
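A sketch of what a reduce worker does in this step, assuming each map worker's output is already a list of partition buckets. Plain function calls stand in for the remote procedure calls, and fetch_partition and sort_and_group are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter


def fetch_partition(map_outputs, partition_id):
    """Collect this reducer's partition from every map worker's output."""
    pairs = []
    for buckets in map_outputs:  # one list of buckets per map worker
        pairs.extend(buckets[partition_id])
    return pairs


def sort_and_group(pairs):
    """Sort by key, then gather the values of each key into one list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]
```

The sort is what makes grouping cheap: once equal keys are adjacent, a single pass is enough to collect their values.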

Step 6: Final reducing step

[Diagram: three sets of intermediate files feeding a reduce worker]

The reduce function is applied to each unique key and its grouped values. This step produces the result the MapReduce job was run for. That can be finding a word in a large file, studying web search logs, or any other kind of data processing.

These functions run on distributed file systems such as GFS and AFS.
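The final step can be sketched as a loop over the grouped keys. The input shape, a list of (key, list of values) tuples, is an assumption made for this example, and reduce_partition and count are hypothetical names.

```python
def reduce_partition(grouped, reduce_fn):
    """Apply reduce_fn once per key; grouped is [(key, [values, ...]), ...]."""
    return [(key, reduce_fn(key, values)) for key, values in grouped]


def count(key, values):
    """Example reduce function: total the values seen for this key."""
    return sum(values)
```

Swapping count for a different reduce function changes what the job computes without touching the rest of the pipeline.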

Understanding Map Reduce better

Lisp Map & Reduce

The map function takes a function and a set of values as parameters. The function is then applied to each of the values:

(map 'length '(() (a) (ab) (abc)))

Here length is applied to each of the values, and the length of each value is returned.

The reduce function is given a binary function and a set of values. It uses the binary function to combine all the values into a single result.
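For readers more familiar with Python, the same idea can be sketched with the built-in map and functools.reduce. The sample lists below are an assumption made for illustration and are not taken from the slide.

```python
from functools import reduce
from operator import add

# Hypothetical sample data: four lists of increasing length.
values = [[], ["a"], ["a", "b"], ["a", "b", "c"]]

# map applies the one-argument function len to every value.
lengths = list(map(len, values))  # [0, 1, 2, 3]

# reduce combines the results pairwise with the binary function add.
total = reduce(add, lengths)  # 6
```

Note that map works on each value independently, while reduce needs a function of two arguments to fold the results together.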

Google’s Observation

Keyword search

MAP: the basic work of the search engine, that is, finding the keyword and returning the URL

REDUCE: combines all the resulting URLs that contain the keyword and returns them
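A toy version of this keyword search, assuming pages are held in a dictionary from URL to text. The functions map_search, reduce_search and search are hypothetical and say nothing about how Google's search engine is actually built.

```python
def map_search(url, text, keyword):
    """Map: emit (keyword, url) if the page contains the keyword."""
    if keyword.lower() in text.lower().split():
        yield (keyword, url)


def reduce_search(keyword, urls):
    """Reduce: combine every URL found for the keyword into one result."""
    return (keyword, sorted(set(urls)))


def search(pages, keyword):
    """Toy driver: map over every page, then reduce the matching URLs."""
    urls = []
    for url, text in pages.items():
        urls.extend(found for _, found in map_search(url, text, keyword))
    return reduce_search(keyword, urls)
```

Each page can be mapped on a different worker, since deciding whether one page matches does not depend on any other page.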

Google’s need for MapReduce

Logs of client requests and web resources at the largest search engine

Large derived data

Simple computations, but a large amount of processing

Processing distributed among machines

Google’s new abstraction allows simple computations to be expressed while hiding the messy details of parallelism

Hadoop: Open Source MapReduce

Google’s patent

Not open source

Hadoop creation

Basis of Hadoop and MapReduce

http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf

Scheduling in Hadoop

Two types: Fair Scheduling and Capacity Scheduling

Fair Scheduling: works when we have a single queue of jobs

Equal share of physical resources

A single MR job occupies the whole cluster

Capacity Scheduling: a more sophisticated type of scheduling

Multiple queues can work side by side

Each process in each queue is guaranteed to get cluster resources
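The two policies can be illustrated with a toy slot calculation. This is only a sketch of the idea, not Hadoop's scheduler code: fair_share and capacity_share are hypothetical functions, and expressing queue guarantees as whole percentages is an assumption made here.

```python
def fair_share(total_slots, jobs):
    """Fair scheduling: split the slots as evenly as possible among jobs."""
    if not jobs:
        return {}
    base, extra = divmod(total_slots, len(jobs))
    # The first `extra` jobs receive one additional slot each.
    return {job: base + (1 if i < extra else 0) for i, job in enumerate(jobs)}


def capacity_share(total_slots, queue_percent):
    """Capacity scheduling: each queue is guaranteed its configured percentage."""
    return {queue: total_slots * pct // 100 for queue, pct in queue_percent.items()}
```

With a single job in the list, fair_share hands it every slot, which matches the point above that a single MR job occupies the whole cluster.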

MRShare: framework for sharing multi-query executions in MapReduce

Finds an optimal way of grouping a set of queries using dynamic programming

Transforms a batch of queries into a new batch that will be executed more efficiently

Merges jobs into groups and evaluates each group as a single query
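The benefit of evaluating a group of jobs as a single query can be sketched by counting input scans. This example does not implement MRShare's dynamic-programming grouping; it only assumes that a group has already been chosen, and run_separately and run_merged are hypothetical names.

```python
def run_separately(records, jobs):
    """Baseline: every job scans the shared input on its own."""
    outputs, scans = {}, 0
    for name, map_fn in jobs.items():
        scans += 1
        outputs[name] = [pair for rec in records for pair in map_fn(rec)]
    return outputs, scans


def run_merged(records, jobs):
    """Merged group: one scan of the input, with output kept per job."""
    outputs = {name: [] for name in jobs}
    for rec in records:  # single pass over the shared input
        for name, map_fn in jobs.items():
            outputs[name].extend(map_fn(rec))
    return outputs, 1
```

Both functions return the same per-job output; the merged version simply reads the shared input once instead of once per job.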

Suppose |D| is the size of the input data that n MR jobs share.

The complexity of sorting the combined map output of all jobs is then O(n ∙ |D| log(n ∙ |D|)).
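To see where the extra cost of merging comes from, the bound can be expanded. The step below assumes that each of the n jobs produces map output of size proportional to |D|, so the combined output has size n ∙ |D|.

```latex
\[
O\bigl(n\,|D| \log(n\,|D|)\bigr)
  = O\bigl(n\,|D| \log |D| \;+\; n\,|D| \log n\bigr)
\]
```

The first term is the total cost of n separate sorts of size |D|; the second term, n ∙ |D| log n, is the additional sorting work introduced by combining the outputs.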

Some problems where MapReduce has been used

Distributed grep (search for words)

1. Map: emit a line if it matches a given pattern

2. Reduce: just copy the intermediate data to the output

Count URL access frequency

1. Map: process logs of web page access; output (URL, 1)

2. Reduce: add all values for the same URL
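Both problems can be sketched in a few lines. The log format "client URL" and all four function names are assumptions made for this example.

```python
import re
from collections import Counter


def grep_map(line, pattern):
    """Map: emit the line if it matches the given pattern."""
    if re.search(pattern, line):
        yield (line, "")


def grep_reduce(pairs):
    """Reduce: copy the intermediate data to the output unchanged."""
    return [line for line, _ in pairs]


def url_map(log_line):
    """Map: emit (URL, 1) per access; assumes lines look like 'client URL'."""
    _, url = log_line.split()
    yield (url, 1)


def url_reduce(pairs):
    """Reduce: add all values for the same URL."""
    counts = Counter()
    for url, n in pairs:
        counts[url] += n
    return dict(counts)
```

In grep the reduce step is an identity function, while in the URL count it does the actual aggregation.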

Debates

Major step backwards in parallel processing compared to DBMS

Hadoop is scalable but achieves very low efficiency

Hadoop stands out in the “Gray Sort Benchmark test” for 100 TB sorting

Not a cheap solution

Maintaining a cluster is costly and difficult

Increases fault tolerance

Advantages

Simple and easy to use

Flexible

Independent of the storage

Fault tolerant

High scalability

Limitations of MapReduce

Low efficiency

Cannot be used when computations depend on previously calculated values

Can handle large data sets, but constrains a program's ability to work with smaller data items

Software in place of Hadoop and MapReduce

DISCO

Greenplum

Aster Data

Companies that have adopted similar algorithms

A9.com

AOL

Facebook

The New York Times

Last.fm

Baidu.com

Joost

Veoh

References

http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf

http://csci89802.blogspot.com/2012/10/limitations-of-mapreduce-where-not-to.html

http://www.dbms2.com/2008/01/18/the-great-mapreduce-debate/

http://www.cse.buffalo.edu/~stevko/courses/cse704/fall10/papers/cse704-mapreduce.pdf

http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf

http://blog.cloudera.com/blog/2014/03/the-truth-about-mapreduce-performance-on-ssds/

https://www.cs.rutgers.edu/~pxk/417/notes/content/mapreduce.html