MapReduce Lecture 2
TRANSCRIPT
-
7/31/2019 Map Reduce Lecture 2
1/47
UNIT II: Parallel Programming, MapReduce Model
Serial vs. Parallel Programming
A serial program consists of a sequence of instructions, where each instruction is executed one after the other.
In a parallel program, the processing is broken up into parts, each of which can be executed concurrently.
The Basics of Parallel Programming
Identify sets of tasks that can run concurrently and/or partitions of data that can be processed concurrently.
Sometimes it's just not possible: e.g., the Fibonacci function, where each value depends on the previous ones.
A common situation is having a large amount of consistent data which must be processed.
Example: a huge array which can be broken up into sub-arrays and processed concurrently.
Implementation technique: MASTER/WORKER
The MASTER:
initializes the array and splits it up according to the number of available WORKERS
sends each WORKER its sub-array
receives the results from each WORKER
The WORKER:
receives the sub-array from the MASTER
performs processing on the sub-array
returns results to the MASTER
An example of the MASTER/WORKER technique:
Approximating pi
Approximating pi..
Inscribe a circle of radius r in a square of side 2r.
The area of the square, denoted As, is (2r)^2 = 4r^2. The area of the circle, denoted Ac, is pi * r^2. So:
pi = Ac / r^2
r^2 = As / 4
pi = 4 * Ac / As
Parallelize this method
Randomly generate points in the square
Count the number of generated points that are both in the circle and in the square
r = the number of points in the circle divided by the number of points in the square
PI = 4 * r
NUMPOINTS = 100000; // some large number - the bigger, the closer the approximation
p = number of WORKERS;
numPerWorker = NUMPOINTS / p;
countCircle = 0; // one of these for each WORKER

// each WORKER does the following:
for (i = 0; i < numPerWorker; i++) {
    generate 2 random numbers that lie inside the square;
    xcoord = first random number;
    ycoord = second random number;
    if (xcoord, ycoord) lies inside the circle
        countCircle++;
}

// the MASTER:
receives from each WORKER its countCircle value
computes PI from these values: PI = 4.0 * (sum of countCircle values) / NUMPOINTS;
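The pseudocode above can be sketched as a runnable program. A minimal sketch using Python's multiprocessing pool as the MASTER/WORKER mechanism; the worker count and point count are illustrative choices, not from the lecture:

```python
import random
from multiprocessing import Pool

NUM_POINTS = 100_000
NUM_WORKERS = 4

def count_in_circle(num_points):
    """WORKER: sample points in the unit square [0,1)x[0,1) and count how
    many fall inside the quarter circle of radius 1 (the in-circle ratio
    is the same as for a full circle inscribed in a square)."""
    count = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            count += 1
    return count

if __name__ == "__main__":
    # MASTER: split the work, collect each worker's count, combine.
    per_worker = NUM_POINTS // NUM_WORKERS
    with Pool(NUM_WORKERS) as pool:
        counts = pool.map(count_in_circle, [per_worker] * NUM_WORKERS)
    pi_estimate = 4.0 * sum(counts) / (per_worker * NUM_WORKERS)
    print(pi_estimate)  # roughly 3.14; closer as NUM_POINTS grows
```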
MapReduce
How to painlessly process terabytes of data?
A Brief History
Functional programming (e.g., Lisp)
map() function: applies a function to each value of a sequence
reduce() function: combines all elements of a sequence using a binary operator
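These functional roots can be shown with Python's built-in map() and functools.reduce(); a sketch in Python rather than Lisp, with illustrative values:

```python
from functools import reduce

# map(): apply a function to each value of a sequence.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# reduce(): combine all elements of a sequence using a binary operator.
total = reduce(lambda a, b: a + b, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```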
What is MapReduce?
This model derives from the map and reduce combinators of a functional language like Lisp.
A restricted parallel programming model meant for large clusters
The user implements Map() and Reduce()
A parallel computing framework
Libraries take care of EVERYTHING else:
Parallelization
Fault tolerance
Data distribution
Load balancing
A useful model for many practical tasks
Map and Reduce
Map(): processes a key/value pair to generate intermediate key/value pairs
Reduce(): merges all intermediate values associated with the same key
Example: Counting Words
Map()
Input: a document (key = document name, value = document contents)
Parses the file and emits <word, 1> pairs
Reduce()
Sums all values for the same key and emits <word, total count>
MapReduce: Programming Model
[Diagram: the input lines "How now / Brown cow" and "How does / It work now" pass through map (M) tasks and reduce (R) tasks in the MapReduce framework, producing the output counts: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.]
Example Use of MapReduce
Counting words in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
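A runnable sketch of the word-count pseudocode above, with the framework's grouping of intermediate values by key simulated in-process; the sample documents are illustrative:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair for each word in the document.
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    # Sum all the counts emitted for the same word.
    return word, sum(counts)

documents = {"doc1": "how now brown cow",
             "doc2": "how does it work now"}

# The framework's job: group intermediate values by key.
grouped = defaultdict(list)
for name, text in documents.items():
    for word, count in map_fn(name, text):
        grouped[word].append(count)

result = dict(reduce_fn(w, counts) for w, counts in grouped.items())
print(result["how"])  # 2
print(result["now"])  # 2
```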
MapReduce Examples
Distributed grep
Map function emits the line if it matches the search criteria
Reduce function is the identity function
URL access frequency
Map function processes web logs, emits <URL, 1>
Reduce function sums the values and emits <URL, total count>
MapReduce: Programming Model
More formally:
Map(k1, v1) --> list(k2, v2)
Reduce(k2, list(v2)) --> list(v2)
MapReduce Runtime System
1. Partitions input data
2. Schedules execution across a set of machines
3. Handles machine failure
4. Manages interprocess communication
MapReduce Benefits
Greatly reduces parallel programming complexity:
Reduces synchronization complexity
Automatically partitions data
Provides failure transparency
Handles load balancing
Practical: approximately 1000 Google MapReduce jobs run every day.
Google Computing Environment
Typical clusters contain 1000's of machines
Dual-processor x86's running Linux, with 2-4 GB memory
Commodity networking: typically 100 Mb/s or 1 Gb/s
IDE drives connected to individual machines
Distributed file system
How MapReduce Works
User to-do list:
Indicate:
Input/output files
M: number of map tasks
R: number of reduce tasks
W: number of machines
Write the map and reduce functions
Submit the job
This requires no knowledge of parallel/distributed systems!!!
What about everything else?
MapReduce Execution Overview
1. The user program, via the MapReduce library, shards the input data.
[Diagram: the user program splits the input data into Shard 0 .. Shard 6.]
* Shards are typically 16-64 MB in size
Data Distribution
Input files are split into M pieces on the distributed file system
Typically ~64 MB blocks
Intermediate files created by map tasks are written to local disk
Output files are written to the distributed file system
MapReduce Execution Overview
2. The user program creates process copies distributed on a machine cluster. One copy will be the Master and the others will be workers.
[Diagram: the user program forks one Master process and several Worker processes.]
MapReduce Execution Overview
3. The master distributes M map and R reduce tasks to idle workers.
M == the number of shards
R == the intermediate key space is divided into R parts
[Diagram: the Master sends a Do_map_task message to an idle Worker.]
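The slide above says the intermediate key space is divided into R parts. A minimal sketch of how such a split is commonly done, hashing each key modulo R (the MapReduce paper's default partitioning function); R = 3 and the sample keys are illustrative choices:

```python
import hashlib

R = 3  # number of reduce tasks, i.e. partitions of the key space

def partition(key, r=R):
    # Use a stable hash so the same key always lands in the same
    # partition (Python's built-in hash() is randomized per run).
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % r

# Every intermediate key maps to exactly one of the R partitions,
# so all values for a given key reach the same reduce task.
parts = {k: partition(k) for k in ["how", "now", "brown", "cow"]}
assert all(0 <= p < R for p in parts.values())
assert partition("how") == partition("how")  # deterministic
```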
Assigning Tasks
Many copies of the user program are started
One instance becomes the Master
The Master finds idle machines and assigns them tasks
Tries to exploit data locality by running map tasks on machines holding the data
MapReduce Execution Overview
4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs.
Output is buffered in RAM.
[Diagram: a map worker reads Shard 0 and produces key/value pairs.]
MapReduce Execution Overview
5. Each worker flushes its intermediate values, partitioned into R regions, to local disk and notifies the Master process.
[Diagram: the map worker writes to local storage and sends the disk locations to the Master.]
MapReduce Execution Overview
6. The Master process gives the disk locations to an available reduce-task worker, which reads all the associated intermediate data.
[Diagram: the Master sends disk locations to a reduce worker, which reads the intermediate data from remote storage.]
MapReduce Execution Overview
7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
[Diagram: the reduce worker sorts its data and appends to a partition output file.]
MapReduce Execution Overview
8. The Master process wakes up the user process when all tasks have completed. The output is contained in R output files.
[Diagram: the Master wakes up the user program; the results are in the output files.]
Observations
No reduce can begin until map is complete
Tasks are scheduled based on the location of data
If a map worker fails at any time before reduce finishes, its task must be completely rerun
The Master must communicate the locations of intermediate files
The MapReduce library does most of the hard work for us!
[Diagram: map tasks read input key/value pairs from the data stores and emit intermediate (key, values...) pairs; a barrier then aggregates the intermediate values by output key; reduce tasks consume each key with its intermediate values and produce the final values for each key.]
Fault Tolerance
Workers are periodically pinged by the master
No response = failed worker
Map-task failure: re-execute
All output was stored locally
Reduce-task failure: only re-execute partially completed tasks
All output is stored in the global file system
The Master writes periodic checkpoints
Fault Tolerance
On errors, workers send a last-gasp UDP packet to the master
This detects records that cause deterministic crashes, so they can be skipped
Input file blocks are stored on multiple machines
When the computation is almost done, in-progress tasks are rescheduled
Avoids stragglers
Conclusions
Simplifies large-scale computations that fit this model
Allows the user to focus on the problem without worrying about details
Computer architecture is not very important
Portable model
MapReduce Applications: relational operations using MapReduce
Relational operations using MapReduce
Enterprise applications rely on structured data processing
They depend on the relational data model and SQL
Parallel databases support parallel execution
Drawback: they lack scale and fault tolerance
MapReduce provides both
Relational operations using MapReduce..
A relational join can be executed in parallel using MapReduce
E.g., given a sales table and a city table, compute the gross sales by city
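The join above can be sketched as a map-reduce computation in which each map tags its records with the source table, records are grouped by the join key, and the reduce pairs each city's name with its summed sales. The table contents, column layout, and city names are illustrative assumptions, not from the lecture:

```python
from collections import defaultdict

# sales(city_id, amount) and city(city_id, name) -- sample rows
sales = [(1, 100.0), (2, 50.0), (1, 25.0)]
city = [(1, "Pune"), (2, "Mumbai")]

def map_sales(row):
    city_id, amount = row
    return city_id, ("sales", amount)   # tag with the source table

def map_city(row):
    city_id, name = row
    return city_id, ("city", name)

# Shuffle: group tagged records by the join key (city_id).
grouped = defaultdict(list)
for row in sales:
    k, v = map_sales(row)
    grouped[k].append(v)
for row in city:
    k, v = map_city(row)
    grouped[k].append(v)

def reduce_join(city_id, records):
    # Join: pair the city's name with the sum of its sales amounts.
    name = next(v for tag, v in records if tag == "city")
    total = sum(v for tag, v in records if tag == "sales")
    return name, total

gross_sales = dict(reduce_join(k, vs) for k, vs in grouped.items())
print(gross_sales)  # {'Pune': 125.0, 'Mumbai': 50.0}
```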
Enterprise Batch Processing using MapReduce
Enterprise context: there is interest in leveraging the MapReduce model for high-throughput batch processing and analysis of data.
Batch processing operations
End-of-day processing
Needs to access and compute over large datasets
Time-bound
Constraint: online availability of the transaction processing system
An opportunity to accelerate batch processing
Example: revaluing customer portfolios
References
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
Josh Carter, http://multipart-mixed.com/software/mapreduce_presentation.pdf
Ralf Lammel, Google's MapReduce Programming Model Revisited
http://code.google.com/edu/parallel/mapreduce-tutorial.html