MapReduce Lecture 2


TRANSCRIPT

  • Slide 1

    Parallel Programming, MapReduce Model
    UNIT II

  • Slide 2

    Serial vs. Parallel Programming

    A serial program consists of a sequence of instructions, where each instruction is executed one after the other.

    In a parallel program, the processing is broken up into parts, each of which can be executed concurrently.

  • Slide 3

    The Basics of Parallel Programming

    Identifying sets of tasks that can run concurrently and/or partitions of data that can be processed concurrently

    Sometimes it's just not possible: the Fibonacci function, where each value depends on the two before it

    A common situation is having a large amount of consistent data which must be processed.

  • Slide 4

    A huge array which can be broken up into sub-arrays

  • Slide 5

    Implementation technique: master/worker

    The MASTER:
    initializes the array and splits it up according to the number of available WORKERS
    sends each WORKER its subarray
    receives the results from each WORKER

    The WORKER:
    receives the subarray from the MASTER
    performs processing on the subarray
    returns results to the MASTER

  • Slide 6

    An example of the MASTER/WORKER technique:

    Approximating pi

  • Slide 7

    Approximating pi..

    Inscribe a circle of radius r in a square with side 2r. The area of the square is As = (2r)^2 = 4r^2. The area of the circle is Ac = pi * r^2. So:

    pi = Ac / r^2
    r^2 = As / 4
    pi = 4 * Ac / As

  • Slide 8

    Parallelize this method

    Randomly generate points in the square

    Count the number of generated points that are both in the circle and in the square

    r = the number of points in the circle divided by the number of points in the square

    PI = 4 * r

  • Slide 9

    The MASTER/WORKER version, as runnable Python (a process pool stands in for the pool of WORKERS):

    import random
    from multiprocessing import Pool

    NUMPOINTS = 100000  # some large number - the bigger, the closer the approximation
    p = 4               # number of WORKERS (assumes p divides NUMPOINTS)

    def worker(numPerWorker):
        random.seed()        # re-seed in each WORKER process
        countCircle = 0      # one of these for each WORKER
        for _ in range(numPerWorker):
            # generate a random point that lies inside the square [-1, 1] x [-1, 1]
            xcoord = random.uniform(-1.0, 1.0)
            ycoord = random.uniform(-1.0, 1.0)
            if xcoord * xcoord + ycoord * ycoord <= 1.0:  # (xcoord, ycoord) lies inside the circle
                countCircle += 1
        return countCircle

    if __name__ == "__main__":
        # MASTER: sends each WORKER its share of the points,
        # then receives from the WORKERS their countCircle values
        with Pool(p) as pool:
            counts = pool.map(worker, [NUMPOINTS // p] * p)
        # MASTER computes PI from these values
        PI = 4.0 * sum(counts) / NUMPOINTS
        print(PI)

  • Slide 10

    MapReduce

    How to painlessly process terabytes of data?

  • Slide 11

    A Brief History

    Functional programming (e.g., Lisp):
    map() function: applies a function to each value of a sequence
    reduce() function: combines all elements of a sequence using a binary operator
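
    For example, the two combinators in plain Python (map() is a built-in; reduce() lives in functools):

    from functools import reduce

    squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
    total = reduce(lambda a, b: a + b, squares)         # 1 + 4 + 9 + 16 = 30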

  • Slide 12

    What is MapReduce?

    This model derives from the map and reduce combinators of a functional language like Lisp.

    Restricted parallel programming model meant for large clusters
    User implements Map() and Reduce()

    Parallel computing framework
    Libraries take care of EVERYTHING else:
    Parallelization
    Fault Tolerance
    Data Distribution
    Load Balancing

    Useful model for many practical tasks

  • Slide 13

    Map and Reduce

    Map(): process a key/value pair to generate intermediate key/value pairs

    Reduce(): merge all intermediate values associated with the same key

  • Slide 14

    Example: Counting Words

    Map()
    Input: <filename, file contents>
    Parses the file and emits <word, 1> pairs
    e.g. <"cow", 1>

    Reduce()
    Sums all values for the same key and emits <word, total>
    e.g. <"cow", (1, 1)> => <"cow", 2>

  • Slide 15

    MapReduce: Programming Model

    [Diagram: the input lines "How now brown cow" and "How does it work now" pass through Map tasks (M) and Reduce tasks (R) inside the MapReduce framework, producing the output counts: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.]

  • Slide 16

    Example Use of MapReduce

    Counting words in a large set of documents:

    map(string key, string value)
      // key: document name
      // value: document contents
      for each word w in value
        EmitIntermediate(w, 1);

    reduce(string key, iterator values)
      // key: word
      // values: list of counts
      int result = 0;
      for each v in values
        result += ParseInt(v);
      Emit(AsString(result));
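
    A minimal runnable sketch of the same job in ordinary Python, with a dictionary standing in for the framework's shuffle/group step (map_fn, reduce_fn, and the sample documents are illustrative):

    from collections import defaultdict

    def map_fn(key, value):          # key: document name, value: document contents
        for w in value.split():
            yield (w, 1)             # EmitIntermediate(w, 1)

    def reduce_fn(key, values):      # key: word, values: list of counts
        return sum(values)           # Emit(result)

    docs = {"doc1": "how now brown cow", "doc2": "how does it work now"}
    intermediate = defaultdict(list)
    for name, text in docs.items():
        for k, v in map_fn(name, text):
            intermediate[k].append(v)    # the framework's shuffle/group step
    counts = {k: reduce_fn(k, vs) for k, vs in intermediate.items()}
    print(counts)  # {'how': 2, 'now': 2, 'brown': 1, 'cow': 1, 'does': 1, 'it': 1, 'work': 1}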

  • Slide 17

    MapReduce Examples

    Distributed grep:
    Map function emits a line if the line matches the search criteria
    Reduce function is the identity function

    URL access frequency:
    Map function processes web logs and emits <URL, 1>
    Reduce function sums the values and emits <URL, total count>
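
    A sketch of the distributed grep pair in the same style (the search pattern and function names are assumed for illustration):

    import re

    PATTERN = re.compile(r"error")       # assumed search criteria

    def map_fn(filename, line):
        if PATTERN.search(line):
            yield (line, "")             # emit the matching line

    def reduce_fn(key, values):
        return key                       # identity: pass matching lines through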

  • Slide 18

    MapReduce: Programming Model

    More formally:
    Map(k1, v1) --> list(k2, v2)
    Reduce(k2, list(v2)) --> list(v2)
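
    For the word-count job these signatures instantiate as below (a sketch using Python type hints; the function names are illustrative):

    from typing import Iterator, List, Tuple

    # k1 = document name, v1 = document contents; k2 = word, v2 = count
    def map_fn(k1: str, v1: str) -> Iterator[Tuple[str, int]]: ...
    def reduce_fn(k2: str, v2s: List[int]) -> List[int]: ...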

  • Slide 19

    MapReduce Runtime System

    1. Partitions input data
    2. Schedules execution across a set of machines
    3. Handles machine failure
    4. Manages interprocess communication

  • Slide 20

    MapReduce Benefits

    Greatly reduces parallel programming complexity:
    Reduces synchronization complexity
    Automatically partitions data
    Provides failure transparency
    Handles load balancing

    Practical: approximately 1,000 Google MapReduce jobs run every day.

  • Slide 21

    Google Computing Environment

    Typical clusters contain 1000s of machines
    Dual-processor x86s running Linux, with 2-4 GB of memory
    Commodity networking: typically 100 Mbps or 1 Gbps
    IDE drives connected to individual machines
    Distributed file system

  • Slide 22

    How MapReduce Works

    User to-do list:
    Indicate:
    Input/output files
    M: number of map tasks
    R: number of reduce tasks
    W: number of machines
    Write the map and reduce functions
    Submit the job (see the sketch below)

    This requires no knowledge of parallel/distributed systems!!!

    What about everything else?
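
    As a sketch, that to-do list might be captured in a job specification like the following (submit_job and the field names are hypothetical, not a real library's API; the task counts echo typical values from the MapReduce paper):

    def map_fn(key, value): ...                 # user-written map function
    def reduce_fn(key, values): ...             # user-written reduce function

    def submit_job(spec):                       # hypothetical submit call (stub)
        print("submitted:", spec["M"], "map tasks,", spec["R"], "reduce tasks")

    spec = {
        "input_files": ["gfs://logs/part-*"],   # input files
        "output_dir": "gfs://out/wordcount",    # output files
        "M": 200000,                            # number of map tasks
        "R": 5000,                              # number of reduce tasks
        "W": 2000,                              # number of machines
        "map": map_fn,
        "reduce": reduce_fn,
    }
    submit_job(spec)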

  • Slide 23

    MapReduce Execution Overview

    1. The user program, via the MapReduce library, shards the input data.

    [Diagram: the user program splits the input data into Shard 0 through Shard 6.]

    * Shards are typically 16-64 MB in size

  • Slide 24

    Data Distribution

    Input files are split into M pieces on the distributed file system, typically ~64 MB blocks

    Intermediate files created by map tasks are written to local disk

    Output files are written to the distributed file system

  • Slide 25

    MapReduce Execution Overview

    2. The user program creates process copies distributed across a machine cluster. One copy becomes the Master; the others become workers.

    [Diagram: the user program forks a Master process and many Worker processes.]

  • Slide 26

    MapReduce Execution Overview

    3. The Master distributes M map tasks and R reduce tasks to idle workers.
    M == the number of shards
    R == the number of parts the intermediate key space is divided into

    [Diagram: the Master sends a Do_map_task message to an idle Worker.]
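
    Intermediate keys are assigned to the R reduce partitions by a partitioning function; the default described in the MapReduce paper is hash(key) mod R. A minimal sketch:

    R = 4  # number of reduce tasks

    def partition(key: str) -> int:
        # maps an intermediate key to one of the R reduce partitions
        return hash(key) % R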

  • Slide 27

    Assigning Tasks

    Many copies of the user program are started

    The Master tries to exploit data locality by running map tasks on machines that already hold the data

    One instance becomes the Master

    The Master finds idle machines and assigns them tasks

  • Slide 28

    MapReduce Execution Overview

    4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs.
    Output is buffered in RAM.

    [Diagram: a Map worker reads Shard 0 and emits key/value pairs.]

  • Slide 29

    MapReduce Execution Overview

    5. Each worker flushes its intermediate values, partitioned into R regions, to disk and notifies the Master process.

    [Diagram: the Map worker writes to local storage and reports the disk locations to the Master.]

  • Slide 30

    MapReduce Execution Overview

    6. The Master process gives the disk locations to an available reduce-task worker, which reads all of the associated intermediate data.

    [Diagram: the Master sends the disk locations to a Reduce worker, which reads the intermediate data from remote storage.]

  • Slide 31

    MapReduce Execution Overview

    7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.

    [Diagram: the Reduce worker sorts its data and appends to its partition output file.]
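
    A minimal sketch of this sort-and-group step for one partition of word-count data:

    from itertools import groupby

    def reduce_fn(key, values):                   # word-count reduce: sum the counts
        return sum(values)

    pairs = [("now", 1), ("how", 1), ("now", 1), ("how", 1)]  # one partition's intermediate data
    pairs.sort(key=lambda kv: kv[0])              # sort by intermediate key
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [v for _, v in group]            # all values for this unique key
        print(key, reduce_fn(key, values))        # prints: how 2, then now 2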

  • Slide 32

    MapReduce Execution Overview

    8. The Master process wakes up the user program when all tasks have completed. The output is contained in R output files.

    [Diagram: the Master wakes up the user program; the output is in the R output files.]

  • Slide 33

  • Slide 34

    Observations

    No reduce can begin until the map phase is complete

    Tasks are scheduled based on the location of the data

    If a map worker fails any time before the reduce finishes, its task must be completely rerun

    The Master must communicate the locations of intermediate files

    The MapReduce library does most of the hard work for us!

  • Slide 35

    [Diagram: MapReduce dataflow. Input key/value pairs from data store 1 through data store n feed the map tasks, each emitting (key 1, values...), (key 2, values...), (key 3, values...). == Barrier ==: the framework aggregates intermediate values by output key. Each reduce task receives one key with its intermediate values and produces the final values for key 1, key 2, key 3, ...]

  • Slide 36

    Fault Tolerance

    Workers are periodically pinged by the master; no response = failed worker

    Map-task failure:
    Re-execute; all of its output was stored locally

    Reduce-task failure:
    Only re-execute partially completed tasks; all output is stored in the global file system

    The Master writes periodic checkpoints

  • Slide 37

    Fault Tolerance

    On errors, workers send a last-gasp UDP packet to the master
    The master detects records that cause deterministic crashes and skips them

    Input file blocks are stored on multiple machines

    When the computation is almost done, in-progress tasks are rescheduled, which avoids stragglers

  • Slide 38

    Conclusions

    Simplifies large-scale computations that fit this model

    Allows the user to focus on the problem without worrying about details; the particular computer architecture is not very important

    Portable model

  • Slide 39

    MapReduce Applications

    Relational operations using MapReduce

  • Slide 40

    Relational operations using MapReduce

    Enterprise applications rely on structured data processing

    The same is true of the relational data model and SQL

    Parallel databases support parallel execution

    Drawback: they lack scale and fault tolerance

    MapReduce provides both

  • Slide 41

    ..

    A relational join can be executed in parallel using MapReduce

    E.g. given a sales table and a city table, compute the gross sales by city (see the sketch below)
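
    A hedged sketch of this as a reduce-side join (the table layouts and sample rows are assumed for illustration):

    from collections import defaultdict

    sales  = [("c1", 100.0), ("c1", 50.0), ("c2", 75.0)]  # (city_id, sale amount)
    cities = [("c1", "Pune"), ("c2", "Mumbai")]           # (city_id, city name)

    def map_sales(city_id, amount):
        yield (city_id, ("sale", amount))     # tag records from the sales table

    def map_cities(city_id, name):
        yield (city_id, ("city", name))       # tag records from the city table

    def reduce_join(city_id, values):
        name = next(v for tag, v in values if tag == "city")
        total = sum(v for tag, v in values if tag == "sale")
        return (name, total)                  # gross sales for this city

    grouped = defaultdict(list)               # simulates the shuffle/group step
    for row in sales:
        for k, v in map_sales(*row):
            grouped[k].append(v)
    for row in cities:
        for k, v in map_cities(*row):
            grouped[k].append(v)

    print([reduce_join(k, vs) for k, vs in grouped.items()])
    # [('Pune', 150.0), ('Mumbai', 75.0)]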

  • Slide 42

    Relational operations using MapReduce..

  • Slide 43

    ..

  • Slide 44

    Enterprise Batch Processing using MapReduce

    Enterprise context: there is interest in leveraging the MapReduce model for high-throughput batch processing and analysis of data

  • Slide 45

    Batch processing operations

    End-of-day processing

    Needs to access and compute over large datasets

    Time bound
    Constraint: online availability of the transaction processing system

    Opportunity to accelerate batch processing

  • Slide 46

    Example: revalue customer portfolios
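
    A hedged sketch of how such a revaluation job could fit the model (the record layout and prices are assumed; this is illustrative, not the deck's actual example). Map over position records keyed by portfolio; reduce sums the revalued positions per portfolio:

    prices = {"INFY": 1500.0, "TCS": 3200.0}       # assumed end-of-day prices

    def map_fn(portfolio_id, position):
        symbol, qty = position                      # position = (symbol, quantity)
        yield (portfolio_id, qty * prices[symbol])  # revalued position

    def reduce_fn(portfolio_id, values):
        return sum(values)                          # new total portfolio value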

  • Slide 47

    References

    Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"

    Josh Carter, http://multipart-mixed.com/software/mapreduce_presentation.pdf

    Ralf Lammel, "Google's MapReduce Programming Model Revisited"

    http://code.google.com/edu/parallel/mapreduce-tutorial.html