
Page 1: Processing of Big Data

Processing of Big Data

Towards Efficient MapReduce Using MPI

MapReduce:Simplified Data Processing On Large Clusters

Su Wu [email protected]

Computational Engineering 21762068

Pu Li [email protected]

Information and Communication Technology 21888302

Page 2: Processing of Big Data


Introduction of MPI

Message Passing Interface

Example

Page 3: Processing of Big Data


#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
    char idstr[32];
    char buff[BUFSIZE];
    int numprocs;
    int myid;
    int i;
    MPI_Status stat;

    /* MPI programs start with MPI_Init; all 'N' processes exist thereafter */
    MPI_Init(&argc, &argv);
    /* find out how big the SPMD world is */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    /* and what this process's rank is */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* At this point, all processes are running equivalently; the rank
       distinguishes the roles of the processes in the SPMD model, with
       rank 0 often used specially... */
    if (myid == 0)
    {
        printf("%d: We have %d processors\n", myid, numprocs);
        for (i = 1; i < numprocs; i++)
        {
            sprintf(buff, "Hello %d! ", i);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        for (i = 1; i < numprocs; i++)
        {
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
            printf("%d: %s\n", myid, buff);
        }
    }
    else
    {
        /* receive from rank 0: */
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
        sprintf(idstr, "Processor %d ", myid);
        strncat(buff, idstr, BUFSIZE - strlen(buff) - 1);
        strncat(buff, "reporting for duty", BUFSIZE - strlen(buff) - 1);
        /* send to rank 0: */
        MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }

    /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
    MPI_Finalize();
    return 0;
}

MPI Example

When run with two processors, this gives the following output:

0: We have 2 processors

0: Hello 1! Processor 1 reporting for duty

Page 4: Processing of Big Data


int MPI_Send (
    message,      /* buffer holding the information to send */
    length,       /* length of the message */
    datatype,     /* MPI datatype of the message */
    destination,  /* rank of the process receiving the message */
    tag,          /* tag helps sort messages; an int matching the one in MPI_Recv */
    MPI_Comm      /* communicator, almost always MPI_COMM_WORLD */
)

int MPI_Recv (
    message,      /* buffer receiving the information */
    length,       /* length of the message */
    datatype,     /* MPI datatype of the message */
    source,       /* rank of the process sending the message */
    tag,          /* tag helps sort messages; an int matching the one in MPI_Send */
    MPI_Comm,     /* communicator, almost always MPI_COMM_WORLD */
    Status        /* a data structure that contains info on what was received */
)

Syntax for MPI_Send and MPI_Recv

Page 5: Processing of Big Data


Introduction of MapReduce

A programming model for data-parallel applications

Key functionality: Map and Reduce

MapReduce Communication Scheme

Example

Page 6: Processing of Big Data

[Figure: MapReduce Communication Scheme. Four processing elements (PE0-PE3) each run map over their part of the input, emit (k, v) pairs, perform a local combine, and exchange the pairs so that each PE reduces one key (k1-k4).]

MapReduce Communication Scheme

Page 7: Processing of Big Data

Example: Consider the problem of counting the number of occurrences of the letters 'a', 'b', and 'c' in a large collection of documents.

“…all… boys… could……and…but…process……map…sub…cut…”

[Figure: three PEs each map their part of the text to the counts a|1, b|1, c|1; three reduce tasks, one per letter, each produce Output 3.]
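To make the example concrete, the following is a minimal sequential C sketch of the stages pictured above: each "map" call counts the letters in one chunk of text (the per-PE local combine), and "reduce" sums one letter's counts across chunks. The function names and the hard-coded chunks are illustrative assumptions, not part of the slides or of Google's implementation.

#include <stdio.h>

#define NCHUNKS  3
#define NLETTERS 3                     /* the letters 'a', 'b' and 'c' */

/* map + local combine: emit one count per letter for this chunk */
static void map_chunk(const char *text, long counts[NLETTERS])
{
    counts[0] = counts[1] = counts[2] = 0;
    for (; *text; ++text) {
        if (*text == 'a') counts[0]++;
        else if (*text == 'b') counts[1]++;
        else if (*text == 'c') counts[2]++;
    }
}

/* reduce: sum the counts of one letter over all chunks */
static long reduce_letter(long counts[][NLETTERS], int nchunks, int letter)
{
    long sum = 0;
    for (int i = 0; i < nchunks; i++)
        sum += counts[i][letter];
    return sum;
}

int main(void)
{
    const char *chunks[NCHUNKS] = {
        "all boys could", "and but process", "map sub cut"
    };
    long counts[NCHUNKS][NLETTERS];

    for (int i = 0; i < NCHUNKS; i++)           /* map phase, one chunk per PE */
        map_chunk(chunks[i], counts[i]);

    for (int l = 0; l < NLETTERS; l++)          /* reduce phase, one letter per task */
        printf("%c: %ld\n", 'a' + l, reduce_letter(counts, NCHUNKS, l));
    return 0;
}

Running it prints a: 3, b: 3, c: 3, matching the three "Output 3" results in the figure.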

Page 8: Processing of Big Data


MPI Implementation

Non-Blocking Collectives

Performance Results

Conclusion

Towards Efficient MapReduce Using MPI

Page 9: Processing of Big Data

MPI Implementation

Two common optimization possibilities:

(1) collective operations (a sketch follows below)

(2) overlapping communication and computation
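As a sketch of optimization (1): the fan-in of results to rank 0 done with a loop of MPI_Send/MPI_Recv pairs in the hello-world example can be replaced by a single collective call. The program below is a minimal illustration assuming each rank already holds its local per-letter counts; the placeholder values are only there to make it runnable.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid;
    long local_counts[3] = {1, 1, 1};   /* per-rank counts for 'a', 'b', 'c' (placeholders) */
    long global_counts[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* One collective call replaces the loop of point-to-point messages:
       every rank contributes its local counts and rank 0 receives the sums. */
    MPI_Reduce(local_counts, global_counts, 3, MPI_LONG, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("a=%ld b=%ld c=%ld\n",
               global_counts[0], global_counts[1], global_counts[2]);

    MPI_Finalize();
    return 0;
}

Besides being shorter, the collective lets the MPI library pick an efficient reduction algorithm (for example a tree) instead of serializing all messages through rank 0.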

Page 10: Processing of Big Data

[Figure: per-process timeline showing work, wait, and message phases under blocking communication.]

Original MPI Implementation

Page 11: Processing of Big Data

[Figure: per-process timeline with non-blocking collectives; the message phase overlaps with work, shrinking the wait phases.]

Using Nonblocking Collectives
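The corresponding code change is small. Reusing the variables from the MPI_Reduce sketch on the previous page, the MPI-3 non-blocking collective MPI_Iallreduce starts the reduction, independent computation runs while it progresses, and MPI_Wait completes it just before the result is needed. Unlike MPI_Reduce, every rank receives the summed counts; do_independent_work is a hypothetical stand-in for computation that does not depend on the result.

    MPI_Request req;

    /* start the collective early ... */
    MPI_Iallreduce(local_counts, global_counts, 3, MPI_LONG, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    do_independent_work();              /* ... and overlap it with useful work */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* global_counts is valid only after this */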

Page 12: Processing of Big Data

[Figure: (a) Communication Overhead of Static Workload; (b) Time to Solution of Dynamic Workload]

Performance Results

Page 13: Processing of Big Data

What is missing in MPI:

Default error handling

Support for higher-level languages

MapReduce and MPI are traditionally somewhat disjoint

Nonblocking collective operations give a 25% speedup

Conclusion

Page 14: Processing of Big Data

MapReduce: Simplified Data Processing On Large Clusters

Execution of MapReduce in a cluster-based environment

Some concepts in the execution of MapReduce

Performance of MapReduce

Conclusion

Page 15: Processing of Big Data

[Figure: the user program forks a master and worker processes (1); the master assigns map and reduce tasks to workers (2); map workers read the input splits (3) and write intermediate files to their local disks (4); reduce workers write the final output files (6), one per reduce task.]

Execution Overview

Page 16: Processing of Big Data

Some concepts in the execution of MapReduce

Master Data Structures (see the sketch after this list)

Fault Tolerance (Handling Worker Failures)

Locality

Backup Tasks
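For the Master Data Structures point above, the paper describes the master as storing, for each map and reduce task, its state (idle, in-progress, or completed) and the identity of the worker executing it, plus the locations and sizes of the intermediate regions produced by completed map tasks. The following is a minimal C sketch of that bookkeeping; the type and field names and the fixed R are invented for illustration.

#define R 4                              /* number of reduce tasks (illustrative) */

enum task_state { TASK_IDLE, TASK_IN_PROGRESS, TASK_COMPLETED };

struct map_task {
    enum task_state state;
    int worker_id;                       /* worker running the task, if not idle */
    char region_path[R][128];            /* locations of the R intermediate regions */
    long region_size[R];                 /* sizes of those regions */
};

struct reduce_task {
    enum task_state state;
    int worker_id;
};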

Page 17: Processing of Big Data

Performance of MapReduce

Grep: looks for a particular pattern through 1 terabyte of data

Sort: sorts 1 terabyte of data

Page 18: Processing of Big Data

Grep: looks for a particular pattern through 1 terabyte of data

[Figure: data transfer rate over time]

M = 15000, R = 1

Page 19: Processing of Big Data

Sort: sorts 1 terabyte of data

[Figure: data transfer rate over time]

M = 15000, R = 4000

Page 20: Processing of Big Data

Conclusion

Advantages of the MapReduce model:

easy to use

many problems are easily expressible in the MapReduce model

needs far fewer machines

Successfully used at Google for many different purposes:

Processing of satellite imagery data

large-scale graph computations

clustering problems for the Google News and Froogle products