Processing of Big Data
Towards Efficient MapReduce Using MPI
MapReduce:Simplified Data Processing On Large Clusters
Su Wu wusu5545@qq.com
Computational Engineering 21762068
Pu Li lipujethro@gmail.com
Information und Kommunikationstechnik 21888302
Introduction to MPI
Message Passing Interface
Example
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
    char idstr[32];
    char buff[BUFSIZE];
    int numprocs;
    int myid;
    int i;
    MPI_Status stat;

    /* MPI programs start with MPI_Init; all 'N' processes exist thereafter */
    MPI_Init(&argc, &argv);
    /* find out how big the SPMD world is */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    /* and this process's rank within it */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* At this point, all programs are running equivalently; the rank
       distinguishes the roles of the programs in the SPMD model, with
       rank 0 often used specially... */
    if (myid == 0) {
        printf("%d: We have %d processors\n", myid, numprocs);
        for (i = 1; i < numprocs; i++) {
            sprintf(buff, "Hello %d! ", i);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        for (i = 1; i < numprocs; i++) {
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
            printf("%d: %s\n", myid, buff);
        }
    } else {
        /* receive from rank 0: */
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
        sprintf(idstr, "Processor %d ", myid);
        /* append without overflowing buff: pass the remaining space, not the total size */
        strncat(buff, idstr, BUFSIZE - 1 - strlen(buff));
        strncat(buff, "reporting for duty", BUFSIZE - 1 - strlen(buff));
        /* send to rank 0: */
        MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }

    /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
    MPI_Finalize();
    return 0;
}
MPI Example
When run with two processors, this gives the following output:
0: We have 2 processors
0: Hello 1! Processor 1 reporting for duty
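A typical way to build and run the example (a sketch; exact commands depend on the MPI installation, e.g. Open MPI or MPICH, and hello.c is an assumed file name):

mpicc hello.c -o hello
mpirun -np 2 ./hello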
int MPI_Send(
    void *message,         /* buffer holding the data to send */
    int count,             /* number of elements in the message */
    MPI_Datatype datatype, /* MPI datatype of the elements */
    int destination,       /* rank of the process receiving the message */
    int tag,               /* tag used to match messages; must match the tag in MPI_Recv */
    MPI_Comm comm          /* communicator, almost always MPI_COMM_WORLD */
);

int MPI_Recv(
    void *message,         /* buffer for the received data */
    int count,             /* maximum number of elements to receive */
    MPI_Datatype datatype, /* MPI datatype of the elements */
    int source,            /* rank of the process sending the message */
    int tag,               /* tag used to match messages; must match the tag in MPI_Send */
    MPI_Comm comm,         /* communicator, almost always MPI_COMM_WORLD */
    MPI_Status *status     /* structure with information about what was received */
);
Syntax for MPI_Send and MPI_Recv
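As a small aside (my sketch, not from the slides), the status structure filled in by MPI_Recv can be queried with MPI_Get_count to learn how much data actually arrived and from whom:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char buff[128];
    int myid, received;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {
        /* send fewer elements than the receiver's maximum of 128 */
        MPI_Send("hello", 6, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (myid == 1) {
        MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        /* stat records the source, tag, and actual message size */
        MPI_Get_count(&stat, MPI_CHAR, &received);
        printf("received %d chars from rank %d\n", received, stat.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}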
Introduction to MapReduce
A programming model for data-parallel applications
Key functionality: Map and Reduce
MapReduce Communication Scheme
Example
[Figure: four processes PE0-PE3 each map their local data to (k, v) pairs and apply a local combine; the combined pairs are then exchanged so that each PE reduces a single key (k1 on PE0, k2 on PE1, k3 on PE2, k4 on PE3).]
MapReduce Communication Scheme
Example: Consider the problem of counting the number of occurrences of the letters 'a', 'b', and 'c' in a large collection of documents.
[Figure: the input text "…all… boys… could… …and… but… process… …map… sub… cut…" is split across three processes PE1-PE3; each runs map and emits the pairs a|1, b|1, and c|1 for its chunk; one reducer per letter ('a', 'b', 'c') sums its pairs, and each outputs the count 3.]
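A minimal sketch of this letter-counting example on top of MPI (my illustration, not code from the slides; the chunk contents are chosen to match the figure, and the collective MPI_Reduce with MPI_SUM stands in for the three per-letter reducers):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* each rank owns one document chunk; hard-coded here to mirror the figure */
    const char *chunks[] = { "all boys could", "and but process", "map sub cut" };
    int counts[3] = {0, 0, 0};   /* local occurrences of 'a', 'b', 'c' */
    int totals[3] = {0, 0, 0};   /* global sums, valid on rank 0 */
    int myid, numprocs;
    const char *p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    /* map phase: scan the local chunk and emit local counts */
    for (p = chunks[myid % 3]; *p; p++) {
        if (*p == 'a') counts[0]++;
        else if (*p == 'b') counts[1]++;
        else if (*p == 'c') counts[2]++;
    }

    /* reduce phase: sum the per-rank counts onto rank 0 */
    MPI_Reduce(counts, totals, 3, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("a: %d, b: %d, c: %d\n", totals[0], totals[1], totals[2]);

    MPI_Finalize();
    return 0;
}

Run with three processes, this prints a: 3, b: 3, c: 3 as in the figure; the counting loop plays the role of map plus local combine, and MPI_Reduce is the reduce step.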
MPI Implementation
Non-Blocking Collectives
Performance Results
Conclusion
Towards Efficient MapReduce Using MPI
MPI Implementation
Two common optimization possibilities
(1) collective operations
(2) overlapping communication and computation
[Figure: timeline of the MPI ranks in the original implementation; each rank alternates between work, waiting, and message transfer, so communication is not overlapped with computation.]
Original MPI Implementation
[Figure: the same timeline using nonblocking collectives; ranks keep working while messages are in flight, so communication overlaps computation and the wait phases shrink.]
Using Nonblocking Collectives
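As a sketch of how such overlap can be expressed (my illustration, assuming the MPI-3 nonblocking collective MPI_Ialltoall; the paper's actual code is not reproduced on the slides):

#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* elements exchanged per process pair (arbitrary here) */

int main(int argc, char *argv[])
{
    int numprocs;
    int *sendbuf, *recvbuf;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    sendbuf = calloc((size_t)N * numprocs, sizeof(int));
    recvbuf = calloc((size_t)N * numprocs, sizeof(int));

    /* start the all-to-all exchange of intermediate data ... */
    MPI_Ialltoall(sendbuf, N, MPI_INT, recvbuf, N, MPI_INT,
                  MPI_COMM_WORLD, &req);

    /* ... and keep computing (e.g., map/combine the next block)
       while the messages are in flight */

    /* block only when the received data is actually needed */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* the reduce step over recvbuf would run here */

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}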
[Figure: (a) communication overhead of the static workload; (b) time to solution of the dynamic workload.]
Performance Results
What MPI lacks:
Default error handling
Support for higher-level languages
MapReduce and MPI are traditionally somewhat disjoint
Nonblocking collective operations yield a 25% speedup
Conclusion
MapReduce: Simplified Data Processing on Large Clusters
Execution of MapReduce in a cluster-based environment
Key concepts in the execution of MapReduce
Performance of MapReduce
Conclusion
[Figure: execution overview. (1) The user program forks a master and worker processes; (2) the master assigns map and reduce tasks to idle workers; (3) map workers read the input splits 0-4; (4) they write intermediate files to their local disks; (5) reduce workers read the intermediate data; (6) they write the output files 0 and 1.]
Execution Overview
Key concepts in the execution of MapReduce
Master Data Structures (see the sketch after this list)
Fault Tolerance (Handling Worker Failures)
Locality
Backup Tasks
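For the master data structures, the paper describes the master as storing, for each map and reduce task, its state (idle, in-progress, or completed) and the identity of the worker machine executing it, along with the locations of the intermediate files. A minimal sketch in C (my illustration; the names are made up):

enum task_state { TASK_IDLE, TASK_IN_PROGRESS, TASK_COMPLETED };

struct task {
    enum task_state state; /* idle, in-progress, or completed */
    int worker_id;         /* worker machine assigned to the task, if any */
};

/* the master keeps one such record per map task and per reduce task,
   and forwards the locations of completed map tasks' intermediate
   files to the reduce workers */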
Performance of MapReduce
Grep: searches for a particular pattern through 1 terabyte of data
Sort: sorts 1 terabyte of data
Grep: searches for a particular pattern through 1 terabyte of data (M = 15,000 map tasks, R = 1 reduce task)
[Figure: data transfer rate over time.]
Sort: sorts 1 terabyte of data (M = 15,000 map tasks, R = 4,000 reduce tasks)
[Figure: data transfer rate over time.]
Conclusion
Advantages of the MapReduce model:
easy to use
many problems are easily expressible as MapReduce computations
needs far fewer machines
…
Successfully used at Google for many different purposes:
Processing of satellite imagery data
large-scale graph computations
clustering problems for the Google News and Froogle products
…