
Terasort Using SAGA-MapReduce

Given by: Sharath Maddineni

CCT: Center for Computation & Technology

Why Terasort?

• Sorting large datasets is a common task in scientific computing.

• Google processes around 20 Petabytes of data per day using MapReduce.

• Sorting huge datasets of web pages, as Google may do, makes searching and retrieval faster.


Introduction

• Sort Benchmark (http://sortbenchmark.org/)

• Google won the 2010 competition; Yahoo's Hadoop won in 2009.

• However, Google's sort is tied to the Google File System (GFS), and Yahoo's is tied to the Hadoop Distributed File System (HDFS).

• SAGA-MapReduce is infrastructure independent.



SAGA MapReduce Execution Overview

1. The master is started from an executable linked against SAGA-MapReduce and creates an advert directory.

2. The master uses the InputFormat specified in the JobDescription to split the input data into chunks.

3. The master spawns workers on the host machines listed in the configuration file, using the SAGA Job API.

4. Each worker writes its status information to the advert directory and communicates with the master through this advert service.

5. Workers process the chunks assigned by the master using Map() and partition the output according to the partition function.

6. When all chunks have been mapped, the master moves to the reduce phase.

7. In the reduce phase, the master assigns sets of partitions to idle workers to be reduced.
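The chunk, map, partition, and reduce steps above can be sketched in a single process. This is a toy illustration in plain Python, not the SAGA-MapReduce C++ API: there is no advert service and no remote worker, and every name in it (run_job, chunk_input, the word-count functions) is hypothetical.

```python
from collections import defaultdict

def chunk_input(records, chunk_size):
    """Step 2: split the input into fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def run_job(records, map_fn, partition_fn, reduce_fn, num_partitions, chunk_size):
    """Single-process stand-in for the master and its workers (assumption)."""
    # Step 5: map every chunk and route each (key, value) pair to a partition.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for chunk in chunk_input(records, chunk_size):
        for record in chunk:
            for key, value in map_fn(record):
                partitions[partition_fn(key, num_partitions)][key].append(value)
    # Steps 6-7: only after all chunks are mapped, reduce partition by partition.
    results = {}
    for part in partitions:
        for key, values in part.items():
            results[key] = reduce_fn(key, values)
    return results

# Hypothetical word-count job used only to exercise the flow.
def word_map(line):
    return [(word, 1) for word in line.split()]

def word_partition(key, num_partitions):
    # Deterministic toy partitioner based on the first character of the key.
    return ord(key[0]) % num_partitions

def word_reduce(key, values):
    return sum(values)

counts = run_job(["a b a", "b c", "a"], word_map, word_partition, word_reduce,
                 num_partitions=2, chunk_size=2)
print(counts)
```

In the real system the two loops run on separate worker processes, and the master only coordinates which chunk or partition each idle worker takes next.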




Terasort

• The Sort Benchmark provides a “gensort” program to generate data records.

• Data format
– Each record is 100 bytes of ASCII: a 10-byte random key, and the rest is the value.
– 10^10 100-byte records make up one terabyte of data.

• All the records are sorted according to this 10-byte key.
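As a minimal sketch of the record layout described above, the snippet below splits a byte string into 100-byte records, sorts them by the first 10 bytes, and checks the terabyte arithmetic. It sorts in memory and is not how SAGA-MapReduce does it; make_record is a hypothetical helper that builds demo records rather than calling gensort.

```python
RECORD_SIZE = 100  # bytes per record, from the slide
KEY_SIZE = 10      # leading bytes that form the sort key

def split_records(data: bytes):
    """Cut a byte string into consecutive 100-byte records."""
    if len(data) % RECORD_SIZE != 0:
        raise ValueError("input is not a whole number of 100-byte records")
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]

def record_key(record: bytes) -> bytes:
    """The first 10 bytes of a record are its key."""
    return record[:KEY_SIZE]

def sort_records(data: bytes) -> bytes:
    """Sort whole records by their 10-byte key (in-memory sketch only)."""
    return b"".join(sorted(split_records(data), key=record_key))

def make_record(key: bytes) -> bytes:
    # Hypothetical demo helper: pads the value with 'x' instead of using gensort.
    if len(key) != KEY_SIZE:
        raise ValueError("key must be exactly 10 bytes")
    return key + b"x" * (RECORD_SIZE - KEY_SIZE)

demo = make_record(b"MMMMMMMMMM") + make_record(b"AAAAAAAAAA") + make_record(b"ZZZZZZZZZZ")
sorted_keys = [record_key(r) for r in split_records(sort_records(demo))]

# 10^10 records of 100 bytes each is 10^12 bytes, i.e. one (decimal) terabyte.
total_bytes = 10**10 * RECORD_SIZE
```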


Terasort SAGA Map-Reduce

• Works like plain SAGA-MapReduce, except that the partition list is generated before the master is launched.

• The generated partition list ensures that each key emitted in the map phase goes into the partition that covers its range.

• This will spread the keys evenly across all the partitions.
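The slides do not say how the partition list is built. One common approach, assumed here, is to sample the keys, sort the sample, and take evenly spaced entries as split points; a binary search then maps each key to its range. The function names below are hypothetical.

```python
import bisect
import random

def build_partition_list(sample_keys, num_partitions):
    """Pick num_partitions - 1 split keys at even positions in a sorted sample."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]

def partition_for(key, split_keys):
    """Index of the range a key falls into; keys below the first split go to 0."""
    return bisect.bisect_right(split_keys, key)

# Demo with random 10-byte printable-ASCII keys (stand-ins for gensort keys).
rng = random.Random(0)
keys = [bytes(rng.randrange(32, 127) for _ in range(10)) for _ in range(10000)]
splits = build_partition_list(rng.sample(keys, 500), num_partitions=5)

counts = [0] * 5
for k in keys:
    counts[partition_for(k, splits)] += 1
print(counts)
```

Because the ranges are ordered, every key in partition i sorts before every key in partition i + 1, so sorting each partition separately and reading the partitions in order gives a globally sorted result.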


Distributed Workers for Terasort

• The Cyder and Cyd01 machines are used as workers.

• Prerequisites:
– Passwordless SSH login from the master machine to the worker machines.
– FUSE-mount (fusermount) the input and output data locations on each machine.


Results


• X-axis: data set size in MB
• Y-axis: time to solution in seconds

• Increasing input data size
• Constant number of workers (3), with both master and workers on Cyd01

Operating system: Red Hat 5.5
Architecture: x86_64
Memory: 8 GB
CPU type: Dual-Core AMD Opteron
Compiler version: gcc 4.4.3
Boost version: 1.40

Results cont…

• Constant input file size (400 MB, 6 chunks, 5 partitions)
• Increasing number of workers

• X-axis: number of workers
• Y-axis: time to solution in seconds

Operating system: Ubuntu 10.04
Architecture: x86_64 AMD
Memory: 63 GB
CPU type: 6-Core AMD Opteron
Compiler version: gcc 4.4.3
Boost version: 1.40

Results cont…

• Distributed workers (2 workers, 1 chunk (10 MB), 5 partitions)
• Cyd01 and Cyder are used

Case 1: Master, workers, and data on the same machine
Case 2: Remote master; data and workers on the same machine
Case 3: Remote master; remote data for one worker and local data for the other
Case 4: Remote master; remote data for all workers

• X-axis: cases
• Y-axis: time to solution in seconds

SAGA Map-Reduce Usability

• Usable by anyone with some familiarity with C++ and SAGA and prior knowledge of MapReduce.

• Sufficiently documented. However, some important details about mounting the input and output locations for distributed runs were missing.

• Tested on
– RHEL 4,5 and Ubuntu 10.04
– SAGA 1.4.1 and 1.5
– Boost version 1.40


Future Work

• Currently, SAGA-MapReduce only supports launching workers by forking on localhost or over SSH.

• SAGA-BigJob can be used to launch the workers instead.
– This helps in running MapReduce distributed over LONI machines.
– However, mounting directories is a problem over LONI.


Thank You

