Terasort Using SAGA-MapReduce Given by: Sharath Maddineni CCT: Center for Computation & Technology


Post on 12-Jan-2016


Page 1

Terasort Using SAGA-MapReduce

Given by: Sharath Maddineni

CCT: Center for Computation & Technology

Page 2

Why Terasort?

• Sorting large datasets is a common task in scientific computations.

• Google processes around 20 petabytes of data per day using MapReduce.

• Google may therefore sort huge datasets of web pages, since sorted data makes searching and retrieval faster.

Page 3

Introduction

• Sort Benchmark (http://sortbenchmark.org/)

• Google won the 2010 competition; Yahoo's Hadoop won in 2009.

• However, Google's sort is limited to the Google File System (GFS), and Yahoo's is tied to the Hadoop Distributed File System (HDFS).

• SAGA-MapReduce is infrastructure independent.

Page 4

SAGA MapReduce Execution Overview

1. The master is started from an executable linked against SAGA-MapReduce and creates an advert directory.

2. The master looks up the InputFormat specified in the JobDescription to chunk the input data.

3. The master spawns workers on the host machines specified in the configuration file using the SAGA Job API.

4. Each worker puts its status information into the advert directory and communicates with the master through this advert service.

5. Workers process the chunks assigned by the master using Map() and partition the data according to the partition function.

6. When all chunks have been mapped, the master moves to the reduce phase.

7. In the reduce phase, the master assigns sets of partitions to idle workers to be reduced.
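
The flow of steps 2 and 5 to 7 can be illustrated with a small single-process sketch in Python. This is not SAGA-MapReduce code (which is C++ and distributes the work through the SAGA Job API and the advert service); the function names `chunk_input`, `map_chunk`, and `run_job`, and the word-count job used to drive them, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def chunk_input(records, chunk_size):
    """Step 2: split the input into fixed-size chunks (stand-in for an InputFormat)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def map_chunk(chunk, map_fn, partition_fn, num_partitions):
    """Step 5: apply map_fn to each record and route every emitted pair to a partition."""
    partitions = defaultdict(list)
    for record in chunk:
        for key, value in map_fn(record):
            partitions[partition_fn(key, num_partitions)].append((key, value))
    return partitions

def run_job(records, map_fn, reduce_fn, partition_fn, chunk_size, num_partitions):
    """Master loop: map every chunk first, then reduce every partition."""
    merged = defaultdict(list)
    for chunk in chunk_input(records, chunk_size):
        mapped = map_chunk(chunk, map_fn, partition_fn, num_partitions)
        for pid, pairs in mapped.items():
            merged[pid].extend(pairs)
    results = {}
    # Steps 6-7: the reduce phase starts only after all chunks are mapped.
    for pid in sorted(merged):
        grouped = defaultdict(list)
        for key, value in merged[pid]:
            grouped[key].append(value)
        for key, values in grouped.items():
            results[key] = reduce_fn(key, values)
    return results

def word_map(line):
    """Emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]

def word_reduce(key, values):
    """Sum the counts collected for one word."""
    return sum(values)

def byte_sum_partition(key, num_partitions):
    """Deterministic partition function: sum of the key's bytes modulo the partition count."""
    return sum(key.encode()) % num_partitions

counts = run_job(["a b a", "b c", "a"], word_map, word_reduce,
                 byte_sum_partition, chunk_size=2, num_partitions=3)
```

In the real system each chunk and each partition would be handed to a separate worker process; here the loops simply stand in for that assignment.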

Page 6

Terasort

• The Sort Benchmark provides a “Gensort” program to generate data records.

• Data format

– Each record is 100 bytes of ASCII: a 10-byte random key followed by a 90-byte value.

– 10^10 100-byte records make up a terabyte of data.

• All records are sorted according to the 10-byte key.
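
As a rough illustration of the record layout and the sort key, the Python sketch below builds synthetic 100-byte records and sorts them on their first 10 bytes. It does not reproduce Gensort's actual output; `make_record`, the key alphabet, and the constant filler value are assumptions made for the example.

```python
import random
import string

RECORD_SIZE = 100  # bytes per record, as stated on the slide
KEY_SIZE = 10      # leading bytes that form the sort key

def make_record(rng):
    """Build one 100-byte ASCII record: 10-byte random key plus a 90-byte filler value."""
    alphabet = string.ascii_letters + string.digits
    key = "".join(rng.choice(alphabet) for _ in range(KEY_SIZE))
    value = "x" * (RECORD_SIZE - KEY_SIZE)  # hypothetical payload
    return (key + value).encode("ascii")

def record_key(record):
    """Return the 10-byte key that determines the sort order."""
    return record[:KEY_SIZE]

rng = random.Random(0)
records = [make_record(rng) for _ in range(1000)]
sorted_records = sorted(records, key=record_key)

# One terabyte (10^12 bytes) of 100-byte records is 10^10 records.
records_per_terabyte = 10**12 // RECORD_SIZE
```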

Page 7

Terasort SAGA Map-Reduce

• Similar to plain SAGA-MapReduce, except that the partition list is generated before the master is launched.

• The generated partition list ensures that each key emitted in the map phase goes into the partition that covers its range.

• This spreads the keys evenly across all the partitions.
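
One way to picture such a partition list is as a sorted list of split keys, with each key routed by binary search. The Python sketch below assumes the split keys are taken from a random sample of the input; the slides do not say how the list is actually generated, so the sampling step and the names `build_partition_list` and `partition_for` are illustrative assumptions.

```python
import bisect
import random

def build_partition_list(sample_keys, num_partitions):
    """Pick num_partitions - 1 split keys at evenly spaced positions in a sorted sample."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]

def partition_for(key, split_points):
    """Return the index of the partition whose key range contains key."""
    return bisect.bisect_right(split_points, key)

rng = random.Random(1)
# 10,000 random 10-byte printable-ASCII keys (synthetic data for the example).
keys = [bytes(rng.randrange(32, 127) for _ in range(10)) for _ in range(10000)]
splits = build_partition_list(rng.sample(keys, 500), 5)

partitions = [[] for _ in range(5)]
for key in keys:
    partitions[partition_for(key, splits)].append(key)

# Each partition covers a contiguous key range, so sorting every partition
# independently and concatenating them in order gives a globally sorted result.
output = [key for part in partitions for key in sorted(part)]
```

Because the ranges do not overlap, the reduce phase can sort each partition on its own without a final merge step.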

Page 9

Distributed Workers for Terasort

• The Cyder and Cyd01 machines are used as workers.

• Prerequisites:

– Passwordless SSH login from the master machine to the worker machines.

– FUSE-mount the input and output data locations on each machine.
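
The two prerequisites can be expressed as command lines. The Python sketch below only builds the commands and does not run them. Using `sshfs` as the FUSE file system is an assumption (the slides do not name the tool), and the paths `/data/terasort` and `/mnt/terasort` are hypothetical placeholders.

```python
import shlex

def ssh_check_command(host):
    """Command that succeeds only if non-interactive (key-based) SSH login to host works."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, "true"]

def sshfs_mount_command(host, remote_dir, mount_point):
    """Command that mounts remote_dir from host at mount_point using sshfs (assumed tool)."""
    return ["sshfs", f"{host}:{remote_dir}", mount_point]

workers = ["cyder", "cyd01"]
commands = [ssh_check_command(worker) for worker in workers]
# Hypothetical data location and mount point.
commands.append(sshfs_mount_command("cyd01", "/data/terasort", "/mnt/terasort"))

printable = [shlex.join(command) for command in commands]
```

Each entry of `commands` could be passed to `subprocess.run(command, check=True)` on the master machine to verify the setup before starting a job.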

Page 10

Results

• X-axis: data set size in MB

• Y-axis: time to solution in seconds

• Increasing input data size

• Constant number of workers (3), with both master and workers on Cyd01

Operating system: Red Hat 5.5

Architecture: x86_64

Memory: 8 GB

CPU type: Dual-Core AMD Opteron

Compiler version: gcc 4.4.3

Boost version: 1.40

Page 11

Results cont…

• Constant input file size (400 MB, 6 chunks, 5 partitions)

• Increasing number of workers

• X-axis: number of workers

• Y-axis: time to solution in seconds

Operating system: Ubuntu 10.04

Architecture: x86_64 AMD

Memory: 63 GB

CPU type: 6-Core AMD Opteron

Compiler version: gcc 4.4.3

Boost version: 1.40

Page 12

Results cont…

• Distributed workers (2 workers, 1 chunk (10 MB), 5 partitions)

• Cyd01 and Cyder are used

Case 1: Master, workers and data on the same machine

Case 2: Remote master; data and workers on the same machine

Case 3: Remote master; remote data for one worker and local data for the other

Case 4: Remote master; remote data for all workers

• X-axis: cases

• Y-axis: time to solution in seconds

Page 13

SAGA Map-Reduce Usability

• Usable by users who have some familiarity with C++ and SAGA and prior knowledge of MapReduce.

• Sufficiently documented. However, some important details about mounting the input and output locations for distributed runs were missing.

• Tested on

– RHEL 4 and 5, and Ubuntu 10.04

– SAGA 1.4.1 and 1.5

– Boost version 1.40

Page 14

Future Work

• Currently SAGA-MapReduce only supports launching workers by forking on localhost or through SSH.

• SAGA BigJob can be used to launch the workers instead

– Helps in running MapReduce distributed over LONI machines

– But mounting directories is a problem over LONI.

Page 15

Thank You
