A Hierarchical Framework for Cross-Domain MapReduce Execution
Yuan Luo1, Zhenhua Guo1, Yiming Sun1, Beth Plale1, Judy Qiu1, Wilfred W. Li 2 1 School of Informatics and Computing, Indiana University, Bloomington, IN, 47405
2 San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA, 92093
{yuanluo, zhguo, yimsun, plale, xqiu}@indiana.edu, [email protected]

ABSTRACT
The MapReduce programming model provides an easy way to execute pleasantly parallel applications. Many data-intensive life science applications fit this programming model and benefit from the scalability it can deliver. One such application is AutoDock, which consists of a suite of automated tools for predicting the bound conformations of flexible ligands to macromolecular targets. However, researchers also need sufficient computation and storage resources to fully enjoy the benefits of MapReduce. For example, a typical AutoDock based virtual screening experiment usually consists of a very large number of docking processes from multiple ligands and is often time consuming to run on a single MapReduce cluster. Although commercial clouds can provide virtually unlimited computation and storage resources on demand, many researchers, due to financial, security, and other concerns, still run experiments on a number of small clusters with limited numbers of nodes that cannot unleash the full power of MapReduce. In this paper, we present a hierarchical MapReduce framework that gathers computation resources from different clusters and runs MapReduce jobs across them. The global controller in our framework splits the data set and dispatches the partitions to multiple "local" MapReduce clusters, and balances the workload by assigning tasks in accordance with the capabilities of each cluster and of each node. The local results are then returned to the global controller for global reduction. Our experimental evaluation of AutoDock over MapReduce shows that our load-balancing algorithm produces a well-balanced workload distribution across multiple clusters and thus minimizes the overall execution time of the entire MapReduce job.
Categories and Subject Descriptors C.2.4 [Computer-Communication Networks]: Distributed Systems – Distributed applications.
General Terms: Design, Experimentation, Performance
Keywords: AutoDock, Cloud, FutureGrid, Hierarchical MapReduce, Multi-Cluster
1. INTRODUCTION
Life science applications are often both compute intensive and data intensive. They consume large amounts of CPU cycles while processing massive data sets that are either organized as large groups of small files or naturally splittable. These kinds of applications fit ideally into the MapReduce [2] programming model. MapReduce differs from the traditional HPC model in that it does not distinguish computation nodes from storage nodes, so each node is responsible for both computation and storage. Obvious advantages include better fault tolerance, scalability, and data-locality-aware scheduling. The MapReduce model has been applied to life science applications by many researchers. Qiu et al. [15] describe their work implementing various clustering algorithms using MapReduce.
AutoDock [13] is a suite of automated docking tools for predicting the bound conformations of flexible ligands to macromolecular targets. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure. Running AutoDock requires several pre-docking steps, e.g., ligand and receptor preparation and grid map calculations, before the actual docking process can take place. There are desktop GUI tools for processing the individual AutoDock steps, such as AutoDockTools (ADT) [13] and BDT [19], but they cannot efficiently process thousands to millions of docking processes. Ultimately, the goal of a docking experiment is to illustrate the docked result in the context of the macromolecule, explaining the docking in terms of the overall energy landscape. Each AutoDock calculation results in a docking log file containing information about the best docked ligand conformation found in each of the docking runs specified in the docking parameter file (dpf). The results can then be summarized interactively using desktop tools such as AutoDockTools or with a Python script. A typical AutoDock based virtual screening consists of a large number of docking processes over multiple targeted ligands and takes a large amount of time to finish. However, the docking processes are data independent, so if several CPU cores are available, these processes can be carried out in parallel to shorten the overall makespan of multiple AutoDock runs.
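Because the docking runs are mutually independent, they parallelize with no coordination between tasks. A minimal sketch of this (mine, for illustration; `dock` is a hypothetical stand-in for a real AutoDock invocation, and the energies it returns are fabricated from the ligand name):

```python
# Sketch of embarrassingly parallel docking: each ligand is scored by an
# independent worker, and no communication is needed between tasks.
from concurrent.futures import ThreadPoolExecutor

def dock(ligand):
    # hypothetical stand-in for one AutoDock run; returns (ligand, "energy")
    return ligand, -float(len(ligand))

ligands = ["lig1", "lig22", "lig333"]
with ThreadPoolExecutor(max_workers=3) as pool:
    energies = dict(pool.map(dock, ligands))
```

Scaling this pattern from a thread pool on one machine to map tasks across clusters is exactly the gap the rest of the paper addresses.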
Workflow based approaches can also be used to run multiple AutoDock instances; however, a MapReduce runtime can automate the data partitioning for parallel execution. Our paper therefore focuses on extending the MapReduce model for parallel execution of applications across multiple clusters.
Cloud computing can provide scalable computational and storage resources as needed. With the correct application model and implementation, clouds enable applications to scale out with relative ease. Because of the "pleasantly parallel" nature of the MapReduce programming model, it has become a popular model for deploying and executing applications in a cloud, and running multiple AutoDock jobs certainly fits MapReduce well. However, many researchers still shy away from clouds for various reasons. For example, some researchers may not feel comfortable letting their data sit in storage space shared with users worldwide, while others may have amounts of data and computation so large that moving them into the cloud would be financially prohibitive. It is more typical for a researcher to have access to several research clusters hosted at his/her lab or institute. These clusters usually consist of only a few nodes, and the nodes in one cluster may be very different from those in another cluster in various specifications, including CPU frequency, number of cores, cache size, memory size, and storage capacity. Commonly a MapReduce framework is deployed in a single cluster to run jobs, but any such individual cluster does not provide enough resources to deliver significant performance gains. For example, at Indiana University we have access to the IU Quarry, FutureGrid [5], and TeraGrid [17] clusters, but each cluster imposes a limit on the maximum number of nodes a user can use at any time. If these isolated clusters can work together, they collectively become much more powerful.
Unfortunately, users cannot directly deploy a MapReduce framework such as Hadoop on top of these clusters to form a single larger MapReduce cluster. Typically the internal nodes of a cluster are not directly reachable from outside, yet MapReduce requires the master node to communicate directly with any slave node, which is one of the reasons why MapReduce frameworks are usually deployed within a single cluster. Therefore, one challenge is to make multiple clusters act collaboratively as one so that MapReduce can run on them more efficiently. There are two possible approaches to address this challenge. One is to unify the underlying physical clusters as a single virtual cluster by adding a special infrastructure layer, and to run MapReduce on top of this virtual cluster. The other is to make the MapReduce framework itself work with multiple clusters directly, without needing an additional special infrastructure layer.
We propose a hierarchical MapReduce framework which takes the second approach to gather isolated cluster resources into a more capable one for running MapReduce jobs. Kavulya et al. characterize MapReduce jobs into four categories based on their execution patterns: map-only, map-mostly, shuffle-mostly, and reduce-mostly, and also find that 91% of the MapReduce jobs they have surveyed fall into the map-only and map-mostly categories [10]. Our framework partitions and distributes MapReduce jobs from these two categories (map-only and map-mostly) into multiple clusters to perform map-intensive computation, and collects and combines the outputs in the global node. Our framework also achieves load-balancing by assigning different task loads to different clusters based on the cluster size, current load, and specifications of the nodes. We have implemented the prototype framework using Apache Hadoop.
The rest of the paper is organized as follows. Section 2 presents related work. Section 3 gives an overview of our hierarchical MapReduce framework. Section 4 presents more details on running multiple AutoDock instances using MapReduce. Section 5 gives the experiment setup and result analysis. Conclusions and future work are given in Section 6.
2. RELATED WORKS
Researchers have put significant effort into the easy submission and optimal scheduling of massively parallel jobs in clusters, grids, and clouds. Conventional job schedulers, such as Condor [12], SGE [6], PBS [8], and LSF [23], aim to provide highly optimized resource allocation, job scheduling, and load balancing within a single cluster environment. On the other hand, grid brokers and metaschedulers, e.g., Condor-G [4], CSF [3], Nimrod/G [1], and GridWay [9], provide an entry point to multi-cluster grid environments. They enable transparent job submission to various distributed resource management systems without the user worrying about the locality of execution and the resources available there. With respect to AutoDock based virtual screening, our earlier efforts, presented at the National Biomedical Computation Resource (NBCR) [14] Summer Institute 2009, addressed the performance issue of massive docking processes by distributing the jobs to a grid environment. We used the CSF4 [3] meta-scheduler to split docking jobs across heterogeneous clusters, where the jobs were handled by local job schedulers including LSF, SGE, and PBS.
Clouds give users a notion of virtually unlimited, on-demand resources for computation and storage. Owing to the ease with which it executes pleasantly parallel applications, MapReduce has become a dominant programming model for running applications in a cloud. Researchers are discovering new ways to make MapReduce easier to deploy and manage, more efficient and scalable, and more capable of accomplishing complex data processing tasks. Hadoop On Demand (HOD) [7] uses the TORQUE resource manager [16] to provision and manage independent MapReduce and HDFS instances on shared physical nodes. The authors of [21] identify some fundamental performance limitations in Hadoop, and in the MapReduce model in general, that make job response times unacceptably long when multiple jobs are submitted; by substituting their own scheduler implementation, they are able to overcome these limitations and improve job throughput. CloudBATCH [22] is a prototype job queuing mechanism for managing and dispatching MapReduce jobs and command-line serial jobs in a uniform way. Traditionally a cluster must set aside MapReduce-enabled nodes because they are dedicated to MapReduce jobs and cannot run serial jobs; CloudBATCH instead uses HBase to keep various metadata on each job and uses Hadoop to wrap command-line serial jobs as MapReduce jobs, so that both types of jobs can be executed on the same set of cluster nodes. Map-Reduce-Merge [20] extends the conventional MapReduce model to accomplish common relational algebra operations over distributed heterogeneous data sets. In this extension, the Merge phase is a new concept that is more complex than the regular Map and Reduce phases and requires learning and understanding several new components, including the partition selector, processors, merger, and configurable iterators. The extension also modifies the standard MapReduce phases to expose data sources in support of relational algebra operations in the Merge phase.
Sky Computing [11] provides end users a virtual cluster interconnected with ViNe [18] across different domains. It aims to bring convenience by hiding the underlying details of the physical clusters. However, this transparency may cause an unbalanced workload if a job is dispatched over heterogeneous compute nodes in different physical domains.
Our hierarchical MapReduce framework aims to enable map-only and map-mostly jobs to run across a number of isolated clusters (even virtual clusters), so that these isolated resources can collectively provide a more powerful resource for the computation. It can easily achieve load balance because the different clusters are visible to the scheduler in our framework.
3. HIERARCHICAL MAPREDUCE
The hierarchical MapReduce framework we present in this paper consists of two layers. The top layer has a global controller that accepts user-submitted MapReduce jobs and distributes them across different local cluster domains. Upon receiving a user job, the global controller divides the job into sub-jobs according to the capability of each local cluster. If the input data has not been deployed onto the clusters already, the global controller also
partitions the input data proportionally to the sub-jobs and sends the partitions to these clusters. After the sub-jobs have all finished on the local clusters, the global controller collects the outputs and performs the final reduction using the global reducer, which is also supplied by the user. The bottom layer consists of multiple local clusters; each receives sub-jobs and input data partitions from the global controller, performs the local MapReduce computation, and sends the results back to the global controller.

Although on the surface our framework may appear structurally similar to the Map-Reduce-Merge model presented in [20], our framework is very different in nature. As discussed in the related work section, the Merge phase introduced in the Map-Reduce-Merge model is a new concept that is more complex than the conventional Map and Reduce, and programmers implementing jobs under this model must not only learn this new concept along with the components required by it, but also need to modify the Mappers and Reducers to expose data sources. Our framework, on the other hand, strictly uses the conventional Map and Reduce, and a programmer just needs to supply two Reducers – one local Reducer and one global Reducer – instead of just one for the regular MapReduce. The only requirement is that the programmer must make sure that the formats of the local Reducer output key/value pairs match those of the global Reducer input key/value pairs. However, if the job is map-only, the programmer does not need to supply any reducers, and the global controller simply collects the map results from all clusters and places them under a common directory.

3.1 Architecture
Figure 1 is a high-level architecture diagram of our hierarchical MapReduce framework. The top layer in our framework is the global controller, which consists of a job scheduler, a data transferer, a workload collector, and a user-supplied global reducer. The bottom layer consists of multiple clusters for running the distributed local MapReduce jobs, where each cluster has a MapReduce master node with a workload reporter and a job manager. The compute nodes inside each of the clusters are not accessible from the outside.

Figure 1. Hierarchical MapReduce Architecture

When a user submits a MapReduce job to the global controller, the job scheduler splits the job into a number of sub-jobs and assigns them to the local clusters based on several factors, including the current workload reported by the workload reporter of each local cluster, as well as the capability of the individual nodes making up each cluster. This is done to achieve load balance by ensuring that all clusters will finish their portion of the job in approximately the same time. The global controller partitions the input data in proportion to the sub-job sizes if the data have not been deployed before-hand. The data transferer then transfers the user-supplied jar and job configuration files, together with the input data partitions, to the clusters. As soon as the data transfer finishes for a particular cluster, the job scheduler at the global controller notifies the job manager of that cluster to start the local MapReduce job. Since data transfer is very expensive, we recommend that users only use the global controller to transfer data when the input data is small and the time spent transferring it is insignificant compared to the computation time. For large data sets, it would be more efficient and effective to deploy them before-hand, so that the jobs get the full benefit of parallelization and the overall time does not get dominated by data transfer. After the sub-jobs are finished on a local cluster, the cluster will transfer the output data back to the global controller if the application requires it. Upon receiving all the output data from all local clusters, the global controller invokes the global reducer to perform the final reduction task, unless the original job is map-only.

3.2 Programming Model
The programming model of our hierarchical MapReduce framework is the "Map-Reduce-GlobalReduce" model, where computations are expressed as three functions: Map, Reduce, and Global Reduce. We use the term "Global Reduce" to distinguish it from the "local" Reducer, but conceptually as well as syntactically, a Global Reducer is just another conventional Reducer. The Mapper, just as a conventional Mapper does, takes an input pair and produces an intermediate key/value pair; likewise, the Reducer, just as a conventional Reducer does, takes an intermediate input key and a set of corresponding values produced by the Map tasks, and outputs a different set of key/value pairs. Both the Mapper and the Reducer are executed on the local clusters. The Global Reducer is executed on the global controller using the output from the local clusters. Table 1 lists these 3 functions and their input and output data types. The formats of the local Reducer output key/value pairs must match those of the Global Reducer input key/value pairs.

Table 1. Input and output types of Map, Reduce, and Global Reduce functions

Function Name    Input             Output
Map              <k1, v1>          list(<k2, v2>)
Reduce           <k2, list(v2)>    list(<k3, v3>)
Global Reduce    <k3, list(v3)>    list(<k4, v4>)

Figure 2 uses a tree-like structure to show the data flow sequence among the Map, Reduce, and Global Reduce functions. In this diagram, the root node is the global controller on which the Global Reduce takes place, and the leaf nodes represent local clusters that perform the Map and Reduce functions. The circled numbers shown in Figure 2 indicate the order in which the steps occur, and the arrows indicate the directions in which the data sets (key/value pairs) flow. A job is submitted into the system in Step 1, and the input key/value pairs are passed from the root node (global controller) to the child nodes (local clusters) in Step 2, where Map tasks are launched; each Map consumes an input key/value pair and produces a set of intermediate key/value pairs. In Step 3, the sets of intermediate key/value pairs are passed to the Reduce tasks, which are also launched at the local clusters. Each Reduce task consumes an intermediate key with a set of corresponding values, and produces yet another set of key/value pairs as output.
In Step 4, the local reduce outputs are sent back to the global controller to perform the Global Reduce task. The Global Reduce task takes in a key and a set of corresponding values that were originally produced by the local Reducers, performs the computation, and produces the output in Step 5.

Theoretically, the model we present can be extended to more than just two hierarchical layers, i.e., the tree structure in Figure 2 can have more depth by turning the leaf clusters into intermediate controllers similar to the global controller, each of which would further divide its assigned jobs and run them on its own set of children clusters. But for all practical purposes, we do not see a need for more than two layers for the foreseeable future, because each additional layer would increase the complexity as well as the overhead. If a researcher has a large number of small clusters available, it is most likely more efficient to use them all to create a broader bottom layer than to increase the depth.

Figure 2. Programming Model

3.3 Job Scheduling and Data Partitioning
The main challenge of our work is how to balance the workload among the local MapReduce clusters, which is closely tied to how the datasets are partitioned.

The input dataset for a particular MapReduce job may be either submitted by the user to the global controller before execution, or pre-deployed on the local clusters and exposed via a metadata catalog to the user who runs the MapReduce job. The scheduler on the global controller takes the data locality into consideration when partitioning the datasets and scheduling the job.

In this paper, we focus on the situation where the input dataset is submitted by the user. If the user manually split the dataset and ran separate sub-jobs on different clusters, it would be time consuming and error-prone. Our global controller is able to automatically count the total number of records in the input dataset using the user-implemented InputFormat and RecordReader, and it divides the dataset and assigns the correct number of records to each cluster.

We make the assumption that all map tasks of a MapReduce application are computation intensive and take approximately the same amount of time to run – this is a reasonable assumption, as we will see in the next section that applying MapReduce to running multiple AutoDock instances displays exactly this kind of behavior. The scheduling algorithm we use for our framework is as follows. For cluster i, i ∈ {1, ..., n}, let MaxMapper_i be the maximum number of Mappers that can run concurrently on the cluster, RunningMapper_i be the number of Mappers currently running on it, and AvailMapper_i be the number of available Mappers that can be added for execution on it. We also use Core_i to denote the number of CPU cores of the cluster, that is,

    MaxMapper_i = θ_i × Core_i                                        (1)

Normally we set θ_i = 1 in the local MapReduce clusters for computation intensive jobs, so we get

    MaxMapper_i = Core_i                                              (2)

For simplicity, let

    AvailMapper_i = MaxMapper_i − RunningMapper_i                     (3)

The weight of each sub-job can be calculated from (4), where the factor φ_i reflects the computing power of each cluster, e.g., the CPU speed, memory size, storage capacity, etc. The actual φ_i varies depending on the characteristics of the jobs, i.e., whether they are computation intensive or I/O intensive:

    Weight_i = (φ_i × AvailMapper_i) / Σ_{j=1..n} (φ_j × AvailMapper_j)   (4)

Let NumMapTask(x) be the total number of Map tasks for a particular job x, which can be calculated from the number of keys in the input to the Map tasks, and NumMapTask_i(x) be the number of Map tasks a user job x schedules to cluster i, so that

    NumMapTask_i(x) = Weight_i × NumMapTask(x)                        (5)

After partitioning the MapReduce job into sub-MapReduce jobs using equation (5), we move the data items of the datasets accordingly, either from the global controller to the local clusters, or from local cluster to local cluster.
4. AUTODOCK MAPREDUCE
We apply the MapReduce paradigm to running multiple AutoDock instances using the hierarchical MapReduce framework to prove the feasibility of our approach. We take the outputs of AutoGrid (one tool in the AutoDock suite) as input to the AutoDock runs. The key/value pairs of the input of the Map tasks are ligand names and the locations of the ligand files. We designed a simple input file format for AutoDock MapReduce jobs. Each input record, which corresponds to a map task, has 7 fields, shown in Table 2.

Table 2. AutoDock MapReduce input fields and descriptions

Field                   Description
ligand_name             Name of the ligand
autodock_exe            Path to AutoDock executable
input_files             Input files of AutoDock
output_dir              Output directory of AutoDock
autodock_parameters     AutoDock parameters
summarize_exe           Path to summarize script
summarize_parameters    Summarize script parameters

For our AutoDock MapReduce, the Map, Reduce, and Global Reduce functions are implemented as follows:

1) Map: The Map task takes a ligand, runs the AutoDock binary executable against a shared receptor, and then runs a Python script summarize_result4.py to output the lowest energy result using a constant intermediate key.

2) Reduce: The Reduce task takes all the values corresponding to the constant intermediate key, sorts the values by the energy from low to high, and outputs the sorted results to a file using a local reducer intermediate key.

3) Global Reduce: The Global Reduce finally takes all the values of the local reducer intermediate keys, and sorts and combines them into a single file ordered by the energy from low to high.
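The three roles described in Section 4 can be sketched end to end (my illustration; a fixed lookup table of made-up energies stands in for the actual AutoDock and summarize_result4.py invocations):

```python
# Map emits each ligand's lowest binding energy under one constant key,
# Reduce sorts one cluster's energies from low to high, and Global Reduce
# merges the sorted per-cluster lists into a single ranking.
ENERGIES = {"lig_a": -7.2, "lig_b": -9.1, "lig_c": -6.5, "lig_d": -8.4}  # hypothetical

def autodock_map(ligand):
    # stands in for: run AutoDock, then summarize the lowest energy result
    return "all", (ENERGIES[ligand], ligand)

def autodock_reduce(pairs):
    return sorted(pairs)  # ascending energy: best binder first

def global_reduce(per_cluster):
    return sorted(p for result in per_cluster for p in result)

cluster1 = autodock_reduce([autodock_map(l)[1] for l in ["lig_a", "lig_b"]])
cluster2 = autodock_reduce([autodock_map(l)[1] for l in ["lig_c", "lig_d"]])
ranking = global_reduce([cluster1, cluster2])
```

The constant intermediate key means every cluster's Reduce sees all of that cluster's energies, so the Global Reduce only has to merge already-sorted lists.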
5. EVALUATIONS
We evaluate our model by prototyping a Hadoop based hierarchical MapReduce system. The system is written in Java and shell scripts. We use ssh and scp scripts to perform the data stage-in and stage-out. On the local clusters' side, the workload reporter is a component that exposes Hadoop cluster load information to be accessed by the global scheduler. Our original design was to make it a separate program without touching the Hadoop source code. Unfortunately, Hadoop does not expose the load information we need to external applications, and we had to modify the Hadoop code to add an additional daemon that collects load data using the Hadoop Java APIs.

In our evaluation, we use several clusters, including the IU Quarry cluster and two clusters in FutureGrid. IU Quarry is a classic HPC cluster which has several login nodes that are publicly accessible from outside. After a user logs in, he/she can perform various job-related tasks, including job submission, job status query, and job cancellation. The computation nodes, however, cannot be accessed from outside. Several distributed file systems (Lustre, GPFS) are mounted on each computation node for storing the input data accessed by the jobs. FutureGrid partitions its physical clusters into several parts, each of which provides a different testbed such as Eucalyptus, Nimbus, and HPC.

Table 3. Cluster Node Specifications

Cluster   CPU                  Cache size   Memory
Hotel     Intel Xeon 2.93GHz   8192KB       24GB
Alamo     Intel Xeon 2.67GHz   8192KB       12GB
Quarry    Intel Xeon 2.33GHz   6144KB       16GB

To deploy Hadoop to traditional HPC clusters, we first use the built-in job scheduler (PBS) to allocate nodes. To balance maintainability and performance, we install the Hadoop program in a shared directory while storing data in a local directory, because the Hadoop program (the Java jar files, etc.) is loaded only once by the Hadoop daemons whereas the HDFS data is accessed multiple times.

We use three clusters for the evaluations – IU Quarry, FutureGrid Hotel, and FutureGrid Alamo. Each cluster has 21 nodes. They all run Linux 2.6.18 SMP. Within each cluster, one node is a dedicated master node (HDFS namenode and MapReduce jobtracker) and the other nodes are data nodes and task trackers. Each node in these clusters has an 8-core CPU. The specifications of these cluster nodes are listed in Table 3.

Considering that AutoDock is a CPU-intensive application, we set θ = 1 per Section 3.3, so that the maximum number of map tasks on each node is equal to the number of cores on the node. The version of AutoDock we use is 4.2, which is the latest stable version. The global controller does not care about low-level execution details because our local job managers hide the complexity.

In our experiments, we use 6,000 ligands and 1 receptor. One of the most important configuration parameters is ga_num_evals - the number of evaluations. The larger its value is, the higher the probability that better results may be obtained. Based on prior experiences, ga_num_evals is typically set from 2,500,000 to 5,000,000. We configure it to 2,500,000 in our experiments.
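For reference, on the classic (0.20-era) Hadoop used at the time, the per-node cap described above would be expressed in mapred-site.xml roughly as follows (the value 8 matches the 8-core nodes; this snippet is illustrative, not taken from the authors' configuration):

```xml
<!-- cap concurrent map tasks per TaskTracker at the node's core count -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
```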
Figure 3: Number of running map tasks for an AutoDock MapReduce instance

Figure 3 plots the number of running map tasks within one cluster during the job execution. The cluster has 20 data nodes and task trackers, so the maximum number of running map tasks at any moment is 20 * 8 = 160. From the plot, we can see that the number of running map tasks quickly grows to 160 in the beginning and stays approximately constant for a long time. Towards the end of the job execution, it drops quickly to a small value (roughly 0 - 5). Notice there is a tail near the end, indicating that the node usage ratio is low. At this moment, if new MapReduce tasks come in, the available mappers will be occupied by those new tasks.

Table 4. MapReduce execution time on three clusters under different numbers of map tasks

Number of Map Tasks   Hotel       Alamo       Quarry
Per Cluster           (seconds)   (seconds)   (seconds)
100                   1004        821         1179
500                   1763        1771        2529
1000                  2986        2962        4370
1500                  4304        4251        6344
2000                  5942        5849        8778

Test Case 1:
Our first test case is a base test case without involving the Global Controller, to find out how each of our local Hadoop clusters performs under different numbers of map tasks. We ran AutoDock in Hadoop to process 100, 500, 1000, 1500, and 2000 ligand/receptor pairs in each of the three clusters. See Table 4 for the results.
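The Table 4 numbers can be checked directly for linearity; a least-squares fit (mine, for illustration) gives each cluster's marginal seconds per map task, and the Quarry/Hotel slope ratio comes out close to 1.5:

```python
# Fit a least-squares slope of execution time vs. number of map tasks for
# each cluster, using the measurements from Table 4.
tasks = [100, 500, 1000, 1500, 2000]
times = {
    "Hotel":  [1004, 1763, 2986, 4304, 5942],
    "Alamo":  [821, 1771, 2962, 4251, 5849],
    "Quarry": [1179, 2529, 4370, 6344, 8778],
}

def slope(xs, ys):
    """Least-squares slope: marginal seconds per additional map task."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

per_task = {name: slope(tasks, ts) for name, ts in times.items()}
ratio = per_task["Quarry"] / per_task["Hotel"]
```

Hotel and Alamo come out near 2.6 seconds per task and Quarry near 4.0, so the roughly-50%-slower observation holds at the margin as well as in the totals.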
g the Global oop clusters s. We ran 0, 1500 and s. See Table
ime vs. the r is close to Reduce jobs. uarry cluster
isTC
TOMcecsrudpscthbsc
Tligsajo
Running on the Quarry cluster is approximately 50% slower than running on Alamo and Hotel. The main reason is that the nodes of the Quarry cluster have slower CPUs compared with those of Alamo and Hotel.

Figure 4. Local cluster MapReduce execution time for different numbers of map tasks.

Test Case 2: Our second test case shows the performance of executing MapReduce jobs with equally-weighted partitioned datasets on different clusters, which is based on the following parameter setup. From equation (4) in Section 3.3, we keep C constant, and i ∈ {1, 2, 3} for our three clusters. Our calculation yields 160 for each cluster, given no MapReduce jobs are running beforehand. Therefore, the weight of map tasks for each cluster is 1/3. We then equally partition the dataset (apart from the shared dataset) into 3 pieces and stage the data, together with the jar executable and the job configuration files, to the local clusters for execution. The local MapReduce executions run in parallel. After the jobs finish, the output files are staged back to the global controller for the final global reduce. Figure 5 shows the data movement cost in the stage-in and stage-out contexts.

The input dataset of AutoDock contains 1 receptor and 6000 ligands. The receptor is described as a set of gridmap files totaling 35MB in size, and the 6000 ligands are stored in 6000 separate directories, each of which is approximately 5-6 KB large. In addition, the executable jar and the job configuration file together have a total of 300KB in size.

Figure 5. Two-way data movement cost of equally-weighted partitioned datasets: local MapReduce inputs and outputs.
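The partition-and-stage path described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the directory layout, function names, and the round-robin partitioning helper are assumptions, and the resulting tarball would then be pushed to each cluster (e.g., via scp) and unpacked there:

```python
import tarfile
from pathlib import Path

def partition_equally(items, n_clusters):
    """Test case 2 policy: split the ligand directories into n equal shares
    (round-robin), one share per local cluster."""
    return [items[i::n_clusters] for i in range(n_clusters)]

def stage_in_tarball(receptor_dir, ligand_dirs, jar_path, conf_path, out_path):
    """Pack one cluster's share of the input (receptor gridmaps, the ligand
    directories assigned to that cluster, the executable jar, and the job
    configuration) into a single compressed tarball, mirroring the global
    controller's "data stage-in" step."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(receptor_dir, arcname="receptor")
        for d in ligand_dirs:
            tar.add(d, arcname=f"ligands/{Path(d).name}")
        tar.add(jar_path, arcname=Path(jar_path).name)
        tar.add(conf_path, arcname=Path(conf_path).name)
    return out_path
```

The stage-out direction is symmetric: each cluster packs its output and control files into a tarball and copies it back to the global controller.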
For each cluster, the global controller creates a compressed tarball containing 1 receptor file set, 2000 ligand directories, the executable jar, and the job configuration files, and transfers it to the destination cluster, where the tarball is decompressed. We call this global-to-local procedure "data stage-in." Similarly, when the local MapReduce jobs finish, the output files together with control files (typically 300-500KB in size) are compressed into a tarball and transferred back to the global controller. We call this local-to-global procedure "data stage-out." As we can see from Figure 5, the data stage-in procedure takes 13.88 to 17.3 seconds to finish, while the data stage-out procedure takes 2.28 to 2.52 seconds to finish. The Alamo cluster takes a little longer to transfer the data, but the difference is insignificant compared to the relatively long duration of the local MapReduce executions.

The time it takes to run 2000 map tasks on each of the local MapReduce clusters varies due to the different specifications of the clusters. The local MapReduce execution makespan, including data movement costs (both data stage-in and stage-out), is shown in Figure 6. The Hotel and Alamo clusters take a similar amount of time to finish their jobs, but the Quarry cluster takes approximately 3,000 more seconds to finish, about 50% more than Hotel and Alamo. The Global Reduce task is only invoked after all the local results are ready in the global controller, and it takes only 16 seconds to finish. Thus, the relatively poor performance on Quarry becomes the bottleneck of the current job distribution.

Figure 6. Local MapReduce turnaround time of equally-weighted datasets, including data movement cost.

Test Case 3: In our third test case, we evaluate the performance of executing MapReduce jobs with CPU-capability-weighted partitioned datasets on different clusters, which is based on the following setup. From test cases 1 and 2, we have observed that although all clusters are assigned the same number of compute nodes and cores and the same amount of data, they take significantly different amounts of time to finish. Among the three clusters, Quarry is much slower than Alamo and Hotel. The cores on Quarry, Alamo and Hotel are Intel(R) Xeon(R) E5410 2GHz, Intel(R) Xeon(R) X5550 2.67GHz, and Intel(R) Xeon(R) X5570 2.93GHz, respectively. The inverse ratio of CPU frequencies and the ratio of processing times match roughly, so we hypothesize that the difference in processing time is mainly due to the different core frequencies. Therefore, it is not enough to merely factor in the number of cores for load balancing; the computation capabilities of each core are also important. We refine our scheduling policy to add CPU frequency as a factor when setting the weights: here we set 2.93 for Hotel, 2.67 for Alamo, and 2 for Quarry.
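The refined, frequency-aware policy can be illustrated with a small sketch. This is an interpretation of the weighting described above, not the authors' code: with equal node and core counts, taking each cluster's weight proportional to its per-core clock frequency reproduces the reported weights to within about 0.001, and partitioning the 6000 ligands by the reported weights yields the per-cluster task counts:

```python
# Sketch of CPU-frequency-weighted partitioning (test case 3).
# Assumption: with equal node/core counts, a cluster's weight is taken
# proportional to its per-core clock frequency.
freqs = {"Hotel": 2.93, "Alamo": 2.67, "Quarry": 2.0}   # GHz
total = sum(freqs.values())
approx_weights = {c: f / total for c, f in freqs.items()}

# Weights reported in the text (computed from cluster capacities):
weights = {"Hotel": 0.386, "Alamo": 0.3505, "Quarry": 0.2635}

# Partition the 6000-ligand dataset proportionally to the weights.
n_ligands = 6000
tasks = {c: round(n_ligands * w) for c, w in weights.items()}
print(tasks)  # → {'Hotel': 2316, 'Alamo': 2103, 'Quarry': 1581}
```

Note that the simple frequency ratio is only an approximation of the capacity-based weights computed by the framework, but the two agree closely here.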
As for test case 2, we again calculated the capacities of the clusters beforehand. Thus, given no MapReduce jobs are running, the weights are 0.386, 0.3505, and 0.2635 for Hotel, Alamo, and Quarry respectively. The dataset is also partitioned according to the new weights. Table 5 shows how the dataset is partitioned.

Table 5. Number of Map Tasks and MapReduce Execution Time on Each Cluster

  Cluster    Number of Map Tasks    Execution Time (Seconds)
  Hotel      2316                   5915
  Alamo      2103                   5888
  Quarry     1581                   6395

Figure 7 shows the data movement cost in the weighted scenario. The variations in the size of the tarballs due to the different numbers of ligand sets are quite small: the difference is smaller than 2MB. As we can see from the graph, the data stage-in procedure takes 12.34 to 17.64 seconds to finish, while the data stage-out procedure takes 2.2 to 2.6 seconds to finish. Alamo takes a little bit longer to transfer the data, but the difference is again insignificant given the relatively long duration of the local MapReduce executions, as in the previous test case.

Figure 7. Two-way data movement cost of CPU-capability-weighted partitioned datasets: local MapReduce inputs and outputs.

Figure 8. Local MapReduce turnaround time of CPU-capability-weighted datasets, including data movement cost.

With weighted partitioning, the local MapReduce execution makespans, including data movement costs (both data stage-in and stage-out), are shown in Figure 8. All three clusters take a similar amount of time to finish the local MapReduce jobs. We can see that our refined scheduler configuration improves performance by balancing the workload among the clusters.
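The framework's final Global Reduce step, which merges the partial results returned by the local clusters, can be pictured with a small sketch. It is illustrative only; the (ligand, energy) record format is an assumption for the AutoDock use case:

```python
# Sketch of the Global Reduce step: merge per-ligand results returned by
# the local MapReduce clusters and sort them by best docking energy.
def global_reduce(local_results):
    """local_results: one list of (ligand_id, best_energy) per cluster."""
    merged = {}
    for cluster_output in local_results:
        for ligand, energy in cluster_output:
            # keep the lowest (best) energy seen for each ligand
            if ligand not in merged or energy < merged[ligand]:
                merged[ligand] = energy
    # rank ligands by ascending energy (strongest predicted binding first)
    return sorted(merged.items(), key=lambda kv: kv[1])

ranked = global_reduce([
    [("lig1", -7.2), ("lig2", -5.1)],   # e.g. returned by Hotel
    [("lig3", -9.4)],                   # e.g. returned by Alamo
    [("lig4", -6.0)],                   # e.g. returned by Quarry
])
print(ranked[0])  # → ('lig3', -9.4)
```

Because this step only merges and sorts small per-cluster summaries, it is cheap relative to the local MapReduce executions.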
In the final stage, the global reduction combines the partial results from the lower-level clusters and sorts the results. The average global reduce time after processing 6000 map tasks (ligand/receptor dockings) is 16 seconds.

6. CONCLUSION AND FUTURE WORK
In this paper, we have presented a hierarchical MapReduce framework that can gather computation resources from different clusters and run MapReduce jobs across them. The applications implemented in this framework adopt the "Map-Reduce-Global Reduce" model, where computations are expressed as three functions: Map, Reduce, and Global Reduce. The global controller in our framework splits the data set and maps it onto multiple "local" MapReduce clusters to run the Map and Reduce functions, and the local results are returned back to the global controller to run the Global Reduce function. We use a resource capacity-aware algorithm to balance the workload among clusters. We used multiple AutoDock runs as a test case to evaluate the performance of our framework. The results show that the workloads are well balanced and the total makespan is kept to a minimum.

There are several potential improvements we will address in our future work. Because of the compute-intensive nature of the application, our scheduling algorithm only takes the CPU specifications into consideration. This will not be the case when an application has larger data sets and data movement becomes significant; other scheduling metrics, such as disk I/O and network I/O, need to be considered. The remote job submission and data movement in our current prototype are built upon the combination of ssh and scp, which may not work well in a heterogeneous environment. However, they can be replaced by other solutions. One possible solution for remote job submission is to integrate our framework with a meta-scheduler, e.g., CSF and Nimrod/G. Data movement can also be switched to solutions that are more scalable and work well in heterogeneous environments, such as gridftp. As an alternative to transferring data explicitly from site to site, we will also explore the feasibility of using a shared file system to share data sets among the global controller and the local Hadoop clusters.

7. ACKNOWLEDGMENTS
This work is funded in part by the Pervasive Technology Institute and Microsoft. Our special thanks to Dr. Geoffrey Fox for providing us early access to FutureGrid and valuable feedback on our work. We also would like to express our thanks to Chathura Herath for discussions.

8. REFERENCES
[1] Buyya, R., Abramson, D., Giddy, J. Nimrod/G: an architecture for a resource management and scheduling system in a global computational grid. In Proceedings of HPC ASIA'2000, China, IEEE CS Press, USA, 2000.
[2] Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113. DOI=10.1145/1327452.1327492 http://doi.acm.org/10.1145/1327452.1327492
[3] Ding, Z., Wei, X., Luo, Y., Ma, D., Arzberger, P. W., Li, W. W. Customized Plug-in Modules in Metascheduler CSF4 for Life Sciences Applications. New Generation Computing, Volume 25, Number 4, 373-394, 2007. DOI: 10.1007/s00354-007-0024-6 http://dx.doi.org/10.1007/s00354-007-0024-6
[4] Frey, J., Tannenbaum, T., Livny, M., Foster, I., Tuecke, S. 2002. Condor-G: A Computation Management Agent for Multi-Institutional Grids. Cluster Computing 5, 3 (July 2002), 237-246. DOI=10.1023/A:1015617019423 http://dx.doi.org/10.1023/A:1015617019423
[5] FutureGrid, http://www.futuregrid.org
[6] Gentzsch, W. (Sun Microsystems). 2001. Sun Grid Engine: Towards Creating a Compute Power Grid. In Proceedings of the 1st International Symposium on Cluster Computing and the Grid (CCGRID '01). IEEE Computer Society, Washington, DC, USA, 35-39
[7] Hadoop On Demand, http://hadoop.apache.org/common/docs/r0.17.2/hod.html
[8] Henderson, R. L.. 1995. Job Scheduling Under the Portable Batch System. In Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing (IPPS '95), Dror G. Feitelson and Larry Rudolph (Eds.). Springer-Verlag, London, UK, 279-294.
[9] Huedo, E., Montero, R. S., and Llorente, I. M. 2004. A framework for adaptive execution in grids. Softw. Pract. Exper. 34, 7 (June 2004), 631-651. DOI=10.1002/spe.584 http://dx.doi.org/10.1002/spe.584
[10] Kavulya, S., Tan, J., Gandhi, R., and Narasimhan, P. 2010. An Analysis of Traces from a Production MapReduce Cluster. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID '10). IEEE Computer Society, Washington, DC, USA, 94-103. DOI=10.1109/CCGRID.2010.112 http://dx.doi.org/10.1109/CCGRID.2010.112
[11] Keahey, K., Tsugawa, M., Matsunaga, A., and Fortes, J. 2009. Sky Computing. IEEE Internet Computing 13, 5 (September 2009), 43-51. DOI=10.1109/MIC.2009.94 http://dx.doi.org/10.1109/MIC.2009.94
[12] Litzkow, M. J., Livny, M., Mutka, M. W. Condor - A Hunter of Idle Workstations. ICDCS 1988:104-111
[13] Morris, G. M., Huey, R., Lindstrom, W., Sanner, M. F., Belew, R. K., Goodsell, D. S. and Olson, A. J. (2009), AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. Journal of Computational Chemistry, 30: 2785–2791. doi: 10.1002/jcc.21256
[14] National Biomedical Computation Resource, http://nbcr.net
[15] Qiu, J., Ekanayake, J., Gunarathne, T., Choi, J. Y., Bae, S., Ruan, Y., Ekanayake, S., Wu, S., Beason, S., Fox, G., Rho, M., Tang, H. "Data Intensive Computing for Bioinformatics". In Data Intensive Distributed Computing, IGI Publishers, 2010.
[16] Staples, G. 2006. TORQUE resource manager. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06). ACM, New York, NY, USA, Article 8. DOI=10.1145/1188455.1188464 http://doi.acm.org/10.1145/1188455.1188464
[17] Teragrid, http://www.teragrid.org
[18] Tsugawa, M., and Fortes, J. A. B. 2006. A virtual network (ViNe) architecture for grid computing. In Proceedings of the 20th International Conference on Parallel and Distributed Processing (IPDPS'06). IEEE Computer Society, Washington, DC, USA, 148-148.
[19] Vaqué, M., Arola, A., Aliagas, C., and Pujadas, G. 2006. BDT: an easy-to-use front-end application for automation of massive docking tasks and complex docking strategies with AutoDock. Bioinformatics 22, 14 (July 2006), 1803-1804. DOI=10.1093/bioinformatics/btl197 http://dx.doi.org/10.1093/bioinformatics/btl197
[20] Yang, H., Dasdan, A., Hsiao, R., and Parker, D. S. 2007. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD '07). ACM, New York, NY, USA, 1029-1040. DOI=10.1145/1247480.1247602 http://doi.acm.org/10.1145/1247480.1247602
[21] Zaharia, M., Borthakur, D, Sarma, J. S., Elmeleegy, K., Shenker, S., and Stoica, I. Job Scheduling for Multi-User MapReduce Clusters, Technical Report UCB/EECS-2009-55, University of California at Berkeley, April 2009.
[22] Zhang, C. and De Sterck, H. 2010. CloudBATCH: A Batch Job Queuing System on Clouds with Hadoop and HBase. In Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science, 368-375.
[23] Zhou, S, Zheng, X., Wang, J., and Delisle, P. 1993. Utopia: a load sharing facility for large, heterogeneous distributed computer systems. Softw. Pract. Exper. 23, 12 (December 1993), 1305-1336. DOI=10.1002/spe.4380231203 http://dx.doi.org/10.1002/spe.4380231203