
Twister4Azure

Familiar MapReduce programming model

Fault-tolerance features similar to traditional MapReduce, with no single point of failure (see the sketch after this list).

Combiner step

Supports dynamically scaling the compute resources up and down.

Web-based monitoring console

Easy testing and deployment using the Azure local development fabric.
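The fault-tolerance behavior noted above comes from queue semantics rather than from any master node. Below is a minimal sketch of the idea, assuming the standard Azure Queue visibility-timeout pattern (this toy Python class is illustrative, not Twister4Azure code): a dequeued task message stays hidden for a timeout and is deleted only after the task commits its output, so tasks held by a failed worker automatically reappear for other workers.

```python
import time

class VisibilityQueue:
    """Toy model of Azure Queue visibility-timeout semantics (illustrative only)."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self.messages = {}   # msg_id -> (payload, invisible_until)
        self._next_id = 0

    def put(self, payload):
        """Enqueue a task message, immediately visible to workers."""
        self.messages[self._next_id] = (payload, 0.0)
        self._next_id += 1

    def get(self):
        """Return a visible message and hide it for the timeout period.

        If the worker crashes before calling delete(), the message
        becomes visible again and another worker can pick it up, so
        task dispatch has no single point of failure.
        """
        now = time.monotonic()
        for msg_id, (payload, invisible_until) in self.messages.items():
            if invisible_until <= now:
                self.messages[msg_id] = (payload, now + self.visibility_timeout)
                return msg_id, payload
        return None

    def delete(self, msg_id):
        """Called only after the task's output is safely stored."""
        self.messages.pop(msg_id, None)
```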

Twister4Azure: Iterative MapReduce for Azure Cloud

Thilina Gunarathne, Judy Qiu, Geoffrey Fox {tgunarat, xqiu, gcf}@indiana.edu

http://salsahpc.indiana.edu/twister4azure

Introduction

Many algorithms rely on iterative computations in which each iterative step can be easily specified as a MapReduce computation. MapReduceRoles for Azure (MR4Azure) is a decentralized, dynamically scalable MapReduce runtime we developed for the Windows Azure cloud platform, using Microsoft Azure cloud infrastructure services as the building blocks. Twister4Azure extends MR4Azure to support optimized iterative MapReduce execution, enabling a wide array of large-scale iterative data-analysis and scientific applications to utilize the Azure platform easily and efficiently.
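To make the programming model concrete, here is a minimal sketch, in Python with hypothetical function names (Twister4Azure itself is a .NET framework; this is not its API), of the iterative Map-Reduce-Merge control flow described above, with the local Combine step omitted for brevity:

```python
def run_iterative_mapreduce(map_fn, reduce_fn, merge_fn,
                            static_data, dynamic_data, max_iterations=100):
    """Run MapReduce iterations until merge_fn signals convergence."""
    for _ in range(max_iterations):
        # Map: each task reads a partition of the (cacheable) static data
        # plus the loop-variant dynamic data from the previous iteration.
        map_outputs = [map_fn(partition, dynamic_data)
                       for partition in static_data]

        # Shuffle: group intermediate key/value pairs by key.
        grouped = {}
        for output in map_outputs:
            for key, value in output:
                grouped.setdefault(key, []).append(value)

        # Reduce: one invocation per intermediate key.
        reduce_outputs = {key: reduce_fn(key, values)
                          for key, values in grouped.items()}

        # Merge: combine the reduce outputs into the next iteration's
        # dynamic data and decide whether another iteration is needed.
        dynamic_data, converged = merge_fn(dynamic_data, reduce_outputs)
        if converged:
            break
    return dynamic_data
```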

Performance Comparison

[Figure: Twister4Azure computation flow. Map and Combine tasks read static input from the Data Cache and feed the Reduce and Merge steps; after Merge, an "Add Iteration?" decision either triggers hybrid scheduling of the new iteration (Yes) or proceeds to Job Finish (No).]

K-Means Clustering

K-Means iterative MapReduce performance: 16 Azure Small instances, 6 iterations, 8 to 48 million 20-D data points.

Left: performance with and without data caching. Right: speedup obtained from using the data cache.
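As a concrete instance of the model, the following minimal sketch (plain Python with illustrative names; not the Twister4Azure API) expresses K-Means in Map-Reduce-Merge form. The data points are the static, cacheable input read by the map step; the small centroid vector is the loop-variant data carried between iterations. These three functions plug directly into the driver loop sketched in the Introduction.

```python
import math

def kmeans_map(points, centroids):
    """Map: emit (nearest_centroid_index, (point, 1)) for each point."""
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(point, centroids[i]))
        yield nearest, (point, 1)

def kmeans_reduce(key, values):
    """Reduce: average all points assigned to one centroid."""
    dims = len(values[0][0])
    total = [0.0] * dims
    count = 0
    for point, weight in values:
        count += weight
        for d in range(dims):
            total[d] += point[d] * weight
    return [t / count for t in total]

def kmeans_merge(old_centroids, reduce_outputs, tol=1e-4):
    """Merge: assemble the new centroid list; converged when no
    centroid moved more than tol."""
    new_centroids = [reduce_outputs.get(i, old_centroids[i])
                     for i in range(len(old_centroids))]
    moved = max(math.dist(o, n)
                for o, n in zip(old_centroids, new_centroids))
    return new_centroids, moved <= tol
```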

[Figure: Worker Role internals. Map workers (Map 1 ... Map n) and Reduce workers (Red 1 ... Red n) share an In-Memory Data Cache; Task Monitoring and Role Monitoring record MapID/status entries in the Map Task Table; tasks arrive via the Scheduling Queue and the Job Bulletin Board.]

A decentralized, dynamically scalable, fault-tolerant iterative MapReduce framework built using cloud services for the Microsoft Azure cloud.


REFERENCES

Gunarathne, T., Wu, T.-L., Qiu, J., and Fox, G.C. 2010. MapReduce in the Clouds for Science. In Proceedings of the CloudCom 2010 Conference (Indianapolis, December 2010).

Ekanayake, J., Li, H., Zhang, B., et al. 2010. Twister: A Runtime for Iterative MapReduce. In Proceedings of the First International Workshop on MapReduce and its Applications, ACM HPDC 2010 Conference (Chicago, June 2010).

Gunarathne, T., Wu, T.-L., Qiu, J., et al. 2010. Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications. Submitted for publication in Concurrency and Computation: Practice and Experience.

ACKNOWLEDGEMENTS

Microsoft Exploratory Research in Clouds and Platforms Grant, FutureGrid, and the Salsa Group.

[Figure: BLAST parallel efficiency (0-100%) versus number of query files (128 to 728) for Twister4Azure, Hadoop-Blast, and DryadLINQ-Blast.]

CAP3 Sequence Assembly. Azure Small instances were used for Twister4Azure, Amazon EC2 High-CPU-Extra-Large instances for Amazon Elastic MapReduce, and an 8-core, 16 GB per node cluster for Apache Hadoop. Each file contained 458 reads.

[Figure: Adjusted time (0-3000 s) versus num. of cores * num. of blocks for Twister4Azure, Amazon EMR, and Apache Hadoop.]

[Figure: Parallel efficiency (50-100%) versus num. of cores * num. of files for Twister4Azure, Amazon EMR, and Apache Hadoop.]

MapReduceRoles for Azure

Uses distributed, highly scalable, and highly available cloud services as the building blocks:

Azure Queues for task scheduling.

Azure Blob storage for input, output, and intermediate data storage.

Azure Tables for metadata storage and monitoring.

Utilizes eventually consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes.

Minimal management and maintenance overhead.

Iterative extensions:

Merge step

In-memory caching of static data

Cache-aware hybrid scheduling using queues as well as a bulletin board (a special table); see the sketch below.
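To show how the queue and bulletin board interact, here is a minimal sketch of the cache-aware hybrid scheduling idea, assuming plain Python stand-ins for the Azure services (all names are hypothetical; the real bulletin board is an Azure Table and the fallback is an Azure Queue): a worker first claims a task whose static input it already caches, and only otherwise pulls from the global queue and pays the download cost.

```python
import queue

def fetch_from_blob_storage(data_id):
    """Hypothetical stand-in for downloading a partition from Azure Blob storage."""
    return f"partition-{data_id}"

def next_task(worker_cache, bulletin_board, scheduling_queue):
    """Cache-aware hybrid scheduling: bulletin board first, queue second.

    worker_cache:     dict data_id -> cached partition (in-memory data cache)
    bulletin_board:   dict task_id -> data_id for the new iteration's tasks
    scheduling_queue: queue.Queue of (task_id, data_id) entries
    """
    # First pass: look on the bulletin board for a task whose input
    # this worker already holds in memory (no blob download needed).
    for task_id, data_id in list(bulletin_board.items()):
        if data_id in worker_cache:
            del bulletin_board[task_id]   # an atomic table update in reality
            return task_id, worker_cache[data_id]

    # Fallback: take any task from the global scheduling queue,
    # download its input, and cache it for later iterations.
    try:
        task_id, data_id = scheduling_queue.get_nowait()
    except queue.Empty:
        return None                        # no work available right now
    worker_cache[data_id] = fetch_from_blob_storage(data_id)
    return task_id, worker_cache[data_id]
```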

[Figure: Relative parallel efficiency (0-160%) and time (0-1600 s) for runs of 8 x 16M, 16 x 32M, 32 x 64M, 48 x 96M, and 64 x 128M (num instances x num data points).]

Left: scaling speedup with an increasing number of instances (Azure Small) and data sizes, for 10 iterations. Right: an increasing number of iterations using 16 million data points, with caching, on 16 Azure Small instances.

BLAST

NCBI BLAST+ sequence searching against the NR protein sequence database (~8.7 GB) using 128 CPU cores. 16 Extra-Large Azure instances were used for Twister4Azure; an 8-core, 16 GB memory per node cluster was used for Apache Hadoop; and a 16-core, 16 GB memory per node Windows HPC cluster was used for DryadLINQ testing. Each query file contained 100 sequences.

Smith-Waterman-GOTOH All-Pairs Sequence Alignment

Azure Small instances were used for Twister4Azure, Amazon EC2 High-CPU-Extra-Large instances for Amazon Elastic MapReduce, and an 8-core, 16 GB per node cluster for Apache Hadoop. Each block contained 200 sequences.