Twister4Azure: Iterative MapReduce for Azure Cloud
Thilina Gunarathne, Judy Qiu, Geoffrey Fox {tgunarat, xqiu, gcf}@indiana.edu
http://salsahpc.indiana.edu/twister4azure

Twister4Azure
Familiar MapReduce programming model
Fault-tolerance features similar to traditional MapReduce
No single point of failure
Combiner step
Supports dynamic scaling of compute resources up and down
Web-based monitoring console
Easy testing and deployment using the Azure local development fabric
Introduction
Many algorithms rely on iterative computations in which each iterative step can be specified as a MapReduce computation. MapReduceRoles for Azure (MR4Azure) is a decentralized, dynamically scalable MapReduce runtime we developed for the Windows Azure cloud platform, using Microsoft Azure cloud infrastructure services as its building blocks. Twister4Azure extends MR4Azure to support optimized iterative MapReduce execution, enabling a wide array of large-scale iterative data analysis and scientific applications to utilize the Azure platform easily and efficiently.
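To make the iterative pattern concrete, below is a minimal Python sketch of an iterative MapReduce driver: each iteration runs Map and Reduce, and a Merge step combines the reduce outputs and decides whether another iteration is needed. This illustrates the programming model only; all function and parameter names are assumptions for this sketch, not the Twister4Azure API (which targets .NET on Azure).

```python
# Minimal sketch of the iterative MapReduce pattern that Twister4Azure
# optimizes: each iteration is a MapReduce computation whose merged
# output (the "broadcast" data) feeds the next iteration.
# All names here are illustrative, not the actual Twister4Azure API.
from collections import defaultdict


def map_reduce_iteration(records, broadcast, map_fn, reduce_fn):
    """Run one Map -> (group by key) -> Reduce pass."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record, broadcast):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def run_iterative(records, broadcast, map_fn, reduce_fn, merge_fn, max_iters=100):
    """Drive Map/Reduce iterations until merge_fn signals convergence."""
    for _ in range(max_iters):
        reduced = map_reduce_iteration(records, broadcast, map_fn, reduce_fn)
        broadcast, converged = merge_fn(reduced, broadcast)
        if converged:
            break
    return broadcast
```

Note that `records` stays constant across iterations while only `broadcast` changes; this static/dynamic split is what makes the in-memory data cache described below effective.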
Twister4Azure Computation Flow
[Figure: Job Start → Map/Combine tasks (reading from the data cache) → Reduce → Merge → "add iteration?" decision; Yes → hybrid scheduling of the new iteration, No → Job Finish]

Performance Comparison
K-Means Clustering
KMeans iterative MapReduce performance on 16 Azure Small instances, 6 iterations, 8 to 48 million 20-D data points. Left: performance with and without data caching. Right: speedup obtained from using the data cache.
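As a concrete illustration of how KMeans fits the iterative model sketched in the Introduction, here is a self-contained Python sketch of the three user functions: map assigns each point to its nearest centroid, reduce averages the points per cluster, and merge adopts the new centroids and tests convergence. This mirrors the standard KMeans-as-MapReduce formulation, not Twister4Azure's actual code.

```python
# KMeans formulated as iterative MapReduce: an illustrative sketch,
# not the Twister4Azure implementation.
import math
from collections import defaultdict


def kmeans_map(point, centroids):
    """Emit (nearest-centroid-index, (point, 1)) for one data point."""
    best = min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))
    return [(best, (point, 1))]


def kmeans_reduce(_, values):
    """Sum the points assigned to one centroid and return their mean."""
    dims = len(values[0][0])
    total = [0.0] * dims
    count = 0
    for point, n in values:
        count += n
        for d in range(dims):
            total[d] += point[d]
    return [x / count for x in total], count


def kmeans_merge(reduced, centroids, tol=1e-6):
    """Adopt the new centroids; converge when none moved more than tol."""
    new = [reduced.get(i, (c, 0))[0] for i, c in enumerate(centroids)]
    moved = max(math.dist(a, b) for a, b in zip(centroids, new))
    return new, moved <= tol
```

Plugged into the driver sketched in the Introduction, `run_iterative(points, initial_centroids, kmeans_map, kmeans_reduce, kmeans_merge)` performs the clustering; the static points are exactly what the in-memory data cache keeps resident across iterations instead of re-reading them from Blob storage.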
[Figure: Twister4Azure architecture. Map workers (Map 1 … Map n) and Reduce workers (Red 1 … Red n) run inside an Azure Worker Role alongside an in-memory data cache; a Map Task Table (MapID → Status) and a Job Bulletin Board support task and role monitoring; tasks are dispatched through a scheduling queue]
A Decentralized, Dynamically Scalable, Fault-Tolerant Iterative MapReduce Framework Built Using Cloud Services for the Microsoft Azure Cloud.
REFERENCES
Gunarathne, T., Wu, T.-L., Qiu, J., and Fox, G. C. 2010. MapReduce in the Clouds for Science. In Proceedings of the CloudCom 2010 Conference (Indianapolis, December 2010).
Ekanayake, J., Li, H., Zhang, B., et al. 2010. Twister: A Runtime for Iterative MapReduce. In Proceedings of the First International Workshop on MapReduce and its Applications, ACM HPDC 2010 Conference (Chicago, June 2010).
Gunarathne, T., Wu, T.-L., Qiu, J., et al. 2010. Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications. Submitted to Concurrency and Computation: Practice and Experience.

ACKNOWLEDGEMENTS
Microsoft Exploratory Research in Clouds and Platforms Grant, FutureGrid, and the SALSA Group.
[Figure: BLAST parallel efficiency (0% to 100%) vs. number of query files (128 to 728) for Twister4Azure, Hadoop-Blast, and DryadLINQ-Blast]
CAP3 sequence assembly using Azure Small instances for Twister4Azure, Amazon EC2 High-CPU-Extra-Large instances for Amazon Elastic MapReduce, and an 8-core, 16 GB-per-node cluster for Apache Hadoop. Each file contained 458 reads.
[Figure: adjusted time (s, 0 to 3000) vs. number of cores × number of blocks for Twister4Azure, Amazon EMR, and Apache Hadoop]
[Figure: parallel efficiency (50% to 100%) vs. number of cores × number of files for Twister4Azure, Amazon EMR, and Apache Hadoop]
MapReduceRoles for Azure
Uses distributed, highly scalable, and highly available cloud services as the building blocks:
Azure Queues for task scheduling (see the scheduling sketch after this list)
Azure Blob Storage for input, output, and intermediate data storage
Azure Tables for metadata storage and monitoring
Utilizes eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes
Minimal management and maintenance overhead
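The queue-based scheduling that makes the runtime decentralized and fault tolerant follows a dequeue-hide-delete pattern: a picked-up task message becomes invisible rather than deleted, and is removed only after the worker commits its output, so the tasks of failed workers reappear and get retried. The sketch below imitates these Azure Queue visibility-timeout semantics with plain Python as a stand-in; it is not the Azure SDK, and all names are illustrative.

```python
# In-memory stand-in for Azure Queue visibility-timeout semantics:
# a task is hidden (not deleted) when a worker picks it up and is
# deleted only when the worker finishes, so a failed worker's task
# becomes visible again and another worker retries it.
import time
import threading


class VisibilityQueue:
    """Toy queue with per-message visibility timeouts."""

    def __init__(self, visibility_timeout=30.0):
        self._lock = threading.Lock()
        self._timeout = visibility_timeout
        self._messages = {}   # msg_id -> (payload, visible_at)
        self._next_id = 0

    def put(self, payload):
        with self._lock:
            self._messages[self._next_id] = (payload, 0.0)
            self._next_id += 1

    def get(self):
        """Return (msg_id, payload) and hide the message, or None."""
        now = time.monotonic()
        with self._lock:
            for msg_id, (payload, visible_at) in self._messages.items():
                if visible_at <= now:
                    # Hide the message until the visibility timeout expires.
                    self._messages[msg_id] = (payload, now + self._timeout)
                    return msg_id, payload
        return None

    def delete(self, msg_id):
        """Ack a finished task so it is never redelivered."""
        with self._lock:
            self._messages.pop(msg_id, None)
```

A worker loop would call `get()`, process the task, and call `delete(msg_id)` only on success; a crash before `delete()` simply lets the message resurface after the timeout for another worker, which is why no central scheduler or single point of failure is needed.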
Iterative extensions
Merge step
In-memory caching of static data
Cache-aware hybrid scheduling using Queues as well as a bulletin board (a special table); see the sketch after this list
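A hedged sketch of the cache-aware hybrid scheduling policy, under these assumptions: the master posts each new iteration's tasks to a bulletin board (an Azure Table in the real system, a plain dict here), and each worker first claims tasks whose static input it already holds in its in-memory cache, falling back to the regular scheduling queue otherwise. The data structures and names are illustrative stand-ins, not the Twister4Azure API.

```python
# Cache-aware hybrid scheduling sketch: prefer bulletin-board tasks
# whose input data this worker already has cached; otherwise fall back
# to the global scheduling queue. Illustrative stand-ins only.
from queue import Queue, Empty


def claim_from_bulletin_board(board, cache, worker_id):
    """Claim an unassigned task whose input this worker has cached."""
    for task_id, task in board.items():
        if task["owner"] is None and task["input"] in cache:
            task["owner"] = worker_id  # would be an atomic Table update
            return task
    return None


def next_task(board, fallback_queue, cache, worker_id):
    """Prefer cache-local tasks; otherwise pull from the global queue."""
    task = claim_from_bulletin_board(board, cache, worker_id)
    if task is not None:
        return task
    try:
        return fallback_queue.get_nowait()
    except Empty:
        return None


# Example: a worker that cached block "b3" claims the bulletin-board task.
if __name__ == "__main__":
    board = {0: {"owner": None, "input": "b3", "kind": "map"}}
    q = Queue()
    q.put({"owner": None, "input": "b7", "kind": "map"})
    print(next_task(board, q, cache={"b3"}, worker_id="worker-1"))
```

The payoff is that a worker that cached an input block in a previous iteration re-executes that block's map task without re-fetching it from Blob storage; the real system needs a conditional update on the Table to make the claim race-free.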
[Figure: KMeans scaling: execution time (s, 0 to 1600) and relative parallel efficiency (0% to 160%) vs. number of instances × number of data points (8 × 16M to 64 × 128M)]
Left: scaling speedup with increasing numbers of instances (Azure Small) and data sizes for 10 iterations. Right: performance with an increasing number of iterations on 16 million data points with caching, using 16 Azure Small instances.
BLAST
NCBI BLAST+ sequence searching against the NR protein sequence database (~8.7 GB) using 128 CPU cores. 16 Azure Extra-Large instances were used for Twister4Azure, an 8-core, 16 GB-memory-per-node cluster for Apache Hadoop, and a 16-core, 16 GB-memory-per-node Windows HPC cluster for DryadLINQ. Each query file contained 100 sequences.
Smith Waterman-GOTOH All-Pairs Sequence Alignment
Azure Small instances were used for Twister4Azure, Amazon EC2 High-CPU-Extra-Large instances for Amazon Elastic MapReduce, and an 8-core, 16 GB-per-node cluster for Apache Hadoop. Each block contained 200 sequences.