azure mapreduce - indiana university...

30
Azure MapReduce Thilina Gunarathne Salsa group, Indiana University

Upload: others

Post on 22-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Azure MapReduce

Thilina Gunarathne

Salsa group, Indiana University

Page 2: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Agenda

• Recap of Azure Cloud Services

• Recap of MapReduce

• Azure MapReduce Architecture

• Pairwise distance alignment implementation

• Next steps

Page 3: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Cloud Computing

• On demand computational services over web– Backed by massive commercial infrastructures giving

economies of scale– Spiky compute needs of the scientists

• Horizontal scaling with no additional cost– Increased throughput

• Cloud infrastructure services– Storage, messaging, tabular storage– Cloud oriented services guarantees– Virtually unlimited scalability

• Future seems to be CLOUDY!!!

Page 4: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Azure Platform

• Windows Azure Compute

– .net platform as a service

– Worker roles & web roles

• Azure Storage

– Blobs

– Queues

– Table

• Development SDK, fabric and storage

Page 5: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

MapReduce

• Automatic parallelization & distribution

• Fault-tolerant

• Provides status and monitoring tools

• Clean abstraction for programmers– map (in_key, in_value) ->

(out_key, intermediate_value) list

– reduce (out_key, intermediate_value list) ->

out_value list

Page 6: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Motivation

• Currently no parallel programming framework on Azure

– No MPI, No Dryad

• Well known, easy to use programming model

• Cloud nodes are not as reliable as conventional cluster nodes

Page 7: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Azure MapReduce Concepts

• Take advantage of the cloud services– Distributed services, Unlimited scalability – Backed by industrial strength data centers and

technologies

• Decentralized control• Dynamically scale up/down• Eventual consistency• Large latencies

– Coarser grained map tasks

• Global queue based scheduling

Page 8: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

1

1.Client driver loads the map & reduce tasks to the queues

Page 9: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

2

2. Map workers retrieve map tasks from the queue

Page 10: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

3

3. Map workers download data from the Blob storage and start processing

Page 11: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

4

4. Reduce workers pick the tasks from the queue and start monitoring the reduce task tables

Page 12: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

5

5. Finished map tasks upload the results to Blob storage. Add entries to the respective reduce task tables.

Page 13: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

6

6. Reduce tasks download the intermediate data products

Page 14: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

7

7. Start reducing when all the map tasks are finished and when a reduce task is finished downloading the intermediate data products

Page 15: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Azure MapReduce Architecture

• Client API and driver

• Map tasks

• Reduce tasks

• Intermediate data transfer

• Monitoring

• Configurations

Page 16: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Fault tolerance

• Use the visibility timeout of the queues– Currently maximum is 3 hours

– Delete the message from the queue only after everything is successful

– Execution, upload, update status

• Tasks will rerun when timeout happens– Ensures eventual completion

– Intermediate data are persisted in blob storage

– Retry up to 3 times

• Many retries in service invocations

Page 17: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Apache Hadoop

[24] /(Google MR)

Microsoft Dryad [25] Twister [19] Azure Map

Reduce/Twister

Programming

Model

MapReduce DAG execution,

Extensible to

MapReduce and other

patterns

Iterative

MapReduce

MapReduce-- will

extend to Iterative

MapReduce

Data Handling HDFS (Hadoop

Distributed File

System)

Shared Directories &

local disks

Local disks and

data management

tools

Azure Blob Storage

Scheduling Data Locality; Rack

aware, Dynamic

task scheduling

through global

queue

Data locality;

Network

topology based

run time graph

optimizations; Static task

partitions

Data Locality;

Static task

partitions

Dynamic task

scheduling through

global queue

Failure Handling Re-execution of

failed tasks;

Duplicate execution

of slow tasks

Re-execution of failed

tasks; Duplicate

execution of slow tasks

Re-execution of

Iterations

Re-execution of

failed tasks;

Duplicate execution

of slow tasks

Environment Linux Clusters,

Amazon Elastic Map

Reduce on EC2

Windows HPCS cluster Linux Cluster

EC2

Window Azure

Compute, Windows

Azure Local

Development

Fabric

Intermediate

data transfer

File, Http File, TCP pipes, shared-

memory FIFOs

Publish/Subscribe

messaging

Files, TCP

Page 18: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Why Azure Services

• Virtually unlimited scalable distributed services

• No need to install software stacks

– In fact you can’t

– Eg: NaradaBrokering, HDFS, Database

• Zero maintenance

– Let the platform take care of you

• Availability guarantees

Page 19: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

API

• ProcessMapRed(jobid, container, params, numReduceTasks, storageAccount, mapQName, reduceQName,ListmapTasks)

• Map(key, value, programArgs, Dictionary outputCollector)

• Reduce(key, List values, programArgs, Dictionary outputCollector)

Page 20: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Develop applications using Azure MapReduce

• Local debugging using Azure development fabric

• DistributedCache

– Bundle with Azure Package

• Compile in release mode before creating the package.

• Deploy using Azure web interface

• Errors logged to a Azure Table

Page 21: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

SWG Pairwise Distance Alignment

• SmithWaterman-GOTOH

• Pairwise sequence alignment

– Align each sequence with all the other sequences

Page 22: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Application architectureBlock decomposition

1

(1-100)

2

(101-200)

3

(201-300)

4

(301-400)

1

(1-100)M1 M2 from M6 M3 Reduce 1

2

(101-200)from M2 M4 M5 from M9

Reduce 2

3

(201-300)M6 from M5 M7 M8

Reduce 3

4

(301-400)from M3 M9 from M8 M10

Reduce 4

Page 23: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

AzureMR SWG Performance10k Sequences

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

0 32 64 96 128 160

Exe

cuti

on

Tim

e (

s)

Number of Azure Small Instances

Execution Time(s)

Page 24: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

AzureMR SWG Performance10k Sequences

0

1

2

3

4

5

6

7

0 32 64 96 128 160

Alig

nm

en

t Ti

me

(m

s)

Number of Azure Small Instances

Time Per Alignment Per Instance

Page 25: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

AzureMR SWG Performance on Different Instance Types

0

100

200

300

400

500

600

700

Small Medium Large ExtraLarge

Exe

cuti

on

Tim

e (s

)

Instance Type

Execution Time

Page 26: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

AzureMR SWG Performance on Different Data Sizes

0

1

2

3

4

5

6

7

8

4000 5000 6000 7000 8000 9000 10000

Tim

e fo

r an

Act

ual

Ali

gem

en

t (m

s)

Number of Sequences

Time Per Alignment Per Core (ms)

Page 27: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Next Steps

• In the works

– Monitoring web interface

– Alternative intermediate data communication mechanisms

– Public release

• Future plans

– AzureTwister

• Iterative MapReduce

Page 28: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Thanks!!

• Questions?

Page 29: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

References

• J. Dean, and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107-113., 2008.

• J.Ekanayake, H.Li, B.Zhang et al., “Twister: A Runtime for iterative MapReduce,” in Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference June 20-25, 2010, Chicago, Illinois, 2010.

• Cloudmapreduce, http://sites.google.com/site/huanliu/cloudmapreduce.pdf

• "Apache Hadoop," http://hadoop.apache.org/

• M. Isard, M. Budiu, Y. Yu et al., "Dryad: Distributed data-parallel programs from sequential building blocks." pp. 59-72.

Page 30: Azure MapReduce - Indiana University Bloomingtonsalsahpc.indiana.edu/.../slides/0728/Azure_MapReduce.pdf · 2010-07-29 · Apache Hadoop [24] /(Google MR) Microsoft Dryad [25] Twister

Acknowledgments

• Prof. Geoffrey Fox, Dr. Judy Qiu and the Salsa group

• Dr. Ying Chen and Alex De Luca from IBM Almaden Research Center

• Virtual School Organizers