TRANSCRIPT
SALSA
Design Pattern for Scientific Applications in DryadLINQ CTP
DataCloud-SC11
Hui Li, Yang Ruan, Yuduo Zhou, Judy Qiu, Geoffrey Fox
SALSA
Motivation
• The latest DryadLINQ CTP release was published in December 2010
• Investigate the usability and performance of the DryadLINQ language for developing data-intensive applications
• Generalize programming patterns for scientific applications in DryadLINQ
SALSA
Big Data Challenge
Mega 10^6
Giga 10^9
Tera 10^12
Peta 10^15
SALSA
MapReduce Processing Model
Scheduling, fault tolerance, workload balance, communication
SALSA
Disadvantages of MapReduce
• The rigid, flat data-processing paradigm makes it difficult to express the semantics of relational operations such as Join.
• Pig or Hive can address this to some extent but have several limitations:
– Relational operations are converted into a set of MapReduce tasks for execution
– For some equi-joins, the entire cross product must be materialized to disk
• Sample: MapReduce PageRank
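For contrast, an equi-join is a single relational operator in LINQ, and DryadLINQ exposes the same operator over distributed data. A minimal LINQ-to-Objects sketch (table contents and field names are made up for illustration; in plain MapReduce the same join has to be encoded as map/shuffle/reduce over tagged records):

using System.Linq;

class JoinSketch
{
    static void Main()
    {
        var pages = new[] { new { Url = "a", Site = 1 }, new { Url = "b", Site = 2 } };
        var sites = new[] { new { Site = 1, Owner = "x" }, new { Site = 2, Owner = "y" } };

        var joined = pages.Join(sites,
                                p => p.Site,    // join key on the left
                                s => s.Site,    // join key on the right
                                (p, s) => new { p.Url, s.Owner });
    }
}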
SALSA
Microsoft Dryad Processing Model
Processing vertices connected by channels (file, pipe, shared memory)
Inputs and outputs form a Directed Acyclic Graph (DAG)
SALSA
Microsoft DryadLINQ Programming Model
• Higher-level programming interface for Dryad
• Based on the LINQ model
• Unified data model
• Integrated into .NET languages
• Strongly typed .NET objects
Common LINQ providers
Provider          Base class (strongly typed .NET objects <T>)
LINQ-to-Objects   IEnumerable<T>
PLINQ             ParallelQuery<T>
DryadLINQ         DistributedQuery<T>
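A minimal sketch (not from the slides) of the same query shape against these providers. AsParallel() is standard PLINQ; the DryadLINQ entry point (AsDistributed()/AsDistributedFromPartitions() in the CTP) is shown only as a comment because it needs the Dryad runtime and distributed input data:

using System.Linq;

class ProviderSketch
{
    static void Main()
    {
        var data = Enumerable.Range(0, 1000);

        // LINQ-to-Objects: IEnumerable<int>, single core
        var seq = data.Where(x => x % 2 == 0).Select(x => x * x).Sum();

        // PLINQ: ParallelQuery<int>, all cores of one machine
        var par = data.AsParallel().Where(x => x % 2 == 0).Select(x => x * x).Sum();

        // DryadLINQ (assumed CTP API): DistributedQuery<int>, all nodes of the cluster
        // var dist = data.AsDistributed().Where(x => x % 2 == 0).Select(x => x * x).Sum();
    }
}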
SALSA
Pleasingly Parallel Programming Pattern
[Diagram: a DryadLINQ query is split into independent subqueries; each subquery calls a user-defined function, which may wrap legacy application code]
Sample applications: 1. SAT problem 2. Parameter sweep 3. BLAST, SW-G (bioinformatics)
Issue: resources are scheduled at the granularity of a node rather than a core
SALSA
Hybrid Parallel Programming Pattern
[Diagram: a DryadLINQ query distributes subqueries across nodes; inside each subquery the user-defined function uses PLINQ and TPL for multi-core parallelism]
Sample applications: 1. Matrix Multiplication 2. GTM and MDS
Solves the previous issue by using PLINQ, TPL, and thread-pool technologies within each node
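A minimal sketch of the per-node half of the hybrid pattern, using only standard .NET 4.0 APIs (TPL). In the full pattern the outer level would be a DryadLINQ Select over a DistributedQuery<T> (as in the SW-G example later in the talk), so each block becomes one vertex and each vertex uses all of its cores:

using System.Threading.Tasks;

static class HybridSketch
{
    // Multiply two dense blocks; rows of the result are computed in parallel
    // on the cores of the node that owns this block.
    public static double[][] Multiply(double[][] a, double[][] b)
    {
        int n = a.Length, k = b.Length, m = b[0].Length;
        var c = new double[n][];
        Parallel.For(0, n, i =>
        {
            c[i] = new double[m];
            for (int p = 0; p < k; p++)
                for (int j = 0; j < m; j++)
                    c[i][j] += a[i][p] * b[p][j];
        });
        return c;
    }
}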
SALSA
Distributed Grouped Aggregation Pattern
Three aggregation approaches: 1. Naïve aggregation 2. Hierarchical aggregation 3. Aggregation tree
Sample applications: 1. Word count 2. PageRank
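A minimal LINQ-to-Objects sketch of grouped aggregation (word count). The same SelectMany/GroupBy/aggregate shape over a DistributedQuery<T> is what the three approaches above optimize, i.e. how much pre-aggregation happens before the full shuffle:

using System.Linq;

class WordCountSketch
{
    static void Main()
    {
        string[] lines = { "the quick brown fox", "the lazy dog" };
        var counts = lines
            .SelectMany(line => line.Split(' '))    // emit one record per word
            .GroupBy(word => word)                  // group records by key
            .Select(g => new { Word = g.Key, Count = g.Count() }); // aggregate each group
        foreach (var c in counts)
            System.Console.WriteLine("{0}: {1}", c.Word, c.Count);
    }
}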
SALSA
Implementation and Performance
Hardware configuration

TEMPEST cluster:
Node           CPU          Cores  Memory    Memory/Core
TEMPEST        Intel E7450  24     24.0 GB   1 GB
TEMPEST-CNXX   Intel E7450  24     50.0 GB   2 GB

STORM cluster:
Nodes                     CPU          Cores  Memory   Memory/Core
STORM-CN01, CN02, CN03    AMD 2356     8      16 GB    2 GB
STORM-CN04, CN05          AMD 8356     16     16 GB    1 GB
STORM-CN06, CN07          Intel E7450  24     48 GB    2 GB

Software configuration:
1. DryadLINQ CTP version released in December 2010
2. Windows HPC R2 SP2
3. .NET 4.0, Visual Studio 2010
SALSA
Pairwise sequence comparison using Smith-Waterman-Gotoh (SW-G)

// Create SW-G blocks, distribute one block per partition, run SW-G on each block
var SWG_Blocks = create_SWG_Blocks(AluGeneFile, numOfBlocks, BlockSize);
var inputs = SWG_Blocks.AsDistributedFromPartitions();
var outputs = inputs.Select(distributedObject => SWG_Program(distributedObject));

1. Pleasingly parallel application
2. Easy to program with DryadLINQ
3. Easy to tune task granularity with the DryadLINQ API
SALSA
Workload balance in SW-G
[Chart: execution time (seconds) vs. number of partitions (31, 62, 124, 186, 248) for input sets with standard deviation 50, 100, and 250]
1. SW-G tasks are heterogeneous in CPU time.
2. Coarse task granularity causes workload imbalance.
3. Finer task granularity incurs more scheduling overhead.
Simple solution: tune the task granularity with the DryadLINQ API.
Related APIs: AsDistributedFromPartitions(), RangePartition<T>()
SALSA
Square dense matrix multiplication
Parallel MM algorithms: 1. Row partition 2. Row/column partition 3. Fox algorithm
Multi-core parallel technologies: 1. PLINQ 2. TPL 3. Thread pool
Pseudo-code of the Fox algorithm:
Partition matrices A and B into square blocks
For each iteration:
1) broadcast one block of matrix A along each block row
2) multiply the broadcast A block by the local B block and add the partial result to the local block of matrix C
3) roll up the matrix B blocks by one position within each column
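In block notation (a sketch assuming a q × q block grid; not taken from the slides), the target is C_{ij} = \sum_{k=0}^{q-1} A_{ik} B_{kj}, and Fox stage s computes C_{ij} += A_{i,(i+s) mod q} · B_{(i+s) mod q, j}: the A block arrives via the row broadcast of step 1, and the matching B block is already local because of the s earlier column roll-ups of step 3.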
SALSA
Matrix multiplication performance results: 1 core on 16 nodes vs. 24 cores on 16 nodes
[Charts: relative parallel efficiency and Mflops vs. matrix size (2400 to 19200) for the Fox-Hey, RowPartition, and RowColumnPartition algorithms, with 1 core per node and with 24 cores per node]
SALSA
PageRank
Three distributed grouped aggregation approaches
1. Naïve aggregation 2. Hierarchical aggregation 3. Aggregation Tree
PageRank with hierarchical aggregation approach
Foreach iteration {
1. join edges with ranks
2. distribute ranks on edges
3. group by edge destination
4. aggregate into ranks
}
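A minimal LINQ sketch of one iteration matching the four steps above. The record types, field names, and the 0.85 damping factor are illustrative; in DryadLINQ the same Join/SelectMany/GroupBy shape runs over DistributedQuery<T>, and the GroupBy step is where the choice of aggregation approach matters:

using System.Linq;

class PageRankSketch
{
    class Page { public int Source; public int[] OutLinks; }
    class Rank { public int Source; public double Value; }

    static Rank[] Step(Page[] pages, Rank[] ranks)
    {
        var partials = pages
            .Join(ranks, p => p.Source, r => r.Source,          // 1. join edges with ranks
                  (p, r) => new { p.OutLinks, Share = r.Value / p.OutLinks.Length })
            .SelectMany(x => x.OutLinks,                         // 2. distribute ranks on edges
                        (x, dest) => new { Dest = dest, x.Share });
        return partials
            .GroupBy(x => x.Dest)                                // 3. group by edge destination
            .Select(g => new Rank { Source = g.Key,              // 4. aggregate into new ranks
                                    Value = 0.15 + 0.85 * g.Sum(x => x.Share) })
            .ToArray();
    }
}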
[1] Pagerank Algorithm, http://en.wikipedia.org/wiki/PageRank [2] ClueWeb09 Data Set, http://boston.lti.cs.cmu.edu/Data/clueweb09/
SALSA
PageRank performance results: naïve aggregation vs. aggregation tree
[Charts: seconds per iteration for hash partition (naïve aggregation), aggregation tree, and hierarchical aggregation, plotted against the number of AM files (320 to 1280) and against the number of output tuples (100,000 to 1,000,000)]
SALSA
Conclusion
• We investigated the usability and performance of DryadLINQ by developing three applications: SW-G, matrix multiplication, and PageRank, and we abstracted three design patterns: the pleasingly parallel programming pattern, the hybrid parallel programming pattern, and the distributed grouped aggregation pattern
• Usability
– It is easy to tune task granularity with the DryadLINQ data model and interfaces
– The unified data model and interface let developers easily exploit the parallelism of single-core, multi-core, and multi-node environments
– DryadLINQ provides optimized distributed aggregation approaches, and the choice of approach has a material effect on performance
• Performance
– The parallel MM algorithms scale to large matrix sizes
– Distributed aggregation optimization can speed up the program by about 30% over the naïve approach
SALSA
Questions?
• Thank you!
SALSA
Outline
• Introduction – Big Data Challenge – MapReduce – DryadLINQ CTP
• Programming Models Using DryadLINQ – Pleasingly Parallel – Hybrid – Distributed Grouped Aggregation
• Implementation
• Discussion and Conclusion
SALSA
Workflow of Dryad job
[Diagram: a workstation runs the client-side stack (HPC client utilities, DSC client service, DryadLINQ provider) and submits the job to the head node of a Windows HPC Server 2008 R2 cluster (HPC Job Scheduler service, DSC service); the Dryad graph manager runs on a compute node and dispatches vertices 1 … n to the other compute nodes, each reading and writing its data through DSC]