
Distributed computing using Dryad

Michael Isard
Microsoft Research Silicon Valley

Distributed Data-Parallel Computing

• Cloud
  – Transparent scaling
  – Resource virtualization

• Commodity clusters
  – Fault tolerance with good performance

• Workloads beyond standard SQL, HPC
  – Data-mining, graph analysis, …
  – Semi-structured/unstructured data

Execution layer

• This talk: system-level middleware
  – Yuan Yu will describe the DryadLINQ programming model on Saturday

• Algorithm -> execution plan by magic

Problem domain

• Large inputs
  – Tens of GB is a “small test dataset”
  – Single job up to hundreds of TB
  – Semi-structured data is common

• Not latency sensitive
  – Overhead of seconds for a trivial job
  – Large job could take days
  – Batch computation, not online queries
    • Simplifies fault tolerance, caching, etc.

Talk overview

• Some typical computations
• DAG implementation choices
• The Dryad execution engine
• Comparison with MapReduce
• Discussion

Map

• Independent transformation of dataset
  – for each x in S, output x’ = f(x)

• E.g. simple grep for word w (see sketch below)
  – output line x only if x contains w

[Diagram: S → f → S’; partitioned: S1 → f → S1’, S2 → f → S2’, S3 → f → S3’]
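A minimal sketch of the grep-style map in plain Python (illustrative only, not Dryad or DryadLINQ code; `grep_map`, `map_partition`, and the sample data are made up). Because f is applied to each record independently, every partition can run on a different machine:

```python
def grep_map(line, word):
    """f(x): emit the line only if it contains the word."""
    return [line] if word in line.split() else []

def map_partition(partition, word):
    """Apply f independently to every record of one partition."""
    out = []
    for line in partition:
        out.extend(grep_map(line, word))
    return out

# Each partition S1, S2, S3 could be processed on a different machine.
partitions = [["the quick fox", "lazy dog"], ["fox again"], ["no match here"]]
mapped = [map_partition(p, "fox") for p in partitions]
# mapped == [["the quick fox"], ["fox again"], []]
```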

Reduce

• Grouping plus aggregation
  – 1) Group x in S according to key selector k(x)
  – 2) For each group g, output r(g)

• E.g. simple word count (see sketch below)
  – group by k(x) = x
  – for each group g, output key (word) and count of g

[Diagram: S → G → r → S’; partitioned: each Si → D → G → r → Si’, where D is distribute, e.g. by hash or range]
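A minimal sketch of the grouping-plus-aggregation pattern in plain Python (illustrative only; `distribute` and `group_and_reduce` are hypothetical names, not Dryad's API): records are routed by hash of their key (the D stage), then each output partition groups (G) and reduces (r):

```python
from collections import defaultdict

def distribute(records, n_parts, key=lambda x: x):
    """D: route each record to an output partition by hash of its key."""
    parts = [[] for _ in range(n_parts)]
    for x in records:
        parts[hash(key(x)) % n_parts].append(x)
    return parts

def group_and_reduce(partition, key=lambda x: x, r=len):
    """G then r: group records by key, then apply the reducer to each group."""
    groups = defaultdict(list)
    for x in partition:
        groups[key(x)].append(x)
    return {k: r(g) for k, g in groups.items()}

# Word count: k(x) = x and r(g) = len(g).
words = ["fox", "dog", "fox", "cat", "dog", "fox"]
output_parts = [group_and_reduce(p) for p in distribute(words, n_parts=2)]
# e.g. [{"fox": 3, "cat": 1}, {"dog": 2}], depending on how keys hash
```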

Reduce

[Diagram: each Si → G → ir → D → G → r → Si’, where ir is initial reduce, e.g. compute a partial sum]
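A sketch of the initial-reduce optimization under the same assumptions (the hash-distribute step is omitted for brevity): each input partition is pre-aggregated locally, so only small (word, count) summaries cross the network before the final reduce.

```python
def initial_reduce(partition):
    """ir: compute partial counts locally, one (word, count) pair per key."""
    partial = {}
    for w in partition:
        partial[w] = partial.get(w, 0) + 1
    return list(partial.items())

def final_reduce(pairs):
    """r: sum the partial counts routed to this output partition."""
    totals = {}
    for w, c in pairs:
        totals[w] = totals.get(w, 0) + c
    return totals

input_parts = [["fox", "dog", "fox"], ["cat", "dog", "fox"]]
partials = [initial_reduce(p) for p in input_parts]   # small per-partition summaries
merged = final_reduce([pair for p in partials for pair in p])
# merged == {"fox": 3, "dog": 2, "cat": 1}
```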

K-means

• Set of points P, initial set of cluster centres C
• Iterate until convergence (see sketch below):
  – For each c in C
    • Initialize count_c, centre_c to 0
  – For each p in P
    • Find c in C that minimizes dist(p,c)
    • Update: count_c += 1, centre_c += p
  – For each c in C
    • Replace c <- centre_c / count_c
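A minimal single-machine sketch of one iteration of this loop in plain Python (illustrative only; in the dataflow diagrams that follow, the per-partition accumulation and the centre combination become separate DAG vertices):

```python
import math

def kmeans_step(points, centres):
    """One iteration: assign each point to its nearest centre, then recompute centres."""
    counts = [0] * len(centres)
    sums = [[0.0] * len(c) for c in centres]
    for p in points:
        # Find the c in C that minimizes dist(p, c).
        best = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
        counts[best] += 1
        sums[best] = [s + x for s, x in zip(sums[best], p)]
    # Replace each centre with centre_c / count_c (keep empty clusters in place).
    return [[s / counts[i] for s in sums[i]] if counts[i] else list(centres[i])
            for i in range(len(centres))]

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centres = [(0.0, 0.0), (1.0, 1.0)]
for _ in range(10):                     # in practice: iterate until centres stop moving
    centres = kmeans_step(points, centres)
# centres is now roughly [[0.1, 0.05], [5.05, 4.95]]
```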

K-means

[Diagram: the loop unrolled as a DAG, C0 → (ac, cc) → C1 → (ac, cc) → C2 → (ac, cc) → C3, with the point set P feeding each stage; in the partitioned version each iteration runs ac over partitions P1, P2, P3 and combines their outputs with cc to produce the next centre set]

Graph algorithms

• Set N of nodes with data (n,x)
• Set E of directed edges (n,m)
• Iterate until convergence:
  – For each node (n,x) in N
    • For each outgoing edge n->m in E, nm = f(x,n,m)
  – For each node (m,x) in N
    • Find set of incoming updates im = {nm : n->m in E}
    • Replace (m,x) <- (m, r(im))

• E.g. power iteration (PageRank): see sketch below
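A minimal sketch of this node/edge pattern specialized to PageRank in plain Python (illustrative only; the damping factor 0.85 and the tiny three-node graph are made-up examples, and dangling nodes are ignored):

```python
def pagerank_step(ranks, edges, damping=0.85):
    """One power iteration: f scatters rank along out-edges, r sums incoming updates."""
    n = len(ranks)
    out_degree = {u: 0 for u in ranks}
    for u, _ in edges:
        out_degree[u] += 1
    # f(x, n, m): node n sends x / out_degree(n) along each edge n -> m.
    incoming = {u: [] for u in ranks}
    for u, v in edges:
        incoming[v].append(ranks[u] / out_degree[u])
    # r(im): the new rank is the damped sum of the incoming updates.
    return {u: (1 - damping) / n + damping * sum(incoming[u]) for u in ranks}

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = {u: 1 / 3 for u in ("a", "b", "c")}
for _ in range(20):                     # iterate until convergence in practice
    ranks = pagerank_step(ranks, edges)
```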

PageRank

[Diagram: the power iteration unrolled as a DAG, N0 → (ae, de) → N1 → (ae, de) → N2 → (ae, de) → N3, with the edge set E feeding each stage; in the partitioned version each iteration applies ae to node/edge partitions (E1, E2, E3), with cc and D (distribute) stages producing the next partitioned rank sets N1, N2, N3]

DAG abstraction

• Absence of cycles
  – Allows re-execution for fault-tolerance
  – Simplifies scheduling: no deadlock (see sketch below)

• Cycles can often be replaced by unrolling
  – Unsuitable for fine-grain inner loops

• Very popular
  – Databases, functional languages, …
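A small sketch of why acyclicity keeps scheduling simple (hypothetical vertex names, plain Python, not Dryad's scheduler): a DAG always has a topological order, so vertices can be dispatched as soon as all their inputs exist and the scheduler can never deadlock.

```python
from graphlib import TopologicalSorter

# Hypothetical job: two map vertices feed a distribute stage, then two reducers.
dag = {
    "D":  {"map1", "map2"},      # D depends on both maps
    "r1": {"D"},
    "r2": {"D"},
}

for vertex in TopologicalSorter(dag).static_order():
    # A real scheduler would dispatch each vertex to a cluster machine once all
    # of its input channels are complete; here we just run them in order.
    print("run", vertex)
# Any vertex that fails can be re-executed from its (immutable) inputs,
# because no cycle can make its inputs depend on its own output.
```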

Rewrite graph at runtime

• Loop unrolling with convergence tests

• Adapt partitioning scheme at run time
  – Choose #partitions based on runtime data volume (see sketch below)
  – Broadcast Join vs. Hash Join, etc.

• Adaptive aggregation and distribution trees
  – Based on data skew and network topology

• Load balancing
  – Data/processing skew (cf. work-stealing)
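As one concrete illustration of such a rewrite (hypothetical numbers and helper name, not Dryad's actual policy): after the upstream stage runs, the job manager knows how much data it actually produced and can size the downstream stage accordingly.

```python
def choose_partition_count(observed_bytes, target_bytes_per_partition=256 * 2**20):
    """Pick #partitions so each downstream vertex sees roughly the target data volume."""
    return max(1, -(-observed_bytes // target_bytes_per_partition))  # ceiling division

# The plan is built before data sizes are known; once the upstream stage has run,
# the graph can be rewritten to match the measured output size.
observed_bytes = 3 * 2**30                  # e.g. ~3 GB actually produced upstream
n = choose_partition_count(observed_bytes)  # -> 12 partitions of ~256 MB each
downstream = [f"reduce_part{i}" for i in range(n)]
```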

Push vs Pull

• Databases typically ‘pull’ using iterator model (see sketch below)
  – Avoids buffering
  – Can prevent unnecessary computation

• But DAG must be fully materialized
  – Complicates rewriting
  – Prevents resource virtualization in a shared cluster

[Diagram: the partitioned reduce DAG (Si → D → G → r → Si’) shown again]
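A minimal sketch of the two styles in plain Python (illustrative only, not Dryad's channel code): pull composes lazy iterators that demand one record at a time, so nothing is buffered but every operator in the chain must be live at once; push lets each stage materialize its output so downstream stages can run later or be re-run.

```python
# Pull: each operator is a lazy iterator; nothing runs until the sink demands records.
def pull_map(source, f):
    for x in source:
        yield f(x)

def pull_filter(source, pred):
    for x in source:
        if pred(x):
            yield x

lines = ["fox", "dog", "fox"]
pulled = list(pull_filter(pull_map(lines, str.upper), lambda x: x == "FOX"))

# Push: each stage materializes its full output (stand-in for a buffered channel);
# the downstream stage can run later, on another machine, or be re-run after a failure.
def push_stage(inputs, f):
    return [f(x) for x in inputs]

stage1 = push_stage(lines, str.upper)
stage2 = [x for x in stage1 if x == "FOX"]
```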

Fault tolerance

• Buffer data in (some) edges
• Re-execute on failure using buffered data (see sketch below)
• Speculatively re-execute for stragglers
• ‘Push’ model makes this very simple
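A minimal sketch of re-execution from buffered inputs (hypothetical helper names; the real system tracks this per vertex in the job manager): since a vertex's inputs are immutable buffered channels, running it again after a transient failure produces the same result.

```python
import random

def run_vertex(vertex_fn, buffered_inputs, max_attempts=5):
    """Re-execute a vertex from its (immutable) buffered inputs until it succeeds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return vertex_fn(buffered_inputs)   # deterministic over the same inputs
        except Exception as err:
            print(f"vertex failed (attempt {attempt}): {err}; re-executing")
    raise RuntimeError("vertex failed on every attempt")

def flaky_sum(inputs):
    if random.random() < 0.5:                   # simulate a transient machine failure
        raise IOError("lost connection to worker")
    return sum(inputs)

total = run_vertex(flaky_sum, buffered_inputs=[1, 2, 3, 4])
```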

Dryad

• General-purpose execution engine
  – Batch processing on immutable datasets
  – Well-tested on large clusters

• Automatically handles
  – Fault tolerance
  – Distribution of code and intermediate data
  – Scheduling of work to resources

Dryad System Architecture

[Diagram: scheduler coordinating the cluster, shown as three animation builds]

Dryad Job Model

• Directed acyclic graph (DAG)

• Clean abstraction
  – Hides cluster services
  – Clients manipulate graphs

• Flexible and expressive
  – General-purpose programs
  – Complicated execution plans

Dryad Inputs and Outputs

• Partitioned data set
  – Records do not cross partition boundaries
  – Data on compute machines: NTFS, SQL Server, …

• Optional semantics
  – Hash-partition, range-partition, sorted, etc.

• Loading external data
  – Partitioning “automatic”
  – File system chooses sensible partition sizes
  – Or known partitioning from user

Channel abstraction

[Diagram: the partitioned reduce DAG with initial reduce (Si → G → ir → D → G → r → Si’); each edge is a channel]

Push vs Pull

• Channel types define connected component
  – Shared-memory or TCP must be gang-scheduled

• Pull within gang, push between gangs

MapReduce (Hadoop)

• MapReduce restricts
  – Topology of DAG
  – Semantics of function in compute vertex

• Sequence of instances for non-trivial tasks

[Diagram: the same DAG with a map vertex f applied to each input partition S1, S2, S3 before the G → ir → D → G → r stages]

MapReduce complexity

• Simple to describe MapReduce system

• Can be hard to map algorithm to framework
  – cf. k-means: combine C+P, broadcast C, iterate, …
  – HIVE, PigLatin etc. mitigate programming issues

• Implementation not uniform
  – Different fault-tolerance for mappers, reducers
  – Add more special cases for performance
    • Hadoop introducing TCP channels, pipelines, …
  – Dryad has the same state machine everywhere

Discussion

• DAG abstraction supports many computations
  – Can be targeted by high-level languages!
  – Run-time rewriting extends applicability

• DAG-structured jobs scale to large clusters
  – Over 10k computers in large Dryad clusters
  – Transient failures common, disk failures daily

• Trade off fault-tolerance against performance
  – Buffer vs TCP: still a manual choice in the Dryad system
  – Also external vs in-memory working set

Conclusion

• Dryad well-tested, scalable
  – Daily use supporting Bing for over 3 years

• Applicable to a large number of computations
  – 250-computer cluster at MSR SVC, Mar->Nov 09
    • 47 distinct users (~50 lab members + interns)
    • 15k jobs (tens of millions of processes executed)
    • Hundreds of distinct programs
  – Network trace analysis, privacy-preserving inference, light-transport simulation, decision-tree training, deep belief network training, image feature extraction, …