cluster computing with dryad - ieeeewh.ieee.org/r6/scv/computer/nfic/2008/microsoft... · cluster...

42
Cluster Computing with DryadLINQ Mihai Budiu Microsoft Research Silicon Valley IEEE Cloud Computing Conference July 19, 2008

Upload: others

Post on 28-May-2020

16 views

Category:

Documents


0 download

TRANSCRIPT

Cluster Computing with DryadLINQ

Mihai BudiuMicrosoft Research Silicon Valley

IEEE Cloud Computing ConferenceJuly 19, 2008

Goal

2

Design Space

3

ThroughputLatency

Internet

Privatedata

center

Data-parallel

Sharedmemory

Data Partitioning

4

RAM

DATA

DATA

Data-Parallel Computation

5

Storage

Execution

Application

Parallel Databases

Map-Reduce

GFSBigTable

CosmosNTFS

Dryad

DryadLINQScope,PSQ

L

Sawzall, Pig

• Introduction

• Dryad

• DryadLINQ

• Applications

6

2-D Piping• Unix Pipes: 1-D

grep | sed | sort | awk | perl

• Dryad: 2-D

grep1000 | sed500 | sort1000 | awk500 | perl50

7

Virtualized 2-D Pipelines

8

Virtualized 2-D Pipelines

9

Virtualized 2-D Pipelines

10

Virtualized 2-D Pipelines

11

Virtualized 2-D Pipelines

12

• 2D DAG• multi-machine• virtualized

Architecture

13

Files, TCP, FIFO, Networkjob schedule

data plane

control plane

NS PD PDPD

V V V

Job manager cluster

Fault Tolerance

• Introduction

• Dryad

• DryadLINQ

• Applications

15

DryadLINQ

16

Dryad

17

LINQ = C# + Queries

Collection<T> collection;

bool IsLegal(Key);

string Hash(Key);

var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};

Collection<T> collection;bool IsLegal(Key k);string Hash(Key);

var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};

18

DryadLINQ = LINQ + Dryad

C#

collection

results

C# C# C#

Vertexcode

Queryplan(Dryad job)

Data

Data Model

19

Partition

Collection

C# objects

Language Summary

20

WhereSelectGroupByOrderByAggregateJoinApplyMaterialize

Demo

21Done

Example: Histogram

22

public static IQueryable<Pair> Histogram(IQueryable<LineRecord> input, int k)

{var words = input.SelectMany(x => x.line.Split(' '));var groups = words.GroupBy(x => x);var counts = groups.Select(x => new Pair(x.Key, x.Count()));var ordered = counts.OrderByDescending(x => x.count);var top = ordered.Take(k);return top;

}

“A line of words of wisdom”

[“A”, “line”, “of”, “words”, “of”, “wisdom”]

[[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]

[ {“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]

[{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]

[{“of”, 2}, {“A”, 1}, {“line”, 1}]

Histogram Plan

23

SelectManyHashDistribute

MergeGroupBy

Select

OrderByDescendingTake

MergeSortTake

Map-Reduce in DryadLINQ

24

public static IQueryable<S> MapReduce<T,M,K,S>(this IQueryable<T> input,Expression<Func<T, IEnumerable<M>>> mapper,Expression<Func<M,K>> keySelector,Expression<Func<IGrouping<K,M>,S>> reducer)

{var map = input.SelectMany(mapper);var group = map.GroupBy(keySelector);var result = group.Select(reducer);return result;

}

Map-Reduce Plan

25

M

R

G

M

Q

G1

R

D

MS

G2

R

static dynamic

X

X

M

Q

G1

R

D

MS

G2

R

X

M

Q

G1

R

D

MS

G2

R

X

M

Q

G1

R

D

M

Q

G1

R

D

MS

G2

R

X

M

Q

G1

R

D

MS

G2

R

X

M

Q

G1

R

D

MS

G2

R

MS

G2

R

map

sort

groupby

reduce

distribute

mergesort

groupby

reduce

mergesort

groupby

reduce

consumer

map

part

ial a

ggre

gatio

nre

ducedynamic

• Introduction

• Dryad

• DryadLINQ

• Applications

26

Applications

27

Dryad

DryadLINQ

Distributed Data Structures

Machine learning

GraphsLog

analysisImage

processing

Combinatorial

optimization

Raytracin

g

Dataanalysis

E.g: Linear Algebra

28

T U Vnmm ×ℜℜℜ ,,=, ,

T

Expectation Maximization (Gaussians)

29

• 160 lines • 3 iterations shown

Conclusions

• Dryad = distributed execution environment• Supports rich software ecosystem

• DryadLINQ = Compiles LINQ to Dryad• C# objects and declarative programming• .Net and Visual Studio

for distributed programming

30

Dryad Job Structure

31

grep

sed

sortawk

perlgrep

grepsed

sort

sort

awk

Inputfiles

Vertices (processes)

Outputfiles

ChannelsStage

Linear Regression

• Data

• Find

• S.t.

mt

nt yx ℜ∈ℜ∈ ,

mnA ×ℜ∈

tt yAx ≈

},...,1{ nt∈

32

Analytic Solution

33

X×XT X×XT X×XT Y×XT Y×XT Y×XT

Σ

X[0] X[1] X[2] Y[0] Y[1] Y[2]

Σ

[ ]-1

*

A

1))(( −××= ∑∑ Ttt t

Ttt t xxxyA

Map

Reduce

Linear Regression Code

Vectors x = input(0), y = input(1);

Matrices xx = x.Map(x, (a,b) => a.OuterProd(b));

OneMatrix xxs = xx.Sum();

Matrices yx = y.Map(x, (a,b) => a.OuterProd(b));

OneMatrix yxs = yx.Sum();

OneMatrix xxinv = xxs.Map(a => a.Inverse());

OneMatrix A = yxs.Map(xxinv, (a, b) => a.Mult(b));34

1))(( −××= ∑∑ Ttt t

Ttt t xxxyA

Dryad = Execution Layer

35

Job (application)

Dryad

Cluster

Pipeline

Shell

Machine

• Many similarities• Exe + app. model• Map+sort+reduce• Few policies• Program=map+reduce• Simple• Mature (> 4 years)• Widely deployed• Hadoop

Dryad Map-Reduce

• Execution layer• Job = arbitrary DAG• Plug-in policies• Program=graph gen.• Complex ( features)• New (< 2 years)• Still growing• Internal

36

PLINQ

37

public static IEnumerable<TSource>DryadSort<TSource, TKey>(IEnumerable<TSource> source,

Func<TSource, TKey> keySelector,IComparer<TKey> comparer,bool isDescending)

{return source.AsParallel().OrderBy(keySelector, comparer);

}

Operations on Large Vectors: Map 1

38

U

T

T Uf

f

f preserves partitioning

V

Map 2 (Pairwise)

39

T Uf

V

U

T

f

Map 3 (Vector-Scalar)

40

T Uf

V

V

40

U

T

f

Reduce (Fold)

41

U UU

U

f

f f f

fU U U

U

Software Stack

42

Windows Server

Cluster Services

Distributed Filesystem (Cosmos)

Dryad

Distributed Shell (Nebula)

PSQL

DryadLINQ

PerlSQL

server

C++

Windows Server

Windows Server

Windows Server

C++

CIFS/NTFS

legacycode

sed, awk, grep, etc.

SSISScope

C# Data structures

Applications

C#

Job

queu

eing

, mon

itori

ng