Machine Learning in DryadLINQ

Kannan Achan, Mihai Budiu

MSR-SVC, 1/30/2008


Page 2: Goal

Page 3: The Software Stack

Layers, top to bottom:

• Machine learning, Data analysis

• Large Vector

• DryadLINQ

• Dryad

• Distributed Filesystem: Cosmos

• Cluster Services

• Windows Server (one instance per machine)

Page 4: Dryad

Page 5: Dryad Jobs

[Figure: a Dryad job graph. Labeled elements: input files, vertices (processes), channels, stages, and output files. The vertices are marked M, X, and R.]

Page 6: LINQ and C#

Page 7: LINQ

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
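The query on this slide can be tried locally with LINQ to Objects. The following is a minimal sketch, not DryadLINQ code: the `Item` class, the string key type, and the bodies of `IsLegal` and `Hash` are placeholders invented for illustration, since the slide only declares their signatures. The projection names its first member `hash` because a C# anonymous type needs an explicit member name when the value comes from a method call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type: the slide only shows that elements expose key and value.
public sealed class Item
{
    public string key;
    public int value;
    public Item(string key, int value) { this.key = key; this.value = value; }
}

public static class LinqSketch
{
    // Placeholder predicate (assumption): a key is legal when it is non-empty.
    public static bool IsLegal(string k) { return !string.IsNullOrEmpty(k); }

    // Placeholder hash (assumption): upper-case the key.
    public static string Hash(string k) { return k.ToUpperInvariant(); }

    // Runs the slide's query over an in-memory collection.
    public static List<KeyValuePair<string, int>> Run(IEnumerable<Item> collection)
    {
        var results = from c in collection
                      where IsLegal(c.key)
                      select new { hash = Hash(c.key), c.value };
        return results
            .Select(r => new KeyValuePair<string, int>(r.hash, r.value))
            .ToList();
    }

    public static void Main()
    {
        var collection = new List<Item>
        {
            new Item("a", 1), new Item("", 2), new Item("b", 3)
        };
        foreach (var r in Run(collection))
            Console.WriteLine(r.Key + " -> " + r.Value);
    }
}
```

The same query text is what DryadLINQ would distribute. Here it simply runs in one process.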

Page 8: DryadLINQ = LINQ + Dryad

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

[Figure: the C# query from collection to results is turned into a query plan (a Dryad job) whose vertices run C# vertex code over the data.]

Page 9: Recall: The Software Stack

Layers, top to bottom:

• Machine learning, Data analysis

• Large Vector

• DryadLINQ

• Dryad

• Distributed Filesystem: Cosmos

• Cluster Services

• Windows Server (one instance per machine)

Page 10: Very Large Vector Library

• PartitionedVector<T> [Figure: a vector split into several partitions, each holding elements of type T.]

• Scalar<T> [Figure: a single value of type T.]

Page 11: Operations on Large Vectors: Map 1

[Figure: a function f: T → U is applied to a vector of T, partition by partition, producing a vector of U.]

f preserves partitioning
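A partition-preserving map can be sketched on a single machine as follows. `LocalPartitionedVector<T>` is a hypothetical in-memory stand-in written for this example. It is not the Large Vector library's `PartitionedVector<T>`, whose API the slides do not show.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a partitioned vector: a list of in-memory partitions.
public sealed class LocalPartitionedVector<T>
{
    public IReadOnlyList<T[]> Partitions { get; private set; }

    public LocalPartitionedVector(IEnumerable<T[]> partitions)
    {
        Partitions = partitions.ToList();
    }

    // Applies f to every element, one partition at a time.
    // The result has the same number of partitions, each of the same size.
    public LocalPartitionedVector<U> Map<U>(Func<T, U> f)
    {
        return new LocalPartitionedVector<U>(
            Partitions.Select(p => p.Select(f).ToArray()));
    }
}

public static class MapSketch
{
    public static void Main()
    {
        var v = new LocalPartitionedVector<int>(
            new[] { new[] { 1, 2 }, new[] { 3 } });
        var squares = v.Map(x => (double)x * x);
        Console.WriteLine(string.Join(" | ",
            squares.Partitions.Select(p => string.Join(",", p))));
    }
}
```

Because each output partition depends only on the matching input partition, every partition could be processed by a separate vertex with no data exchange.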

Page 12: Map 2 (Pairwise)

[Figure: a function f: (T, U) → V is applied to corresponding elements of a vector of T and a vector of U, producing a vector of V.]

Page 13: Map 3 (Vector-Scalar)

[Figure: a function f: (T, U) → V is applied to each element of a vector of T together with a single scalar of type U, producing a vector of V.]
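The pairwise and vector-scalar maps can be sketched in the same spirit. This is an illustration only: vectors are modeled as jagged arrays (one inner array per partition), the method names `MapPairwise` and `MapScalar` are invented here, and the pairwise version assumes both inputs are partitioned identically.

```csharp
using System;
using System.Linq;

public static class MapVariantsSketch
{
    // Pairwise map: f combines element i of partition p in a with
    // element i of partition p in b. Assumes identical partitioning.
    public static V[][] MapPairwise<T, U, V>(T[][] a, U[][] b, Func<T, U, V> f)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("partition counts differ");
        return a.Zip(b, (pa, pb) =>
        {
            if (pa.Length != pb.Length)
                throw new ArgumentException("partition sizes differ");
            return pa.Zip(pb, f).ToArray();
        }).ToArray();
    }

    // Vector-scalar map: the same scalar is passed to f with every element.
    public static V[][] MapScalar<T, U, V>(T[][] a, U scalar, Func<T, U, V> f)
    {
        return a.Select(p => p.Select(x => f(x, scalar)).ToArray()).ToArray();
    }

    public static void Main()
    {
        var a = new[] { new[] { 1, 2 }, new[] { 3 } };
        var b = new[] { new[] { 10, 20 }, new[] { 30 } };
        var sums = MapPairwise(a, b, (x, y) => x + y);
        var scaled = MapScalar(a, 0.5, (x, s) => x * s);
        Console.WriteLine(string.Join(" | ", sums.Select(p => string.Join(",", p))));
        Console.WriteLine(string.Join(" | ", scaled.Select(p => string.Join(",", p))));
    }
}
```

In a distributed setting the scalar would have to be made available to every partition, which this single-process sketch does not model.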

Page 14: Reduce (Fold)

[Figure: a function f: (U, U) → U is applied repeatedly to combine the elements of a vector of U into a single U.]
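A reduce over a partitioned vector can be sketched as two stages: fold each partition locally, then fold the partial results. This is a simplified single-process illustration with an invented method name. It is only correct when f is associative, and the slide's figure may combine the partial results in a different shape than the flat second stage used here.

```csharp
using System;
using System.Linq;

public static class ReduceSketch
{
    // Folds a partitioned vector (one inner array per partition) down to one value.
    // f must be associative for the two-stage result to match a sequential fold.
    public static U Reduce<U>(U[][] partitions, Func<U, U, U> f)
    {
        // Stage 1: fold every non-empty partition on its own.
        var partials = partitions
            .Where(p => p.Length > 0)
            .Select(p => p.Aggregate(f))
            .ToList();

        // Stage 2: combine the per-partition results.
        // Aggregate throws InvalidOperationException if there are no elements at all.
        return partials.Aggregate(f);
    }

    public static void Main()
    {
        var v = new[] { new[] { 1, 2 }, new int[0], new[] { 3, 4 } };
        Console.WriteLine(Reduce(v, (a, b) => a + b));
    }
}
```

Only the small per-partition results cross partition boundaries, which is why the first stage can run where the data already lives.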

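Page 15: Linear Algebra

T, U, V = [garbled in the transcript; the surviving superscripts n and m suggest that the element types are real vectors and matrices with dimensions m and n]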

Page 16: Linear Regression

• Data: $x_t \in \mathbb{R}^n$, $y_t \in \mathbb{R}^m$

• Find: $A \in \mathbb{R}^{m \times n}$

• S.t.: $A x_t \approx y_t$ for $t \in \{1, \dots, n\}$

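Page 17: Analytic Solution

$A = \left(\sum_t y_t x_t^T\right) \left(\sum_t x_t x_t^T\right)^{-1}$

[Figure: dataflow for this formula over input partitions X[0], X[1], X[2] and Y[0], Y[1], Y[2]. Map: each partition computes X×XT and Y×XT. Reduce: each set of products is summed (Σ). The sum of the X×XT terms is inverted ([ ]-1) and multiplied (*) with the sum of the Y×XT terms to produce A.]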
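The slides state the solution without deriving it. Assuming the fit is meant in the least-squares sense (an assumption, since the transcript does not show the objective) and that $\sum_t x_t x_t^T$ is invertible, the formula follows from setting the gradient of the squared error to zero:

```latex
\begin{aligned}
J(A) &= \sum_t \lVert A x_t - y_t \rVert^2 \\
\frac{\partial J}{\partial A} &= 2 \sum_t (A x_t - y_t)\, x_t^T = 0 \\
A \sum_t x_t x_t^T &= \sum_t y_t x_t^T \\
A &= \Bigl(\sum_t y_t x_t^T\Bigr) \Bigl(\sum_t x_t x_t^T\Bigr)^{-1}
\end{aligned}
```

Both sums are sums of per-sample terms, so each partition can compute its own contribution independently before a final reduction.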

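Page 18: Linear Regression Code

$A = \left(\sum_t y_t x_t^T\right) \left(\sum_t x_t x_t^T\right)^{-1}$

// Per-sample outer products x_t * x_t^T, then their sum.
Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();
// Per-sample outer products y_t * x_t^T, then their sum.
Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();
// Invert the first sum and multiply it onto the second.
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));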
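The same computation can be checked on one machine with plain arrays. The sketch below mirrors the steps of the slide code (outer products, sums, inverse, multiply) but does not use the Large Vector library: all helper names are invented, and the inverse is hard-coded for 2×2 matrices, so it only handles $x_t \in \mathbb{R}^2$.

```csharp
using System;
using System.Linq;

public static class RegressionSketch
{
    // Outer product a * b^T as an a.Length x b.Length matrix.
    public static double[,] Outer(double[] a, double[] b)
    {
        var r = new double[a.Length, b.Length];
        for (int i = 0; i < a.Length; i++)
            for (int j = 0; j < b.Length; j++)
                r[i, j] = a[i] * b[j];
        return r;
    }

    // Element-wise sum of two matrices of equal shape.
    public static double[,] Add(double[,] p, double[,] q)
    {
        var r = new double[p.GetLength(0), p.GetLength(1)];
        for (int i = 0; i < r.GetLength(0); i++)
            for (int j = 0; j < r.GetLength(1); j++)
                r[i, j] = p[i, j] + q[i, j];
        return r;
    }

    // Standard matrix product p * q.
    public static double[,] Multiply(double[,] p, double[,] q)
    {
        var r = new double[p.GetLength(0), q.GetLength(1)];
        for (int i = 0; i < p.GetLength(0); i++)
            for (int j = 0; j < q.GetLength(1); j++)
                for (int k = 0; k < p.GetLength(1); k++)
                    r[i, j] += p[i, k] * q[k, j];
        return r;
    }

    // Closed-form inverse, valid for 2x2 matrices only.
    public static double[,] Inverse2x2(double[,] m)
    {
        double det = m[0, 0] * m[1, 1] - m[0, 1] * m[1, 0];
        if (Math.Abs(det) < 1e-12)
            throw new InvalidOperationException("matrix is singular");
        return new double[,]
        {
            {  m[1, 1] / det, -m[0, 1] / det },
            { -m[1, 0] / det,  m[0, 0] / det }
        };
    }

    // A = (sum_t y_t x_t^T) * (sum_t x_t x_t^T)^-1
    public static double[,] Fit(double[][] x, double[][] y)
    {
        var xxs = x.Select(v => Outer(v, v)).Aggregate(Add);
        var yxs = y.Zip(x, (yt, xt) => Outer(yt, xt)).Aggregate(Add);
        return Multiply(yxs, Inverse2x2(xxs));
    }

    public static void Main()
    {
        // Samples generated from y = 2*x0 + 3*x1.
        var x = new[] { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 1.0 } };
        var y = new[] { new[] { 2.0 }, new[] { 3.0 }, new[] { 5.0 } };
        var A = Fit(x, y);
        Console.WriteLine(A[0, 0] + " " + A[0, 1]);
    }
}
```

The two `Aggregate(Add)` calls play the role of `Sum()` on the slide: they are the only steps that combine data from different samples.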

Page 19: Expectation Maximization

• 160 lines

• 3 iterations shown

Page 20: Understanding Botnet Traffic using EM

• 3 GB data

• 15 clusters

• 60 computers

• 50 iterations

• 9000 processes

• 50 minutes

Page 21: Conclusions

• Dryad simplifies programming large clusters

• DryadLINQ = declarative programming for Dryad jobs

• The Large Vector library provides simple mathematical primitives on top of DryadLINQ

• Matlab-style coding for writing distributed numeric computations

[Figure: the software stack in abbreviated form. Labels: ML, Data analysis, Large Vector, DryadLINQ, Dryad, Distributed Filesystem, Cluster Services, Win (three instances).]

Page 22: Backup Slides

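Page 23: Chaining

$A = \left(\sum_t y_t x_t^T\right) \left(\sum_t x_t x_t^T\right)^{-1}$

[Figure: dataflow over input partitions X[0], X[1], X[2] and Y[0], Y[1], Y[2]. Each partition computes X×XT or Y×XT, and a Σ is shown for each of these six products in addition to the two overall Σ nodes. The summed X×XT terms are inverted ([ ]-1) and multiplied (*) with the summed Y×XT terms to produce A.]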

Page 24: EM Structure

[Figure: structure of the EM job. Labels recoverable from the transcript: E stage, Input size, π, σ, μ, All parameters.]