Machine Learning in DryadLINQ

Kannan Achan, Mihai Budiu
MSR-SVC, 1/30/2008


1

Machine Learning in DryadLINQ

Kannan Achan
Mihai Budiu

MSR-SVC, 1/30/2008

2

Goal

3

The Software Stack

Layers, top to bottom:

• Machine learning, Data analysis
• Large Vector
• DryadLINQ
• Dryad
• Distributed Filesystem: Cosmos
• Cluster Services
• Windows Server, Windows Server, Windows Server

4

Dryad

5

Dryad Jobs

[Figure: a Dryad job graph. Vertices (processes) labeled M, X, and R are grouped into stages and connected by channels; the graph reads from input files and writes to output files.]

6

LINQ and C#

7

LINQ

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
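The query above can be tried on an ordinary in-memory collection before thinking about a cluster. The following is a minimal sketch, not the authors' code: the `Item` type, the bodies of `IsLegal` and `Hash`, and the `Run` wrapper are all hypothetical stand-ins, since the slide only declares the signatures.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type: the slide only shows that elements expose .key and .value.
public sealed class Item
{
    public string key { get; }
    public int value { get; }
    public Item(string key, int value) { this.key = key; this.value = value; }
}

public static class LinqExample
{
    // Stand-in predicate (assumption): the slide does not define IsLegal.
    public static bool IsLegal(string key) => !string.IsNullOrEmpty(key);

    // Stand-in hash (assumption): the slide does not define Hash.
    public static string Hash(string key) => key.ToUpperInvariant();

    // Same shape as the slide's query: filter on the key, then project.
    public static List<(string hash, int value)> Run(IEnumerable<Item> collection)
    {
        var results = from c in collection
                      where IsLegal(c.key)
                      select (hash: Hash(c.key), c.value);
        return results.ToList();
    }
}
```

Calling `LinqExample.Run` on a small list keeps only the items whose key passes `IsLegal` and pairs each hashed key with its value.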

8

DryadLINQ = LINQ + Dryad

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

[Figure: the C# query over collection is turned into a query plan (a Dryad job) plus C# vertex code; the job runs over the data and produces results.]

9

Recall: The Software Stack

Layers, top to bottom:

• Machine learning, Data analysis
• Large Vector
• DryadLINQ
• Dryad
• Distributed Filesystem: Cosmos
• Cluster Services
• Windows Server, Windows Server, Windows Server

10

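Very Large Vector Library

PartitionedVector<T>
[Figure: a vector of elements of type T split across several partitions.]

Scalar<T>
[Figure: a single value of type T.]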

11

Operations on Large Vectors: Map 1

[Figure: f maps T to U and is applied to each partition of a vector of T, producing a vector of U.]

f preserves partitioning
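To make "f preserves partitioning" concrete, here is a small in-memory sketch. It is an illustration under assumptions, not the Large Vector library's API: a partitioned vector is modeled as `T[][]` (one inner array per partition), and the class and method names are hypothetical.

```csharp
using System;
using System.Linq;

public static class PartitionedMap
{
    // Apply f to every element, one partition at a time.
    // The output has the same number of partitions, with the same sizes, as the input.
    public static U[][] Map<T, U>(T[][] partitions, Func<T, U> f) =>
        partitions.Select(p => p.Select(f).ToArray()).ToArray();
}
```

Because each output partition depends only on the matching input partition, no data has to move between partitions.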

12

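Map 2 (Pairwise)

[Figure: f takes a T and a U and returns a V; it is applied to corresponding elements of a vector of T and a vector of U, producing a vector of V.]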
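The pairwise map can be sketched in the same in-memory style. This is an assumption-laden illustration, not the library's implementation: vectors are modeled as arrays of partitions, both inputs are assumed to be partitioned identically, and `MapPairwise` is a hypothetical name.

```csharp
using System;
using System.Linq;

public static class PartitionedPairwise
{
    // Combine corresponding elements of two identically partitioned vectors.
    public static V[][] MapPairwise<T, U, V>(T[][] left, U[][] right, Func<T, U, V> f)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Inputs must have the same number of partitions.");
        return left.Zip(right, (p, q) =>
        {
            if (p.Length != q.Length)
                throw new ArgumentException("Matching partitions must have the same size.");
            return p.Zip(q, f).ToArray();
        }).ToArray();
    }
}
```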

13

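Map 3 (Vector-Scalar)

[Figure: f takes a T and a V and returns a U; it combines every element of a vector of T with a single scalar V, producing a vector of U.]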
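A vector-scalar map can be sketched by passing the same scalar to every partition. As before, this is only an in-memory illustration with hypothetical names; how the real library distributes the scalar to the partitions is not shown on the slide.

```csharp
using System;
using System.Linq;

public static class PartitionedScalarMap
{
    // Combine each element with one shared scalar; partitioning is unchanged.
    public static U[][] MapWithScalar<T, V, U>(T[][] partitions, V scalar, Func<T, V, U> f) =>
        partitions.Select(p => p.Select(x => f(x, scalar)).ToArray()).ToArray();
}
```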

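14

Reduce (Fold)

[Figure: f combines two values of type U into one; it folds the elements within each partition, then folds the per-partition results into a single U.]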
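The two-level fold can be sketched as follows. This is a hypothetical in-memory version, not the library's code; it assumes f is associative, since otherwise folding per partition and then across partitions can give a different answer than a single left-to-right fold.

```csharp
using System;
using System.Linq;

public static class PartitionedReduce
{
    // Fold each non-empty partition with f, then fold the partial results with f.
    // Throws InvalidOperationException if every partition is empty.
    public static U Reduce<U>(U[][] partitions, Func<U, U, U> f) =>
        partitions.Where(p => p.Length > 0)
                  .Select(p => p.Aggregate(f))
                  .Aggregate(f);
}
```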

15

Linear Algebra

[Equation: T, U, V = …; the right-hand side was lost in extraction, and only the indices n and m survive.]

16

Linear Regression

• Data: $x_t \in \mathbb{R}^n,\ y_t \in \mathbb{R}^m$, $t \in \{1, \dots, n\}$

• Find: $A \in \mathbb{R}^{m \times n}$

• S.t.: $A x_t \approx y_t$

17

Analytic Solution

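$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$

[Figure: a job graph over the partitions X[0], X[1], X[2] and Y[0], Y[1], Y[2]. Map: each partition computes X×Xᵀ or Y×Xᵀ. Reduce: each group of three is summed (Σ). The X×Xᵀ sum is inverted ([ ]⁻¹) and multiplied (*) with the Y×Xᵀ sum to give A.]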
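The closed form on this slide follows from a short derivation, given here under the assumption (not stated on the slides) that "A x_t ≈ y_t" means least squares and that $\sum_t x_t x_t^T$ is invertible:

```latex
\begin{aligned}
J(A) &= \sum_t \lVert A x_t - y_t \rVert^2 \\
\frac{\partial J}{\partial A} &= 2 \sum_t (A x_t - y_t)\, x_t^T = 0 \\
A \sum_t x_t x_t^T &= \sum_t y_t x_t^T \\
A &= \Big(\sum_t y_t x_t^T\Big) \Big(\sum_t x_t x_t^T\Big)^{-1}
\end{aligned}
```

Both sums are over per-example outer products, which is why the computation splits into a map over the data followed by a reduce.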

18

Linear Regression Code

$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$

Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();
Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
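The same sequence of steps can be checked on a single machine with plain arrays. The sketch below mirrors the slide's pipeline (outer products, sums, inverse, multiply), but everything in it is a hypothetical stand-in: `Outer`, `Add`, `Multiply`, `Inverse`, and `Fit` are illustrative helpers, not the Large Vector library's `PairwiseOuterProduct`, `Sum`, or `Map`.

```csharp
using System;
using System.Linq;

public static class LinearRegressionSketch
{
    // Outer product a * b^T.
    public static double[,] Outer(double[] a, double[] b)
    {
        var r = new double[a.Length, b.Length];
        for (int i = 0; i < a.Length; i++)
            for (int j = 0; j < b.Length; j++)
                r[i, j] = a[i] * b[j];
        return r;
    }

    // Element-wise sum of two matrices of the same shape.
    public static double[,] Add(double[,] p, double[,] q)
    {
        int rows = p.GetLength(0), cols = p.GetLength(1);
        var r = new double[rows, cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                r[i, j] = p[i, j] + q[i, j];
        return r;
    }

    // Matrix product p * q.
    public static double[,] Multiply(double[,] p, double[,] q)
    {
        int rows = p.GetLength(0), inner = p.GetLength(1), cols = q.GetLength(1);
        var r = new double[rows, cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                for (int k = 0; k < inner; k++)
                    r[i, j] += p[i, k] * q[k, j];
        return r;
    }

    // Gauss-Jordan inverse with partial pivoting (square matrices only).
    public static double[,] Inverse(double[,] m)
    {
        int n = m.GetLength(0);
        var a = (double[,])m.Clone();
        var inv = new double[n, n];
        for (int i = 0; i < n; i++) inv[i, i] = 1.0;
        for (int col = 0; col < n; col++)
        {
            int pivot = col;
            for (int r = col + 1; r < n; r++)
                if (Math.Abs(a[r, col]) > Math.Abs(a[pivot, col])) pivot = r;
            if (Math.Abs(a[pivot, col]) < 1e-12)
                throw new InvalidOperationException("Matrix is singular.");
            for (int j = 0; j < n; j++)
            {
                (a[col, j], a[pivot, j]) = (a[pivot, j], a[col, j]);
                (inv[col, j], inv[pivot, j]) = (inv[pivot, j], inv[col, j]);
            }
            double d = a[col, col];
            for (int j = 0; j < n; j++) { a[col, j] /= d; inv[col, j] /= d; }
            for (int r = 0; r < n; r++)
            {
                if (r == col) continue;
                double factor = a[r, col];
                for (int j = 0; j < n; j++)
                {
                    a[r, j] -= factor * a[col, j];
                    inv[r, j] -= factor * inv[col, j];
                }
            }
        }
        return inv;
    }

    // A = (sum_t y_t x_t^T) * (sum_t x_t x_t^T)^-1
    public static double[,] Fit(double[][] x, double[][] y)
    {
        var xxs = x.Select(v => Outer(v, v)).Aggregate(Add);           // map, then reduce
        var yxs = x.Zip(y, (xv, yv) => Outer(yv, xv)).Aggregate(Add);  // pairwise map, then reduce
        return Multiply(yxs, Inverse(xxs));
    }
}
```

Feeding `Fit` a few points generated from a known matrix should recover that matrix up to floating-point error, which is a quick sanity check on the formula.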

19

Expectation Maximization

• 160 lines

• 3 iterations shown

20

Understanding Botnet Traffic using EM

• 3 GB data

• 15 clusters

• 60 computers

• 50 iterations

• 9000 processes

• 50 minutes

21

Conclusions

• Dryad simplifies programming large clusters

• DryadLINQ = declarative programming for Dryad jobs

• The Large Vector library provides simple mathematical primitives on top of DryadLINQ

• Matlab-style coding for writing distributed numeric computations

[Figure: the software stack, bottom to top: Win (×3), Cluster Services, Distributed Filesystem, Dryad, DryadLINQ, Large Vector, ML and Data analysis.]

22

Backup Slides

23

Chaining

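$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$

[Figure: the job graph from the Analytic Solution slide (X[0..2] and Y[0..2] feeding X×Xᵀ and Y×Xᵀ vertices, two Σ vertices, [ ]⁻¹, *, and A), with six additional Σ vertices, one for each X×Xᵀ and Y×Xᵀ vertex.]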

24

EM Structure

[Figure: the E stage of the EM job. Labels: Input size, π, σ, μ, All parameters.]