dryadlinq a system for general-purpose distributed data-parallel computing yuan yu, michael isard,...

29
DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey Microsoft Research Silicon Valley

Post on 19-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

DryadLINQA System for General-Purpose

Distributed Data-Parallel Computing

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu,Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey

Microsoft Research Silicon Valley

Page 2: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Distributed Data-Parallel Computing

• Research problem: How to write distributed data-parallel programs for a compute cluster?

• The DryadLINQ programming model– Sequential, single machine programming abstraction– Same program runs on single-core, multi-core, or cluster– Familiar programming languages– Familiar development environment

Page 3: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

DryadLINQ Overview

Automatic query plan generation by DryadLINQ Automatic distributed execution by Dryad

Page 4: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

LINQ• Microsoft’s Language INtegrated Query

– Available in Visual Studio products• A set of operators to manipulate datasets in .NET

– Support traditional relational operators• Select, Join, GroupBy, Aggregate, etc.

– Integrated into .NET programming languages• Programs can call operators• Operators can invoke arbitrary .NET functions

• Data model– Data elements are strongly typed .NET objects– Much more expressive than SQL tables

• Highly extensible– Add new custom operators– Add new execution providers

Page 5: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

LINQ System Architecture

PLINQ

Local machine

.Netprogram(C#, VB, F#, etc)

Execution engines

Query

Objects

LINQ-to-SQL

DryadLINQ

LINQ-to-ObjLIN

Q p

rovi

der i

nter

face

Scalability

Single-core

Multi-core

Cluster

Page 6: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

6

Dryad System Architecture

Files, TCP, FIFO, Networkjob schedule

data plane

control plane

NS PD PDPD

V V V

Job manager cluster

Page 7: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

A Simple LINQ Example: Word Count

Count word frequency in a set of documents:

var docs = [A collection of documents];var words = docs.SelectMany(doc => doc.words);var groups = words.GroupBy(word => word);var counts = groups.Select(g => new WordCount(g.Key, g.Count()));

Page 8: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Word Count in DryadLINQ

Count word frequency in a set of documents:

var docs = DryadLinq.GetTable<Doc>(“file://docs.txt”);var words = docs.SelectMany(doc => doc.words);var groups = words.GroupBy(word => word);var counts = groups.Select(g => new WordCount(g.Key, g.Count()));

counts.ToDryadTable(“counts.txt”);

Page 9: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Distributed Execution of Word Count

SM

DryadLINQGB

S

LINQ expression

IN

OUT

Dryad execution

Page 10: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

DryadLINQ System Architecture

10

DryadLINQClient machine

(11)

Distributedquery plan

.NET program

Query Expr

Data center

Output TablesResults

Input TablesInvoke Query

Output DryadTable

Dryad Execution

.Net Objects

JM

ToTable

foreach

Vertexcode

Page 11: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

DryadLINQ Internals• Distributed execution plan

– Static optimizations: pipelining, eager aggregation, etc.– Dynamic optimizations: data-dependent partitioning,

dynamic aggregation, etc.

• Automatic code generation– Vertex code that runs on vertices– Channel serialization code– Callback code for runtime optimizations– Automatically distributed to cluster machines

• Separate LINQ query from its local context– Distribute referenced objects to cluster machines– Distribute application DLLs to cluster machines

Page 12: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

12

Execution Plan for Word Count

(1)

SM

GB

S

SM

Q

GB

C

D

MS

GB

Sum

SelectMany

sort

groupby

count

distribute

mergesort

groupby

Sum

pipelined

pipelined

Page 13: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

13

Execution Plan for Word Count

(1)

SM

GB

S

SM

Q

GB

C

D

MS

GB

Sum

(2)

SM

Q

GB

C

D

MS

GB

Sum

SM

Q

GB

C

D

MS

GB

Sum

SM

Q

GB

C

D

MS

GB

Sum

Page 14: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

14

MapReduce in DryadLINQ

MapReduce(source, // sequence of Ts mapper, // T -> Ms keySelector, // M -> K reducer) // (K, Ms) -> Rs{ var map = source.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.SelectMany(reducer); return result; // sequence of Rs}

Page 15: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Map-Reduce Plan(When reduce is combiner-enabled)

M

Q

G1

C

D

MS

G2

R

M

Q

G1

C

D

MS

G2

R

M

Q

G1

C

D

MS

G2

R

MS

G2

R

map

sort

groupby

combine

distribute

mergesort

groupby

reduce

mergesort

groupby

reducem

apD

ynam

ic a

ggre

gatio

nre

duce

Page 16: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

An Example: PageRankRanks web pages by propagating scores along hyperlink structure

Each iteration as an SQL query:

1. Join edges with ranks2. Distribute ranks on edges3. GroupBy edge destination4. Aggregate into ranks5. Repeat

Page 17: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

One PageRank Step in DryadLINQ// one step of pagerank: dispersing and re-accumulating rankpublic static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks){ // join pages with ranks, and disperse updates var updates = from page in pages join rank in ranks on page.name equals rank.name select page.Disperse(rank);

// re-accumulate. return from list in updates from rank in list group rank.rank by rank.name into g select new Rank(g.Key, g.Sum());}

Page 18: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

The Complete PageRank Program

var pages = DryadLinq.GetTable<Page>(“file://pages.txt”); var ranks = pages.Select(page => new Rank(page.name, 1.0)); // repeat the iterative computation several times for (int iter = 0; iter < iterations; iter++) { ranks = PRStep(pages, ranks); }

ranks.ToDryadTable<Rank>(“outputranks.txt”);

public struct Page { public UInt64 name; public Int64 degree; public UInt64[] links;

public Page(UInt64 n, Int64 d, UInt64[] l) { name = n; degree = d; links = l; }

public Rank[] Disperse(Rank rank) { Rank[] ranks = new Rank[links.Length]; double score = rank.rank / this.degree; for (int i = 0; i < ranks.Length; i++) { ranks[i] = new Rank(this.links[i], score); } return ranks; } }

public struct Rank { public UInt64 name; public double rank;

public Rank(UInt64 n, double r) { name = n; rank = r; } }

public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks) { // join pages with ranks, and disperse updates var updates = from page in pages join rank in ranks on page.name equals rank.name select page.Disperse(rank);

// re-accumulate. return from list in updates from rank in list group rank.rank by rank.name into g select new Rank(g.Key, g.Sum());}

Page 19: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

One Iteration PageRank

J

S

G

C

D

M

G

R

J

S

G

C

D

M

G

R

J

S

G

C

D

Join pages and ranks

Disperse page’s rank

Group rank by page

Accumulate ranks, partially

Hash distribute

Merge the data

Group rank by page

Accumulate ranks

M

G

R

Dynamic aggregation

Page 20: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Multi-Iteration PageRankpages ranks

Iteration 1

Iteration 2

Iteration 3

Memory FIFO

Page 21: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

LINQ System Architecture

PLINQ

Local machine

.Netprogram(C#, VB, F#, etc)

Execution engines

Query

Objects

LINQ-to-SQL

DryadLINQ

LINQ-to-ObjLIN

Q p

rovi

der i

nter

face

Scalability

Single-core

Multi-core

Cluster

Page 22: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

22

Combining with PLINQ

Query

DryadLINQ

PLINQ

subquery

Page 23: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

23

Combining with LINQ-to-SQL

DryadLINQ

Subquery Subquery Subquery Subquery Subquery

Query

LINQ-to-SQL LINQ-to-SQL

Page 24: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Combining with LINQ-to-Objects

Query

DryadLINQ

Local machine

Cluster

LINQ-to-Object

debug

production

Page 25: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Current Status

• Works with any LINQ enabled language– C#, VB, F#, IronPython, …

• Works with multiple storage systems– NTFS, SQL, Windows Azure, Cosmos DFS

• Released internally within Microsoft– Used on a variety of applications

• External academic release announced at PDC– DryadLINQ in source, Dryad in binary– UW, UCSD, Indiana, ETH, Cambridge, …

Page 26: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

26

ImageProcessing

Cosmos DFSSQL Servers

Software Stack

Windows Server

Cluster Services

Azure Platform

Dryad

DryadLINQ

Windows Server

Windows Server

Windows Server

Other Languages

CIFS/NTFS

MachineLearning

GraphAnalysis

DataMining

Applications

…Other Applications

Page 27: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Lessons• Deep language integration worked out well

– Easy expression of massive parallelism– Elegant, unified data model based on .NET objects– Multiple language support: C#, VB, F#, …– Visual Studio and .NET libraries– Interoperate with PLINQ, LINQ-to-SQL, LINQ-to-Object, …

• Key enablers– Language side

• LINQ extensibility: custom operators/providers• .NET reflection, dynamic code generation, …

– System side• Dryad generality: DAG model, runtime callback• Clean separation of Dryad and DryadLINQ

Page 28: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Future Directions• Goal: Use a cluster as if it is a single computer

– Dryad/DryadLINQ represent a modest step

• On-going research– What can we write with DryadLINQ?

• Where and how to generalize the programming model?

– Performance, usability, etc.• How to debug/profile/analyze DryadLINQ apps?

– Job scheduling• How to schedule/execute N concurrent jobs?

– Caching and incremental computation• How to reuse previously computed results?

– Static program checking• A very compelling case for program analysis?• Better catch bugs statically than fighting them in the cloud?

Page 29: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep

Conclusions

A powerful, elegant programming environment for large-scale data-parallel computing

To request a copy of Dryad/DryadLINQ, contact [email protected]

For academic use only

See a demo of the system at the poster session!