DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey. Microsoft Research Silicon Valley. Presented by: TD (Tathagata Das).


Post on 23-Feb-2016


Page 1: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey

Microsoft Research Silicon Valley

Presented by: TD (Tathagata Das)

Page 2: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Designing a general-purpose language for writing distributed data-parallel programs for a compute cluster

• General purpose

• Single-thread abstraction

• Familiar language / environment

Page 3: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Shell script : Shell : Machine ≈ ??? : Dryad : Cluster

Dryad = Execution Engine

Page 4: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

• Nebula – limited to existing binaries

• Scope – SQL-ish, not general purpose

• Can we do better?
  – Can we get the general-purposeness of C#/Java and the conciseness of SQL?
  – And at the same time, be efficient too?

Can I have my cake and eat it too?

Page 5: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Language Integrated Query (LINQ)

Page 6: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Language Integrated Query (LINQ)

• The creamy goodness of SQL-like declarative queries within a general-purpose programming language

• Basic abstraction - collections

“All the world’s a collection, and all the men and women merely iterate on collections”
- implied by Shakespeare

Page 7: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Collections, Iterators and LINQ

IEnumerable<T> + LINQ =>

using System.Linq;

var result = from num in numbers
             where num % 2 == 0
             orderby num
             select num;

IEnumerable<T> =>

List<int> result = new List<int>();
foreach (int num in numbers) {
    if (num % 2 == 0)
        result.Add(num);
}
result.Sort();

Page 8: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Syntactical sweetness of LINQ

Query Style

var result = from num in numbers
             where num % 2 == 0
             orderby num
             select num;

Method Style

var result = numbers
    .Where(num => num % 2 == 0)
    .OrderBy(n => n);

Page 9: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

LINQ Functionality

• Select / SelectMany: Map (1-to-1 / 1-to-many)

• Where: Filter

• GroupBy: Reduce

• OrderBy: Sort

• Join: Join

• Union / Intersect / Except: Set operations

• …
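The operator-to-role mapping on this slide can be tried on ordinary in-memory collections. The sketch below uses only standard LINQ to Objects calls; the class name `LinqOperatorTour`, the helper names, and the sample data are made up for illustration and do not come from the talk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqOperatorTour
{
    // Map, 1-to-1: Select yields exactly one output per input.
    public static List<int> Squares(IEnumerable<int> xs) =>
        xs.Select(x => x * x).ToList();

    // Map, 1-to-many: SelectMany flattens the words of every line.
    public static List<string> Words(IEnumerable<string> lines) =>
        lines.SelectMany(l => l.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
             .ToList();

    // Filter: Where keeps the elements that satisfy the predicate.
    public static List<int> Evens(IEnumerable<int> xs) =>
        xs.Where(x => x % 2 == 0).ToList();

    // Reduce: GroupBy collects equal keys, then each group is folded to a count.
    public static Dictionary<string, int> Counts(IEnumerable<string> words) =>
        words.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());

    // Sort: OrderBy returns the elements in ascending order.
    public static List<int> Sorted(IEnumerable<int> xs) =>
        xs.OrderBy(x => x).ToList();

    // Join: pair up elements of two collections that share a key.
    public static List<string> JoinById(
        IEnumerable<KeyValuePair<int, string>> names,
        IEnumerable<KeyValuePair<int, string>> cities) =>
        names.Join(cities, n => n.Key, c => c.Key,
                   (n, c) => n.Value + " lives in " + c.Value).ToList();

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Squares(new[] { 1, 2, 3 })));   // 1,4,9
        Console.WriteLine(string.Join(",", Evens(new[] { 1, 2, 3, 4 })));  // 2,4
        Console.WriteLine(string.Join(",", Sorted(new[] { 3, 1, 2 })));    // 1,2,3

        // Set operations on two small collections.
        var a = new[] { 1, 2, 3 };
        var b = new[] { 3, 4 };
        Console.WriteLine(string.Join(",", a.Union(b)));      // 1,2,3,4
        Console.WriteLine(string.Join(",", a.Intersect(b)));  // 3
        Console.WriteLine(string.Join(",", a.Except(b)));     // 1,2
    }
}
```

Because every helper takes and returns a collection, the calls compose freely, which is the property the later slides rely on when a whole query is handed to a different provider.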

Page 10: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

LINQ Providers

SQL, XML, …, Google, Wikipedia, Twitter

• Select / SelectMany

• Where

• GroupBy

• OrderBy

• Join

• Union / Intersect / Except

• …

Page 11: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

LINQ System Architecture

[Figure: a .Net program exchanges queries and objects with the LINQ provider interface, behind which sit LINQ-to-SQL, LINQ-to-XML, PLINQ, and DryadLINQ.]

Page 12: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Parallel Collections

[Figure: a Collection divided into Partitions]

Simplest example: GFS/HDFS file
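A rough single-machine sketch of the idea: a collection is cut into partitions, and the same per-element function runs on each partition independently. The names `PartitionedCollectionSketch`, `Partition`, and `MapPartitions` are hypothetical, round-robin placement is an arbitrary choice, and PLINQ threads merely stand in for cluster machines; none of this is the DryadLINQ API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionedCollectionSketch
{
    // Split a collection into `parts` partitions, assigning elements round-robin.
    public static List<List<T>> Partition<T>(IEnumerable<T> items, int parts)
    {
        var partitions = Enumerable.Range(0, parts).Select(i => new List<T>()).ToList();
        int next = 0;
        foreach (var item in items)
        {
            partitions[next % parts].Add(item);
            next++;
        }
        return partitions;
    }

    // Run the same function over every partition independently.
    // AsOrdered keeps the partitions in their original order in the result.
    public static List<List<R>> MapPartitions<T, R>(List<List<T>> partitions, Func<T, R> f) =>
        partitions.AsParallel().AsOrdered()
                  .Select(p => p.Select(f).ToList())
                  .ToList();

    public static void Main()
    {
        var parts = Partition(Enumerable.Range(1, 10), 3);
        var squared = MapPartitions(parts, x => x * x);

        foreach (var p in squared)
            Console.WriteLine(string.Join(",", p));

        // The union of the partitions is still the whole collection.
        Console.WriteLine(squared.SelectMany(p => p).Sum());  // 385
    }
}
```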

Page 13: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Dryad + LINQ = DryadLINQ

string uri = @"file://\\machine\directory\input.pt";
PartitionedTable<LineRecord> input =
    PartitionedTable.Get<LineRecord>(uri);

var lengths = input.Select(line => line.ToString().Length);

Page 14: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Word Count with DryadLINQ

string uri = @"file://\\machine\directory\input.pt";
PartitionedTable<LineRecord> input =
    PartitionedTable.Get<LineRecord>(uri);

string separator = ",";
var words = input.SelectMany(x => SplitLineRecord(x, separator));

var groups = words.GroupBy(x => x);

var counts = groups.Select(x => new Pair(x.Key, x.Count()));

var ordered = counts.OrderByDescending(x => x.count);

var top = ordered.Take(k);

top.ToDryadPartitionedTable("matching.pt");

Execution Plan Graph: Get → SM → G → S → O → Take
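The same operator chain can be run locally with LINQ to Objects, which is a convenient way to check the query logic before pointing it at a cluster. In this sketch an in-memory array of strings replaces the partitioned table, `string.Split` replaces the slide's `SplitLineRecord` helper, and the `Pair` class with `word` and `count` fields is an assumed shape; only the SelectMany / GroupBy / Select / OrderByDescending / Take chain is taken from the slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the slide's Pair type: a word and how often it occurs.
public sealed class Pair
{
    public readonly string word;
    public readonly int count;
    public Pair(string word, int count) { this.word = word; this.count = count; }
}

public static class LocalWordCount
{
    // Returns the k most frequent words across all lines.
    public static List<Pair> TopWords(IEnumerable<string> lines, char separator, int k)
    {
        var words = lines.SelectMany(
            x => x.Split(new[] { separator }, StringSplitOptions.RemoveEmptyEntries));
        var groups = words.GroupBy(x => x);
        var counts = groups.Select(x => new Pair(x.Key, x.Count()));
        var ordered = counts.OrderByDescending(x => x.count);
        return ordered.Take(k).ToList();
    }

    public static void Main()
    {
        var lines = new[] { "a,b,a", "b,a,c" };
        foreach (var p in TopWords(lines, ',', 2))
            Console.WriteLine(p.word + " " + p.count);  // prints "a 3" then "b 2"
    }
}
```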

Page 15: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

DryadLINQ Word Count Dryad

[Figure: the word-count query shown as an Execution Plan Graph, a Data Flow Graph, and a Distributed Data Flow Graph. Vertices are labeled SM, G, S, O, D, and MS, and appear in several copies in the distributed graph.]

Page 16: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

DryadLINQ Architecture [1]

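[Figure: on the client machine, .Net programs produce a query expression, which DryadLINQ turns into a distributed query plan with vertex code and context. In the cluster, the Dryad JM drives Dryad execution over the input tables to produce the output tables.]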

Page 17: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

DryadLINQ Code Generation

string uri = @"file://\\machine\directory\input.pt";
PartitionedTable<LineRecord> input =
    PartitionedTable.Get<LineRecord>(uri);

string separator = ",";
var words = input.SelectMany(x => SplitLineRecord(x, separator));

var groups = words.GroupBy(x => x);

var counts = groups.Select(x => new Pair(x.Key, x.Count()));

var ordered = counts.OrderByDescending(x => x.count);

var top = ordered.Take(k);

top.ToDryadPartitionedTable("matching.pt");

Conversion of subexpressions to code for Dryad vertices…

1. Local variables
2. Local libraries and functions

Page 18: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

DryadLINQ Architecture [2]

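[Figure: the same architecture as in DryadLINQ Architecture [1], with the return path added. The client invokes the query; Dryad execution in the cluster reads the input tables and writes the output tables; the results come back to the client as an output PartitionedTable, which the .Net program consumes as .Net objects.]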

Page 19: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing


Combining with LINQ-to-SQL

[Figure: DryadLINQ splits a Query into Subqueries, some of which are handed to LINQ-to-SQL.]

Page 20: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

DryadLINQ Optimizations

• Some are similar to existing DB optimizations
  – Eliminate redundant partitioning steps
  – Aggregation steps moved up the graph, before partitioning steps

• Existing Dryad optimizations as well
  – Dynamic reconfiguration of aggregation trees
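Moving aggregation ahead of partitioning can be illustrated with a toy word count. The sketch below simulates partitions as in-memory lists and counts how many records would have to cross the network under each plan; the class and method names are hypothetical, and it shows the general partial-aggregation idea rather than DryadLINQ's actual optimizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartialAggregationSketch
{
    // Naive plan: ship every word to the grouping stage, then count.
    public static (Dictionary<string, int> Counts, int RecordsShuffled) Naive(
        List<List<string>> partitions)
    {
        var shuffled = partitions.SelectMany(p => p).ToList();
        var counts = shuffled.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());
        return (counts, shuffled.Count);
    }

    // Optimized plan: count inside each partition first, ship only
    // (word, partial count) pairs, then sum the partial counts.
    public static (Dictionary<string, int> Counts, int RecordsShuffled) WithPartialAggregation(
        List<List<string>> partitions)
    {
        var partials = partitions
            .SelectMany(p => p.GroupBy(w => w)
                              .Select(g => new KeyValuePair<string, int>(g.Key, g.Count())))
            .ToList();
        var counts = partials.GroupBy(kv => kv.Key)
                             .ToDictionary(g => g.Key, g => g.Sum(kv => kv.Value));
        return (counts, partials.Count);
    }

    public static void Main()
    {
        var partitions = new List<List<string>>
        {
            new List<string> { "a", "a", "b" },
            new List<string> { "a", "b", "b", "b" },
        };
        Console.WriteLine(Naive(partitions).RecordsShuffled);                   // 7
        Console.WriteLine(WithPartialAggregation(partitions).RecordsShuffled);  // 4
    }
}
```

Both plans produce the same counts; the saving grows with the number of repeated keys per partition, and the trick only applies when the aggregate can be combined from partial results, as a count or sum can.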

Page 21: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Thoughts [1]

• Easy to read, though reads more like a PL paper

• What are system contributions that are different from Dryad?

• Does the high-level abstraction provide any extra information that allows …

Page 22: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Thoughts [2]

Interesting anecdote…

DryadLINQ is inefficient for random-access workloads, yet for some workloads it outperformed systems customized for random access

HDD performance characteristics are such that a sequential read (even if you discard 99% of the data) is faster than many small random accesses
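A back-of-the-envelope calculation shows why this can happen. Every number below is an assumed round figure chosen for illustration (10 ms per random access, 100 MB/s sequential throughput, a 100 GB dataset of 100-byte records, 1 record in 100 wanted); none of them comes from the talk or the paper.

```csharp
using System;

public static class ScanVersusSeek
{
    // Seconds to fetch `lookups` records with one random disk access each.
    public static double RandomAccessSeconds(long lookups, double secondsPerAccess) =>
        lookups * secondsPerAccess;

    // Seconds to stream the whole dataset sequentially, keeping or discarding as we go.
    public static double SequentialScanSeconds(double datasetBytes, double bytesPerSecond) =>
        datasetBytes / bytesPerSecond;

    public static void Main()
    {
        // Assumed, illustrative figures only.
        const double datasetBytes = 100e9;      // 100 GB
        const double recordBytes = 100;         // 100-byte records
        const double secondsPerAccess = 0.010;  // 10 ms per random access
        const double bytesPerSecond = 100e6;    // 100 MB/s sequential read

        long records = (long)(datasetBytes / recordBytes);
        long lookups = records / 100;           // we want 1% and discard 99%

        double random = RandomAccessSeconds(lookups, secondsPerAccess);
        double scan = SequentialScanSeconds(datasetBytes, bytesPerSecond);

        Console.WriteLine("random access: " + random + " s");
        Console.WriteLine("full scan:     " + scan + " s");
        Console.WriteLine("scan is faster by a factor of " + (random / scan));
    }
}
```

Under these assumptions the full scan finishes roughly two orders of magnitude sooner, even though it reads a hundred times more data than it keeps; with a much smaller wanted fraction the balance tips back toward random access.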

Page 23: DryadLINQ A System for General-Purpose Distributed Data-Parallel Computing

Thoughts [3]

• How different is FlumeJava from this?