From LINQ to DryadLINQ

Michael Isard
Workshop on Data-Intensive Scientific Computing Using DryadLINQ



Page 2: From LINQ to  DryadLINQ

Overview

• From sequential code to parallel execution
• Dryad fundamentals
• Simple program example, plan for practicals

Page 3: From LINQ to  DryadLINQ

Distributed computation

• Single computer, shared memory
  – All objects always available for read and write
• Cluster of workstations
  – Each computer sees a subset of objects
  – Writes on one computer must be explicitly shared
• System automatically handles complexity
  – Needs some help

Page 4: From LINQ to  DryadLINQ

Data-parallel computation

• LINQ is a high-level declarative specification
• Same action on entire collection of objects
• set.Select(x => f(x))
  – Compute f(x) on each x in set, independently
• set.GroupBy(x => key(x))
  – Group by unique keys, independently
• set.OrderBy(x => key(x))
  – Sort whole set (system chooses how)
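As a rough single-machine analogue of the three operators above, the sketch below uses plain Python rather than LINQ. The functions f and key and the sample data are made up for illustration; nothing here is DryadLINQ code.

```python
def f(x):
    # Hypothetical per-element function.
    return x * x

def key(x):
    # Hypothetical key function.
    return x % 3

data = [5, 3, 8, 1, 6]

# Analogue of set.Select(x => f(x)): each element is handled on its own.
selected = [f(x) for x in data]

# Analogue of set.GroupBy(x => key(x)): collect elements that share a key.
groups = {}
for x in data:
    groups.setdefault(key(x), []).append(x)

# Analogue of set.OrderBy(x => key(x)): a stable sort by key.
ordered = sorted(data, key=key)

print(selected, groups, ordered)
```

In each case the program states what result is wanted for the whole collection and leaves the iteration strategy to the runtime, which is what gives a distributed system room to choose a parallel plan.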

Page 5: From LINQ to  DryadLINQ

Distributed cluster computing

• Dataset is stored on local disks of cluster

[Figure: set is split into eight partitions, set.0 to set.7, spread across the local disks of the cluster machines]


Page 7: From LINQ to  DryadLINQ

Simple distributed computation

var set2 = set.Select(x => f(x))

[Figure: the collection set is mapped to the collection set2]

Page 8: From LINQ to  DryadLINQ

Simple distributed computation

var set2 = set.Select(x => f(x))

[Figure: input partitions set.0 to set.7 on the cluster's disks, with one output partition, set2.0 to set2.7, for each]

Page 9: From LINQ to  DryadLINQ

Simple distributed computation

var set2 = set.Select(x => f(x))

[Figure: eight copies of f run side by side; copy i reads set.i and writes set2.i, for i = 0 to 7]
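The partition-by-partition plan for Select can be mimicked on one machine. This is a minimal sketch in plain Python, assuming a made-up function f and toy data; the list of lists stands in for set.0 to set.7 and is not how DryadLINQ stores tables.

```python
def f(x):
    # Hypothetical per-record function.
    return x + 1

def select_partition(partition):
    """Apply f to every record of one partition; no other partition is needed."""
    return [f(x) for x in partition]

# Eight toy input partitions, standing in for set.0 ... set.7.
set_parts = [[i, i + 10] for i in range(8)]

# Each call is independent of the others, so on a cluster each one could run
# on the machine that already stores the corresponding input partition.
set2_parts = [select_partition(p) for p in set_parts]

# The partitioned result holds the same records as applying f to the whole set.
flat_in = [x for p in set_parts for x in p]
flat_out = [y for p in set2_parts for y in p]
assert flat_out == [f(x) for x in flat_in]
```

Because no record moves between partitions, this plan needs no cross-computer data transfer at all.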


Page 11: From LINQ to  DryadLINQ

Distributed acyclic graph

• Computation reads and writes along edges
• Graph shows parallelism via independence
• Goals of DryadLINQ optimizer
  – Extract parallelism (find independent work)
  – Control data skew (balance work across nodes)
  – Limit cross-computer data transfer

Page 12: From LINQ to  DryadLINQ

Distributed grouping

var groups = set.GroupBy(x => x.key)

• set is a collection of records, each with a key
• Don’t know what keys are present
  – Or in which partitions
• First, reorganize data
  – All records with same key on same computer
• Then can do final grouping in parallel
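The two steps above (reorganize by key, then group in parallel) can be sketched in plain Python. This is a toy simulation, not DryadLINQ code: the helper names are hypothetical, and zlib.crc32 is used only as a stand-in for whatever hash function a real system would apply.

```python
import zlib

def stable_hash(k):
    # Deterministic stand-in hash; Python's built-in hash() of strings
    # changes from run to run.
    return zlib.crc32(str(k).encode("utf-8"))

def hash_partition(partitions, key, num_parts):
    """Send each record to output partition hash(key) % num_parts."""
    out = [[] for _ in range(num_parts)]
    for part in partitions:
        for rec in part:
            out[stable_hash(key(rec)) % num_parts].append(rec)
    return out

def group_locally(partition, key):
    """Group the records of a single partition by key."""
    groups = {}
    for rec in partition:
        groups.setdefault(key(rec), []).append(rec)
    return groups

# Toy input: four partitions whose records are just their keys.
set_parts = [["a", "c"], ["a", "d"], ["d", "b"], ["b", "a"]]
key = lambda rec: rec

shuffled = hash_partition(set_parts, key, num_parts=2)
groups_per_part = [group_locally(p, key) for p in shuffled]

# Every key lives in exactly one partition, so merging cannot collide.
all_groups = {}
for g in groups_per_part:
    all_groups.update(g)
```

Which partition a given key lands in depends on the hash, but equal keys always land together, which is the only property the local grouping step relies on.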


Page 14: From LINQ to  DryadLINQ

Distributed grouping

var groups = set.GroupBy(x => x.key)

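[Figure: set → hash partition by key → group locally → groups. Four input partitions holding the keys (a, c), (a, d), (d, b) and (b, a) are hash-partitioned into two partitions, (a, a, a, c) and (b, b, d, d). Each is then grouped locally, giving the groups a ×3 and c ×1 on one computer and b ×2 and d ×2 on the other.]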


Page 17: From LINQ to  DryadLINQ

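Distributed sorting

var sorted = set.OrderBy(x => x.key)

[Figure: set → sample → compute histogram → range partition by key → sort locally → sorted. Four input partitions, (100, 1), (1, 1), (2, 3) and (4, 1), are sampled, and a histogram of the sample gives the key ranges [1,1] and [2,100]. The records are range-partitioned into (1, 1, 1, 1) and (100, 2, 3, 4), and each partition is sorted locally, giving 1 1 1 1 2 3 4 100.]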
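The sorting plan (sample, pick key ranges, range partition, sort locally) can be imitated in plain Python. This is a sketch under stated assumptions: the sampling rule (first record of each partition) and the way boundaries are picked are simplifications chosen to keep the example deterministic, so the split point differs from the ranges [1,1] and [2,100] on the slide. All helper names are hypothetical.

```python
from bisect import bisect_right

def sample(partitions):
    # Crude deterministic sample: the first record of each non-empty partition.
    return [part[0] for part in partitions if part]

def choose_boundaries(sampled_keys, num_parts):
    """Pick num_parts - 1 split points from the sorted sample."""
    s = sorted(sampled_keys)
    return [s[len(s) * i // num_parts] for i in range(1, num_parts)]

def range_partition(partitions, boundaries):
    """Send each key to the range it falls in; ranges are ordered."""
    out = [[] for _ in range(len(boundaries) + 1)]
    for part in partitions:
        for k in part:
            out[bisect_right(boundaries, k)].append(k)
    return out

# The same eight keys as in the figure, in four input partitions.
set_parts = [[100, 1], [1, 1], [2, 3], [4, 1]]

boundaries = choose_boundaries(sample(set_parts), num_parts=2)
ranged = range_partition(set_parts, boundaries)

# Each range can be sorted independently; concatenating the ranges in
# order gives the globally sorted result.
sorted_parts = [sorted(p) for p in ranged]
result = [k for p in sorted_parts for k in p]
```

With this data the sketch puts six keys in one range and two in the other, a small instance of data skew: the quality of the sample decides how evenly the sorting work is spread.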


Page 20: From LINQ to  DryadLINQ

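Additional optimizations

var histogram = set.GroupBy(x => x.key).Select(g => new { g.Key, Count = g.Count() });

[Figure: set → hash partition by key → group locally → count → histogram. Four input partitions, (a, b, b, a), (a, a, d, d), (b, d, b, d) and (a, b, b, a), are hash-partitioned by key so that the six a records land in one partition and the six b and four d records in another. Grouping and counting locally then gives a,6 and b,6 d,4. Every one of the 16 records goes through the hash partition step before it is counted.]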


Page 23: From LINQ to  DryadLINQ

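var histogram = set.GroupBy(x => x.key).Select(g => new { g.Key, Count = g.Count() });

[Figure: set → group locally (with partial counts) → hash partition by key → group locally → combine counts → histogram. Each input partition is first grouped and counted where it is stored, giving (a,2 b,2), (a,2 d,2), (b,2 d,2) and (a,2 b,2). Only these eight (key, count) pairs are hash-partitioned, into (a,2 a,2 a,2) and (b,2 b,2 b,2 d,2 d,2), and combining the counts gives a,6 and b,6 d,4.]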
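The optimized histogram plan can be simulated in plain Python to show why it moves less data. This is a toy sketch, not DryadLINQ code: the helper names are hypothetical and zlib.crc32 stands in for the real hash function.

```python
import zlib
from collections import Counter

def stable_hash(k):
    # Deterministic stand-in hash (built-in hash() of strings varies per run).
    return zlib.crc32(str(k).encode("utf-8"))

def count_locally(partition):
    """Group one partition by key and count: returns (key, count) pairs."""
    return sorted(Counter(partition).items())

def hash_partition_pairs(pair_lists, num_parts):
    """Route each (key, count) pair by the hash of its key."""
    out = [[] for _ in range(num_parts)]
    for pairs in pair_lists:
        for k, n in pairs:
            out[stable_hash(k) % num_parts].append((k, n))
    return out

def combine_counts(pairs):
    """Add up the partial counts for each key."""
    total = Counter()
    for k, n in pairs:
        total[k] += n
    return dict(total)

# The four input partitions from the figure.
set_parts = [list("abba"), list("aadd"), list("bdbd"), list("abba")]

partial = [count_locally(p) for p in set_parts]      # runs where the data is
shuffled = hash_partition_pairs(partial, num_parts=2)

histogram = {}
for pairs in shuffled:
    histogram.update(combine_counts(pairs))

# Items that pass through the hash partition step under each plan.
records_shipped_naive = sum(len(p) for p in set_parts)
records_shipped_optimized = sum(len(p) for p in partial)
```

Here the optimized plan repartitions 8 pairs instead of 16 records. The rewrite is valid because counts can be added in any grouping, so partial results computed per partition combine to the same totals.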

Page 24: From LINQ to  DryadLINQ

What Dryad does

• Abstracts cluster resources
  – Set of computers, network topology, etc.
• Schedules DAG: chooses cluster computers
  – Fairly among competing jobs
  – So computation is close to data
• Recovers from transient failures
  – Reruns computations on machine or network fault
  – Speculatively runs duplicates of slow computations
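The idea of rerunning a computation after a transient fault can be illustrated with a small plain-Python sketch. It assumes the computation is deterministic and can simply be started again from the same inputs. The exception type, the retry limit and the helper names are invented for the example and say nothing about how Dryad itself is implemented.

```python
class TransientFault(Exception):
    """Stands in for a machine or network fault."""

def run_with_retries(vertex, inputs, max_attempts=3):
    """Run vertex(inputs), starting over after a fault, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return vertex(inputs), attempt
        except TransientFault:
            if attempt == max_attempts:
                raise  # give up: the fault does not look transient

def make_flaky(fn, failures):
    """Wrap fn so that its first `failures` calls raise a simulated fault."""
    remaining = {"n": failures}
    def wrapped(inputs):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientFault("simulated fault")
        return fn(inputs)
    return wrapped

# A toy computation (counting records) that fails twice before succeeding.
count_records = make_flaky(len, failures=2)
result, attempts = run_with_retries(count_records, ["x", "y", "z"])
```

The caller sees only the final result; the two failed attempts are absorbed by the retry loop.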

Page 25: From LINQ to  DryadLINQ

Resources are virtualized

• Each graph node is a process
  – Writes outputs to disk
  – Reads inputs from upstream nodes’ output files
• Graph generally larger than cluster
  – 1TB input, 250MB partition, 4000 parts
• Cluster is shared
  – Don’t size program for exact cluster
  – Use whatever share of resources is available
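The partition arithmetic above, and what it means for a graph that is larger than the cluster, can be checked with a few lines of Python. The machine counts are hypothetical, decimal units (1 TB = 10^12 bytes) are assumed, and the "one process per machine at a time" model is a simplification made only for this example.

```python
import math

input_bytes = 10**12              # 1 TB, decimal units assumed
partition_bytes = 250 * 10**6     # 250 MB
num_parts = input_bytes // partition_bytes

def waves(num_tasks, machines_available):
    """Rounds needed if each available machine runs one task at a time."""
    return math.ceil(num_tasks / machines_available)

# Hypothetical shares of a shared cluster: the same graph runs either way,
# it just takes more or fewer rounds.
print(num_parts, waves(num_parts, 100), waves(num_parts, 3000))
```

The program is written against the 4000 partitions, not against a machine count, so it does not need to change when the available share of the cluster does.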

Page 26: From LINQ to  DryadLINQ

What controls parallelism

• Initially based on partitioning of inputs

• After reorganization, system or user decides

Page 27: From LINQ to  DryadLINQ

DryadLINQ-specific operators

• set = PartitionedTable.Get<T>(uri)
• set.ToPartitionedTable(uri)
• set.HashPartition(x => f(x), numberOfParts)
• set.AssumeHashPartition(x => f(x))
• [Associative] f(x) { … }
• RangePartition(…), Apply(…), Fork(…)
• [Decomposable], [Homomorphic], [Resource]
• Field mappings, Multiple partitioned tables, …

Page 28: From LINQ to  DryadLINQ

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using LinqToDryad;

namespace Count
{
    class Program
    {
        public const string inputUri = @"tidyfs://datasets/Count/inputfile1.pt";

        static void Main(string[] args)
        {
            // Open the partitioned table stored at inputUri.
            PartitionedTable<LineRecord> table = PartitionedTable.Get<LineRecord>(inputUri);

            // Count the records in the table and print the total.
            Console.WriteLine("Lines: {0}", table.Count());

            // Keep the console window open until a key is pressed.
            Console.ReadKey();
        }
    }
}

Page 29: From LINQ to  DryadLINQ

Form into groups

• 9 groups, one MSRI member per group
• Try to pick common interest for project later

Page 30: From LINQ to  DryadLINQ

sherwood-246 to sherwood-253, sherwood-255

d:\dryad\data\Workshop\DryadLINQ\samples
  Count, Points, Robots

Cluster job browser: d:\dryad\data\Workshop\DryadLINQ\job_browser\DryadAnalysis.exe

TidyFS (file system) browser: d:\dryad\data\Workshop\DryadLINQ\bin\retail\tidyfsexplorerwpf.exe