DryadLINQ: Computer Vision (among other things) on a cluster ECCV AC workshop 14 th June, 2008 Michael Isard Microsoft Research, Silicon Valley


DryadLINQ: Computer Vision (among other things) on a cluster

ECCV AC workshop 14th June, 2008

Michael Isard

Microsoft Research, Silicon Valley


Parallel programming, yada yada

• Intel claims we will all have many-core, etc.
• “This algorithm is easily parallelizable”
  – Not “we implemented a parallel version”
• Historically, low-latency fine-grain parallelism
  – Shared-memory SMP (threads, locks, etc.)
  – MPI (finite-element analysis, etc.)
• But also data-parallel!
  – We have lots of data now (video, the web)
  – But most people still use their laptops/toy data
  – Even “big” systems use tens of computers


Why do people use Matlab?

• Parallel programming tedious and complex
  – Distributed programming even worse
  – Perl scripts, manual management of data, …
• Matlab is easy (or at least popular)
  – Relatively few high-level constructs
  – System “does the right thing”
  – Programmers willing to put up with a lot
• We want similarly low barrier to entry
  – Familiar languages, legacy codebase, etc.


What are we doing?

• When single-computer processing runs out of steam
  – Web-scale processing of terabytes of data
    • Infeasible without a big cluster
  – Network log-mining, machine learning
    • Multi-week job → 4 hours on 250 computers
    • 1-hour iteration → 3.5 minutes on 4 computers


A typical data-intensive query

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user.EndsWith(@"\ulfar")
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;


Steps in the query

• logentries: Go through logs and keep only lines that are not comments. Parse each line into a LogEntry object.
• user: Go through logentries and keep only entries that are accesses by ulfar.
• accesses: Group ulfar’s accesses according to what page they correspond to. For each page, count the occurrences.
• htmAccesses: Sort the pages ulfar has accessed according to access frequency.


Serial execution

• logentries: For each line in logs, do…
• user: For each entry in logentries, do…
• accesses: Sort entries in user by page. Then iterate over the sorted list, counting the occurrences of each page as you go.
• htmAccesses: Re-sort entries in accesses by page frequency.
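The serial steps for the user, accesses, and htmAccesses stages can be spelled out as explicit loops. The sketch below is an illustration only, not code generated by DryadLINQ: the names `SerialPlan` and `CountPages` and the tuple-based record are hypothetical, the input is assumed to be already parsed into (user, page) pairs, and the ".htm" filter from the query is left out for brevity.

```csharp
using System;
using System.Collections.Generic;

public static class SerialPlan
{
    // Hypothetical helper: counts page accesses for one user the way the
    // serial plan describes: filter, sort by page, count runs, re-sort.
    public static List<KeyValuePair<string, int>> CountPages(
        IEnumerable<(string User, string Page)> entries, string userSuffix)
    {
        // "For each entry in logentries, do...": keep only this user's accesses.
        var pages = new List<string>();
        foreach (var e in entries)
        {
            if (e.User.EndsWith(userSuffix, StringComparison.Ordinal))
                pages.Add(e.Page);
        }

        // Sort by page so that equal pages become adjacent.
        pages.Sort(StringComparer.Ordinal);

        // Iterate over the sorted list, counting each run of equal pages.
        var counts = new List<KeyValuePair<string, int>>();
        int i = 0;
        while (i < pages.Count)
        {
            int j = i;
            while (j < pages.Count && pages[j] == pages[i]) j++;
            counts.Add(new KeyValuePair<string, int>(pages[i], j - i));
            i = j;
        }

        // Re-sort by access frequency, most frequent first.
        // List.Sort is not stable, so pages with equal counts may appear in any order.
        counts.Sort((a, b) => b.Value.CompareTo(a.Value));
        return counts;
    }
}
```

Every stage here walks the whole dataset on one machine, which is the cost that partitioning the data across a cluster is meant to avoid.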


Parallel execution
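The diagram for this slide is not part of the transcript. One plausible reading, offered as an assumption rather than as DryadLINQ's actual plan, is that the log is split into partitions, each partition is filtered and counted independently, and the partial counts are then merged and sorted. The sketch below mimics that on a single machine; `PartitionedPlan` and `CountPages` are hypothetical names, each inner sequence stands in for one partition, and `AsParallel()` (PLINQ) only uses local cores.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionedPlan
{
    // Hypothetical sketch: each inner sequence stands in for one partition of
    // the log that would live on a different machine in a real cluster.
    public static List<KeyValuePair<string, int>> CountPages(
        IEnumerable<IEnumerable<(string User, string Page)>> partitions,
        string userSuffix)
    {
        // Stage 1: filter and count inside each partition, independently.
        var partials = partitions
            .AsParallel()
            .Select(part => part
                .Where(e => e.User.EndsWith(userSuffix, StringComparison.Ordinal))
                .GroupBy(e => e.Page)
                .ToDictionary(g => g.Key, g => g.Count()))
            .ToList();

        // Stage 2: merge the per-partition counts for each page.
        var merged = new Dictionary<string, int>();
        foreach (var partial in partials)
        {
            foreach (var kv in partial)
            {
                merged[kv.Key] = merged.TryGetValue(kv.Key, out int c)
                    ? c + kv.Value
                    : kv.Value;
            }
        }

        // Stage 3: order by frequency, most frequent first.
        return merged.OrderByDescending(kv => kv.Value).ToList();
    }
}
```

Only the small per-partition dictionaries cross partition boundaries in stage 2, so the expensive filtering and grouping stay next to the data.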


Linear Regression

Vectors x = input(0), y = input(1);

Matrices xx = x.PairwiseOuterProduct(x);

OneMatrix xxs = xx.Sum();

Matrices yx = y.PairwiseOuterProduct(x);

OneMatrix yxs = yx.Sum();

OneMatrix xxinv = xxs.Map(a => a.Inverse());

OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));

$$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$$
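The slide states only the resulting formula. One way to see where it comes from, assuming the usual least-squares objective (an assumption, since the objective is not shown on the slide), is to minimize the squared prediction error over all samples and set the gradient with respect to A to zero:

```latex
\begin{aligned}
E(A) &= \sum_t \lVert y_t - A x_t \rVert^2 \\
\frac{\partial E}{\partial A} &= -2 \sum_t \left(y_t - A x_t\right) x_t^T = 0 \\
A \sum_t x_t x_t^T &= \sum_t y_t x_t^T \\
A &= \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}
\end{aligned}
```

The last step requires the matrix \sum_t x_t x_t^T to be invertible. Both sums run over independent samples, so each can be accumulated per partition and added afterwards, which matches the PairwiseOuterProduct and Sum calls in the code.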


Execution Graph

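[Figure: execution graph for the linear regression. Input partitions X[0], X[1], X[2] and Y[0], Y[1], Y[2] feed per-partition X×Xᵀ and Y×Xᵀ vertices. Each group of products is summed (Σ), the X×Xᵀ sum is inverted ([ ]⁻¹), and the result is multiplied (*) with the Y×Xᵀ sum to give A.]

$$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$$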


DryadLINQ

• Programmer writes sequential C# code
  – Rich type system, libraries, modules, loops…
  – System can figure out data-parallelism
    • Sees declarative expression plans
    • Full control of high-level optimizations
    • Traditional parallel-database tricks


Dryad execution engine

• General-purpose execution environment for distributed, data-parallel applications
  – Concentrates on throughput not latency
  – Assumes private data center
• Automatic management of scheduling, distribution, fault tolerance, etc.
• Well tested over two years on clusters of thousands of computers

Andrew Birrell, Mihai Budiu, Dennis Fetterly, Michael Isard, Yuan Yu


Job = Directed Acyclic Graph

[Figure: a job drawn as a directed acyclic graph. Labels: processing vertices; channels (file, pipe, shared memory); inputs; outputs.]


Scheduler state machine

• Scheduling a DAG
  – Vertex can run anywhere once all its inputs are ready
    • Constraints/hints place it near its inputs
  – Fault tolerance
    • If A fails, run it again
    • If A’s inputs are gone, run upstream vertices again (recursively)
    • If A is slow, run another copy elsewhere and use output from whichever finishes first
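The first two rules (run a vertex once its inputs are ready, and rerun it if it fails) can be illustrated with a small single-threaded loop. This is a sketch under stated assumptions, not Dryad's scheduler: `DagScheduler`, `Run`, `inputsOf`, `tryRun`, and `maxAttempts` are hypothetical names, and placement hints, upstream re-execution, and duplicate copies of slow vertices are all left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DagScheduler
{
    // Hypothetical sketch. inputsOf maps every vertex to the vertices it reads
    // from (every vertex must appear as a key). tryRun executes one vertex and
    // returns false if that attempt failed. Returns the completion order.
    public static List<string> Run(
        IDictionary<string, string[]> inputsOf,
        Func<string, bool> tryRun,
        int maxAttempts = 3)
    {
        var done = new HashSet<string>();
        var order = new List<string>();

        while (done.Count < inputsOf.Count)
        {
            // A vertex is ready once all of its inputs have completed.
            var ready = inputsOf.Keys
                .Where(v => !done.Contains(v) && inputsOf[v].All(done.Contains))
                .ToList();
            if (ready.Count == 0)
                throw new InvalidOperationException("Graph has a cycle or a missing input.");

            foreach (var v in ready)
            {
                // "If A fails, run it again", up to a bounded number of attempts.
                int failures = 0;
                while (!tryRun(v))
                {
                    failures++;
                    if (failures >= maxAttempts)
                        throw new InvalidOperationException("Vertex " + v + " failed repeatedly.");
                }
                done.Add(v);
                order.Add(v);
            }
        }
        return order;
    }
}
```

Rerunning a vertex is safe in this model only because a vertex is assumed to be a deterministic function of its inputs, which is also what makes running a second copy of a slow vertex harmless.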


Static/dynamic optimizations

• Static optimizer builds execution graph
• Dynamic optimizer mutates running graph
  – Picks number of partitions when size is known
  – Builds aggregation trees based on locality


LINQ

• Constructs/type system in .NET v3.5
• Operators to manipulate datasets
  – Data elements are arbitrary .NET types
• Traditional relational operators
  – Select, Join, Aggregate, etc.
• Extensible
  – Add new operators
  – Add new implementations
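In standard LINQ, "adding a new operator" usually means writing an extension method on `IEnumerable<T>`. The sketch below shows the mechanism with a made-up operator; `MyOperators` and `RunningSum` are hypothetical names, and the example targets in-memory LINQ to Objects only. Whether and how DryadLINQ would distribute such an operator is not covered by the slides.

```csharp
using System.Collections.Generic;

public static class MyOperators
{
    // Hypothetical new operator: yields the running sum of a sequence.
    // As an extension method it chains with the built-in operators, and as an
    // iterator it is evaluated lazily, like Where and Select.
    public static IEnumerable<int> RunningSum(this IEnumerable<int> source)
    {
        int total = 0;
        foreach (int x in source)
        {
            total += x;
            yield return total;
        }
    }
}
```

With `using System.Linq;` in scope, `new[] { 1, 2, 3 }.Where(x => x != 2).RunningSum()` yields 1 and then 4.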


DryadLINQ

• Automatically distribute a LINQ program
• Few Dryad-specific extensions
  – Same source program runs on single-core through multi-core up to cluster

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey


A complete DryadLINQ program

public class LogEntry {
    public string user;
    public string ip;
    public string page;

    public LogEntry(string line) {
        string[] fields = line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
    }
}

public class UserPageCount {
    public string user;
    public string page;
    public int count;

    public UserPageCount(string user, string page, int count) {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

DryadDataContext ddc = new DryadDataContext("fs://logfile");
DryadTable<string> logs = ddc.GetTable<string>();
var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user.EndsWith(@"\ulfar")
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;
htmAccesses.ToDryadTable("fs://results");


DryadLINQ: From LINQ to Dryad

[Figure: a LINQ query is turned into a query plan by automatic query plan generation, and Dryad carries out the distributed query execution. The example plan is logs → where → select, for the query below.]

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);


How does it work?

• Sequential code “operates” on datasets
• But really just builds an expression graph
  – Lazy evaluation
• When a result is retrieved
  – Entire graph is handed to DryadLINQ
  – Optimizer builds efficient DAG
  – Program is executed on cluster
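The lazy-evaluation point can be observed with ordinary in-memory LINQ, which also defers work until a result is retrieved. The sketch below only demonstrates that deferral; it does not build or inspect an expression graph, and `LazyDemo` and `CountPredicateCalls` are hypothetical names introduced for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Returns how many times the filter had run before and after the
    // result of the query was retrieved.
    public static (int Before, int After) CountPredicateCalls(IEnumerable<string> lines)
    {
        int calls = 0;

        // Building the query runs no user code; it only records what to do.
        var query = lines
            .Where(l => { calls++; return !l.StartsWith("#", StringComparison.Ordinal); })
            .Select(l => l.ToUpperInvariant());
        int before = calls;

        // Retrieving the result forces evaluation of the whole pipeline.
        query.ToList();
        return (before, calls);
    }
}
```

For a three-line input the method returns (0, 3): the filter has not run at all when the query is built, and runs once per line when the list is materialized.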


Terasort

• 10 billion 100-byte records (10^12 bytes)
• 240 computers, 960 disks
  – 349 secs
• Comparable with record

public struct TeraRecord : IComparable<TeraRecord> {
    public const int RecordSize = 100;
    public const int KeySize = 10;
    public byte[] content;

    // Records are ordered by their leading KeySize bytes.
    public int CompareTo(TeraRecord rec) {
        for (int i = 0; i < KeySize; i++) {
            int cmp = this.content[i] - rec.content[i];
            if (cmp != 0) return cmp;
        }
        return 0;
    }

    public static TeraRecord Read(DryadBinaryReader rd) {
        TeraRecord rec;
        rec.content = rd.ReadBytes(RecordSize);
        return rec;
    }

    public static int Write(DryadBinaryWriter wr, TeraRecord rec) {
        return wr.WriteBytes(rec.content);
    }
}

class Terasort {
    public static void Main(string[] args) {
        DryadDataContext ddc = new DryadDataContext(@"file://\\svc-yuanbyu-00\dryad\terasort");
        DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>("sherwood-sort2.pt");
        var q = records.OrderBy(x => x);
        q.ToDryadPartitionedTable("sherwood-sort2.pt");
    }
}


Machine Learning in DryadLINQ

[Figure: layered software stack with the labels Dryad, DryadLINQ, Large Vector, Machine learning, and Data analysis.]

Kannan Achan, Mihai Budiu


Linear Regression Code

Vectors x = input(0), y = input(1);

Matrices xx = x.PairwiseOuterProduct(x);

OneMatrix xxs = xx.Sum();

Matrices yx = y.PairwiseOuterProduct(x);

OneMatrix yxs = yx.Sum();

OneMatrix xxinv = xxs.Map(a => a.Inverse());

OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));

$$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$$


Expectation Maximization

• 160 lines
• 3 iterations shown


Computer vision

• Ongoing
  – Epitomes, features for image search, …
• Anecdotal evidence
  – Nebojsa Jojic, Anitha Kannan
    • Tutorial from Mihai
    • Anitha implemented Probabilistic Image Map algorithm in an afternoon


Continuing research

• Application-level research
  – What can we write with DryadLINQ?
• System-level research
  – Performance, usability, etc.
• Lots of interest from learning/vision researchers
