microsoft dryadlinq

48
Microsoft DryadLINQ --Jinling Li

Upload: manasa

Post on 23-Feb-2016

70 views

Category:

Documents


0 download

DESCRIPTION

Microsoft DryadLINQ. -- Jinling Li. What’s DryadLINQ ?. A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language . [1] Data-Parallel Computing ( large data example ) High-Level Language DryadLINQ = Dryad+LINQ. Dryad. LINQ. DryadLINQ = Dryad+LINQ. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Microsoft  DryadLINQ

Microsoft DryadLINQ--Jinling Li

Page 2: Microsoft  DryadLINQ

What’s DryadLINQ?• A System for General-Purpose Distributed Data-Parallel

Computing Using a High-Level Language.[1]

• Data-Parallel Computing (large data example)

• High-Level Language

• DryadLINQ=Dryad+LINQ

Dryad

LINQ

Page 3: Microsoft  DryadLINQ

DryadLINQ=Dryad+LINQ

Figure source: [1]

Page 4: Microsoft  DryadLINQ

Outline

• Dryad• LINQ• DryadLINQ• DryadLINQ in Machine Learning • Strengths and Weaknesses

Page 5: Microsoft  DryadLINQ

Dryad• Microsoft Dryad is a high-performance, general-purpose

distributed computing engine that handles some of the most difficult aspects of cluster-based distributed computing. [2]

• Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. [2]

• A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.[2]

Page 6: Microsoft  DryadLINQ

Dryad System Architecture• The job manager contains the application-specific code to construct the

job’s communication graph along with library code to schedule the work across the available resources. [2]

• The name server is used to enumerate all the available computers. The name server also exposes the position of each computer within the network topology . [2]

Figure source:[2]

Page 7: Microsoft  DryadLINQ

Dryad System Architecture• The job manager (JM) consults the name server (NS) to discover the list of

available computers. It maintains the job graph and schedules running vertices (V) as computers become available using the daemon (D) as a proxy. [2]

• The first time a vertex (V) is executed on a computer its binary is sent from the job manager to the daemon and subsequently it is executed from a cache. [2]

Figure source: [2]

Page 8: Microsoft  DryadLINQ

Dryad Computational Model• The basic computational model for Dryad is the directed-acyclic

graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. [3]

Figure source: [3]

Page 9: Microsoft  DryadLINQ

Software Stack• Dryad is mostly used as middleware below a high-level language

layer and low-level internal cluster infrastructure. [3]

Figure source:[3]

Page 10: Microsoft  DryadLINQ

Below Dryad

• Below Dryad is a cluster-management system that supports some low-level actions like starting a process on a remote computer, and one or more distributed storage systems that support partitioned files. [4]

• Most of the Dryad development has been done on top of a Microsoft internal cluster infrastructure called Cosmos that was developed by the Bing product group. [4]

Page 11: Microsoft  DryadLINQ

Outline

• Dryad• Introduction• Computational Model• System Architecture• Software Stack

• LINQ• DryadLINQ• DryadLINQ in Machine Learning • Strengths and Weaknesses

Page 12: Microsoft  DryadLINQ

LINQ• LINQ=Language Integrated Queries• A Microsoft .NET Framework component that adds native data

querying capabilities to .NET languages• Comprises a set of operators to manipulate collections of .Net

objects and bridges the gap between the world of objects and the world of data. [5]

Figure source: [5]

Page 13: Microsoft  DryadLINQ

LINQ• LINQ adds high level declarative data manipulation to many of

the .NET programming languages, including C#, Visual Basic and F#. [5]

• LINQ datasets are .NET collections. Technically, a .NET collection of values of type T is a data type which implements the predefined interface IEnumerable<T>[5]

• Another interface IQueryable<T> represents a query (i.e., a computation) that can produce a collection with elements of type T. [5]

Page 14: Microsoft  DryadLINQ

LINQ operatorsOperation Example Result

Where C.Where(x=>x>3) (4,5)

Select C.Select(x=>x+1) (2,3,4,5,6)

Aggregate C.Aggregate((x,y)=>x+y) 15

GroupBy C.GroupBy(x=>x%2) ((1,3,5),(2,4))

OrderBy C.OrdrBy(x=>-x) (5,4,3,2,1)

SelectMany C.SelectMany(x=>Factors(x)) (1,1,2,1,3,1,2,4,1,5)

Join C.Join(C, x=>x, x=>x-4, (x,y)=>x+y) (6)

Examples using LINQ operators on collection C={1,2,3,4,5}. Factors is a user defined function. Table source: [5]

Page 15: Microsoft  DryadLINQ

A Simple Example: Word Count• Count word frequency in a set of documents [6]

• var documents = GetDocuments();• var words = documents.SelectMany (document => document.Words);• var groups = words.GroupBy(word=>word);• var counts = groups.Select• (group => new WordCount(group.Key, group.Count()));

Page 16: Microsoft  DryadLINQ

Outline

• Dryad• LINQ• Introduction• Interfaces• Operators

• DryadLINQ• DryadLINQ in Machine Learning • Strengths and Weaknesses

Page 17: Microsoft  DryadLINQ

DryadLINQ• DryadLINQ bridges the gap between Dryad and LINQ layer. • DryadLINQ translates programs written in LINQ into Dryad job

execution plans that can be executed on a cluster by Dryad, and transparently returns the results to the host application[5].

Figure source: [5]

Page 18: Microsoft  DryadLINQ

Example: LINQ operators

Figure source: [5]

Page 19: Microsoft  DryadLINQ

2D Piping• The Dryad job execution plans generated by DryadLINQ are composable: the

output of one graph can become the input of another one. In fact, this is exactly how complex LINQ queries are translated: each operator is translated to a graph independently, and the graphs are then concatenated. [7]

Figure source: [7]

Page 20: Microsoft  DryadLINQ

A Simple Example: Word Count• Count word frequency in a set of documents[6]:

• var documents = GetDocuments();• var words = documents.SelectMany (document => document.Words);• var groups = words.GroupBy(word=>word);• var counts = groups.Select• (group => new WordCount(group.Key, group.Count()));

Page 21: Microsoft  DryadLINQ

Word Count in DryadLINQ• Count word frequency in a set of documents[6]:

• var documents = DryadLinq.GetTable<Document>• (“file://docs.txt”);• var words = documents.SelectMany• (document => document.Words);• var groups = words.GroupBy(word=>word);• var counts = groups.Select• (group => new WordCount(group.Key, group.Count()));

Page 22: Microsoft  DryadLINQ

Distributed Execution of Word Count

Figure source: [6]

Page 23: Microsoft  DryadLINQ

Another Example: extract Ulfar’s favorite web pages from many web log files

Page 24: Microsoft  DryadLINQ

DryadLINQ Execution Overview

Figure source: [2]

Page 25: Microsoft  DryadLINQ

DryadLINQ=Dryad+LINQ

Figure source: [8]

Page 26: Microsoft  DryadLINQ

Outline

• Dryad• LINQ• DryadLINQ• Introduction• DryadLINQ = Compiles LINQ to Dryad• LINQ operators and other examples

• DryadLINQ in Machine Learning • Strengths and Weaknesses

Page 27: Microsoft  DryadLINQ

DryadLINQ in Machine Learning

Page 28: Microsoft  DryadLINQ

Real-life Application: XBox

Figure source: [9]

Page 29: Microsoft  DryadLINQ

Example: k-means[5]

• K-means in LINQ

Page 30: Microsoft  DryadLINQ

K-means in DryadLINQ[5]

• How to implement the GroupBy operation at the heart of the k-means aggregation?

• DryadLINQ generates a job execution plan that uses two-level aggregation: each computer builds local groups with the local data and only sends the aggregated information about these groups to the next stage; the next stage computes the actual centroid. [5]

Page 31: Microsoft  DryadLINQ

K-means in DryadLINQ

Figure source: [5]

Page 32: Microsoft  DryadLINQ

Example: Decision Tree[5]

• Represent a decision tree with a dictionary that maps tree node indices (integer values) to attribute indices in the attribute array

• The most common algorithm to induce a decision tree starts from an empty tree and a set of records with class labels and attributes with values.

• The algorithm repeatedly extends the tree by grouping records by their current location under the partial tree, and for each such group determining the attribute resulting in the greatest reduction in conditional entropy.

• records.GroupBy(record => TreeWalk(record, tree)) .Select(group => FindBestAttribute(group));

Page 33: Microsoft  DryadLINQ

Decision Tree Induction in DryadLINQ[5]

Page 34: Microsoft  DryadLINQ

Decision Tree Induction in DryadLINQ

• Each iteration through the loop invokes a query returning the list of attribute indices that are best for each of the leaves in the old tree. the tree variable is updated on the client computer, and retransmitted to the cluster by DryadLINQ with each iteration. [5]

Page 35: Microsoft  DryadLINQ

Decision Tree Induction in DryadLINQ[5]

Figure source: [5]

Page 36: Microsoft  DryadLINQ

Example: Singular Value Decomposition[5]

• The Singular Value Decomposition (SVD) lies at the heart of several large scale data analyses: principal components analysis, collaborative filtering, image segmentation, among many others. [5]

• The SVD of a n*m matrix A is a decomposition such that U and V are both orthonormal And is a diagonal matrix with non-negative entries. [5]

• Example

Page 37: Microsoft  DryadLINQ

Singular Value Decomposition in DryadLINQ [5]

Page 38: Microsoft  DryadLINQ

Singular Value Decomposition in DryadLINQ [5]

Page 39: Microsoft  DryadLINQ

Singular Value Decomposition in DryadLINQ

• Figure source: [5]

Figure source: [5]

Page 40: Microsoft  DryadLINQ

Outline

• Dryad• LINQ• DryadLINQ• DryadLINQ in Machine Learning • K-means• Decision Tree• Singular Value Decomposition

• Strengths and Weaknesses

Page 41: Microsoft  DryadLINQ

Strengths & Weaknesses• DryadLINQ has the following features [8]:

• Declarative programming• Automatic parallelization• Integration with Visual Studio• Integration with .Net• Job graph optimizations• Conciseness

Page 42: Microsoft  DryadLINQ

Strengths & Weaknesses• While DryadLINQ is a great tool to program clusters, there is a

price to pay too for the convenience that it provides[5].

• Efficiency• managed code (C#) is not always as efficient as native code

• Debugging• the experience of debugging a cluster program remains more

painful than debugging a single-computer program• Transparency• In most cases one needs to have some understanding of the

operation of the compiler and particularly of the job execution plans generated to avoid egregious mistakes

Page 43: Microsoft  DryadLINQ

You can have it!• Dryad+DryadLINQ available for download• Academic license• Commercial evaluation license

• Runs on Windows HPC platform

Page 44: Microsoft  DryadLINQ

Conclusion

Figure source: [1]

Page 45: Microsoft  DryadLINQ

Thank you!

Page 46: Microsoft  DryadLINQ

Reference• [1] Microsoft Research webpage: Dryad and DryadLINQ for Data Intensive Research http://

research.microsoft.com/en-us/collaboration/tools/dryad.aspx• [2] DryadLINQ

: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level LanguageYuan Yu et al. Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008.

• [3] Microsoft Research webpage: Dryad http://research.microsoft.com/en-us/projects/dryad/• [4] Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis FetterlyEuropean Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

• [5] Large-Scale Machine Learning using DryadLINQ, chapter in Scaling Up Machine Learning, Frank McSherry, Yuan Yu, Mihai Budiu, Michael Isard, and Dennis Fetterly, Cambridge University Press, December 2011

• [6] DryadLINQ: A System for General-Purpose Distributed Data-Parallel ComputingPresentation by Yuan Yu at OSDI, December, 2008

• [7] Cluster Computing with DryadLINQPresentation by Mihai Budiu at Palo Alto Research Center CSL Colloquium, Palo Alto, CA May 8, 2008

• [8] Microsoft Research webpage: DryadLINQ http://research.microsoft.com/en-us/projects/dryadlinq/

• [9] A Machine-Learning toolking in DryadLINQPresentation slides in PowerPoint by Mihai Budiu and Kannan Achan.

Page 47: Microsoft  DryadLINQ

Parallel Runtimes – DryadLINQ vs. Hadoop

Page 48: Microsoft  DryadLINQ

Expirement by Indiana University Bloomington