microsoft dryadlinq

Post on 23-Feb-2016

70 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Microsoft DryadLINQ. -- Jinling Li. What’s DryadLINQ ?. A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language . [1] Data-Parallel Computing ( large data example ) High-Level Language DryadLINQ = Dryad+LINQ. Dryad. LINQ. DryadLINQ = Dryad+LINQ. - PowerPoint PPT Presentation

TRANSCRIPT

Microsoft DryadLINQ--Jinling Li

What’s DryadLINQ?• A System for General-Purpose Distributed Data-Parallel

Computing Using a High-Level Language.[1]

• Data-Parallel Computing (large data example)

• High-Level Language

• DryadLINQ=Dryad+LINQ

Dryad

LINQ

DryadLINQ=Dryad+LINQ

Figure source: [1]

Outline

• Dryad• LINQ• DryadLINQ• DryadLINQ in Machine Learning • Strengths and Weaknesses

Dryad• Microsoft Dryad is a high-performance, general-purpose

distributed computing engine that handles some of the most difficult aspects of cluster-based distributed computing. [2]

• Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. [2]

• A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.[2]

Dryad System Architecture• The job manager contains the application-specific code to construct the

job’s communication graph along with library code to schedule the work across the available resources. [2]

• The name server is used to enumerate all the available computers. The name server also exposes the position of each computer within the network topology . [2]

Figure source:[2]

Dryad System Architecture• The job manager (JM) consults the name server (NS) to discover the list of

available computers. It maintains the job graph and schedules running vertices (V) as computers become available using the daemon (D) as a proxy. [2]

• The first time a vertex (V) is executed on a computer its binary is sent from the job manager to the daemon and subsequently it is executed from a cache. [2]

Figure source: [2]

Dryad Computational Model• The basic computational model for Dryad is the directed-acyclic

graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. [3]

Figure source: [3]

Software Stack• Dryad is mostly used as middleware below a high-level language

layer and low-level internal cluster infrastructure. [3]

Figure source:[3]

Below Dryad

• Below Dryad is a cluster-management system that supports some low-level actions like starting a process on a remote computer, and one or more distributed storage systems that support partitioned files. [4]

• Most of the Dryad development has been done on top of a Microsoft internal cluster infrastructure called Cosmos that was developed by the Bing product group. [4]

Outline

• Dryad• Introduction• Computational Model• System Architecture• Software Stack

• LINQ• DryadLINQ• DryadLINQ in Machine Learning • Strengths and Weaknesses

LINQ• LINQ=Language Integrated Queries• A Microsoft .NET Framework component that adds native data

querying capabilities to .NET languages• Comprises a set of operators to manipulate collections of .Net

objects and bridges the gap between the world of objects and the world of data. [5]

Figure source: [5]

LINQ• LINQ adds high level declarative data manipulation to many of

the .NET programming languages, including C#, Visual Basic and F#. [5]

• LINQ datasets are .NET collections. Technically, a .NET collection of values of type T is a data type which implements the predefined interface IEnumerable<T>[5]

• Another interface IQueryable<T> represents a query (i.e., a computation) that can produce a collection with elements of type T. [5]

LINQ operatorsOperation Example Result

Where C.Where(x=>x>3) (4,5)

Select C.Select(x=>x+1) (2,3,4,5,6)

Aggregate C.Aggregate((x,y)=>x+y) 15

GroupBy C.GroupBy(x=>x%2) ((1,3,5),(2,4))

OrderBy C.OrdrBy(x=>-x) (5,4,3,2,1)

SelectMany C.SelectMany(x=>Factors(x)) (1,1,2,1,3,1,2,4,1,5)

Join C.Join(C, x=>x, x=>x-4, (x,y)=>x+y) (6)

Examples using LINQ operators on collection C={1,2,3,4,5}. Factors is a user defined function. Table source: [5]

A Simple Example: Word Count• Count word frequency in a set of documents [6]

• var documents = GetDocuments();• var words = documents.SelectMany (document => document.Words);• var groups = words.GroupBy(word=>word);• var counts = groups.Select• (group => new WordCount(group.Key, group.Count()));

Outline

• Dryad• LINQ• Introduction• Interfaces• Operators

• DryadLINQ• DryadLINQ in Machine Learning • Strengths and Weaknesses

DryadLINQ• DryadLINQ bridges the gap between Dryad and LINQ layer. • DryadLINQ translates programs written in LINQ into Dryad job

execution plans that can be executed on a cluster by Dryad, and transparently returns the results to the host application[5].

Figure source: [5]

Example: LINQ operators

Figure source: [5]

2D Piping• The Dryad job execution plans generated by DryadLINQ are composable: the

output of one graph can become the input of another one. In fact, this is exactly how complex LINQ queries are translated: each operator is translated to a graph independently, and the graphs are then concatenated. [7]

Figure source: [7]

A Simple Example: Word Count• Count word frequency in a set of documents[6]:

• var documents = GetDocuments();• var words = documents.SelectMany (document => document.Words);• var groups = words.GroupBy(word=>word);• var counts = groups.Select• (group => new WordCount(group.Key, group.Count()));

Word Count in DryadLINQ• Count word frequency in a set of documents[6]:

• var documents = DryadLinq.GetTable<Document>• (“file://docs.txt”);• var words = documents.SelectMany• (document => document.Words);• var groups = words.GroupBy(word=>word);• var counts = groups.Select• (group => new WordCount(group.Key, group.Count()));

Distributed Execution of Word Count

Figure source: [6]

Another Example: extract Ulfar’s favorite web pages from many web log files

DryadLINQ Execution Overview

Figure source: [2]

DryadLINQ=Dryad+LINQ

Figure source: [8]

Outline

• Dryad• LINQ• DryadLINQ• Introduction• DryadLINQ = Compiles LINQ to Dryad• LINQ operators and other examples

• DryadLINQ in Machine Learning • Strengths and Weaknesses

DryadLINQ in Machine Learning

Real-life Application: XBox

Figure source: [9]

Example: k-means[5]

• K-means in LINQ

K-means in DryadLINQ[5]

• How to implement the GroupBy operation at the heart of the k-means aggregation?

• DryadLINQ generates a job execution plan that uses two-level aggregation: each computer builds local groups with the local data and only sends the aggregated information about these groups to the next stage; the next stage computes the actual centroid. [5]

K-means in DryadLINQ

Figure source: [5]

Example: Decision Tree[5]

• Represent a decision tree with a dictionary that maps tree node indices (integer values) to attribute indices in the attribute array

• The most common algorithm to induce a decision tree starts from an empty tree and a set of records with class labels and attributes with values.

• The algorithm repeatedly extends the tree by grouping records by their current location under the partial tree, and for each such group determining the attribute resulting in the greatest reduction in conditional entropy.

• records.GroupBy(record => TreeWalk(record, tree)) .Select(group => FindBestAttribute(group));

Decision Tree Induction in DryadLINQ[5]

Decision Tree Induction in DryadLINQ

• Each iteration through the loop invokes a query returning the list of attribute indices that are best for each of the leaves in the old tree. the tree variable is updated on the client computer, and retransmitted to the cluster by DryadLINQ with each iteration. [5]

Decision Tree Induction in DryadLINQ[5]

Figure source: [5]

Example: Singular Value Decomposition[5]

• The Singular Value Decomposition (SVD) lies at the heart of several large scale data analyses: principal components analysis, collaborative filtering, image segmentation, among many others. [5]

• The SVD of a n*m matrix A is a decomposition such that U and V are both orthonormal And is a diagonal matrix with non-negative entries. [5]

• Example

Singular Value Decomposition in DryadLINQ [5]

Singular Value Decomposition in DryadLINQ [5]

Singular Value Decomposition in DryadLINQ

• Figure source: [5]

Figure source: [5]

Outline

• Dryad• LINQ• DryadLINQ• DryadLINQ in Machine Learning • K-means• Decision Tree• Singular Value Decomposition

• Strengths and Weaknesses

Strengths & Weaknesses• DryadLINQ has the following features [8]:

• Declarative programming• Automatic parallelization• Integration with Visual Studio• Integration with .Net• Job graph optimizations• Conciseness

Strengths & Weaknesses• While DryadLINQ is a great tool to program clusters, there is a

price to pay too for the convenience that it provides[5].

• Efficiency• managed code (C#) is not always as efficient as native code

• Debugging• the experience of debugging a cluster program remains more

painful than debugging a single-computer program• Transparency• In most cases one needs to have some understanding of the

operation of the compiler and particularly of the job execution plans generated to avoid egregious mistakes

You can have it!• Dryad+DryadLINQ available for download• Academic license• Commercial evaluation license

• Runs on Windows HPC platform

Conclusion

Figure source: [1]

Thank you!

Reference• [1] Microsoft Research webpage: Dryad and DryadLINQ for Data Intensive Research http://

research.microsoft.com/en-us/collaboration/tools/dryad.aspx• [2] DryadLINQ

: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level LanguageYuan Yu et al. Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008.

• [3] Microsoft Research webpage: Dryad http://research.microsoft.com/en-us/projects/dryad/• [4] Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis FetterlyEuropean Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

• [5] Large-Scale Machine Learning using DryadLINQ, chapter in Scaling Up Machine Learning, Frank McSherry, Yuan Yu, Mihai Budiu, Michael Isard, and Dennis Fetterly, Cambridge University Press, December 2011

• [6] DryadLINQ: A System for General-Purpose Distributed Data-Parallel ComputingPresentation by Yuan Yu at OSDI, December, 2008

• [7] Cluster Computing with DryadLINQPresentation by Mihai Budiu at Palo Alto Research Center CSL Colloquium, Palo Alto, CA May 8, 2008

• [8] Microsoft Research webpage: DryadLINQ http://research.microsoft.com/en-us/projects/dryadlinq/

• [9] A Machine-Learning toolking in DryadLINQPresentation slides in PowerPoint by Mihai Budiu and Kannan Achan.

Parallel Runtimes – DryadLINQ vs. Hadoop

Expirement by Indiana University Bloomington

top related