dryad and dryadlinq for data-intensive research (and a bit about windows azure)

Download Dryad and DryadLINQ for data-intensive  research (and a bit about Windows Azure)

If you can't read please download the document

Upload: goldy

Post on 25-Feb-2016

57 views

Category:

Documents


0 download

DESCRIPTION

Dryad and DryadLINQ for data-intensive research (and a bit about Windows Azure). Condor Week 2010, Madison, WI Christophe Poulain Microsoft Research. Core Computer Science. An Introduction. Microsoft External Research Tony Hey, Corporate VP - PowerPoint PPT Presentation

TRANSCRIPT

Dryad and DryadLINQfor data-intensive research(and a bit about Windows Azure)

Condor Week 2010, Madison, WIChristophe PoulainMicrosoft Research1Earth, Energy &EnvironmentCore Computer ScienceAn IntroductionMicrosoft External ResearchTony Hey, Corporate VPWork with the academic and research community to speed research, improve education, foster innovation, and improve lives around the world.Education & Scholarly CommunicationHealth & WellbeingAdvanced Research Tools and Serviceshttp://research.microsoft.com/collaboration/tools

One of our aim is to help scientists spend less time on IT issues and more time on discovery by creating open tools and services based on Microsoft platforms and productivity software4/15/2010 5:53 AM 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

2Windows Azure

It is Microsofts cloud services platform, designed to host web services and applications in Microsoft-owned, Internet-accessible data centers. 112 containers x 2000 servers = 224000 servers3A Spectrum of Application ModelsMicrosoft Azure.NET CLR/Windows OnlyChoice of LanguageSome Auto Failover/ Scale (but needs declarative application properties)Google App EngineTraditional Web AppsAuto Scaling and ProvisioningForce.ComSalesForce Biz AppsAuto Scaling and ProvisioningAmazon AWSVMs Look Like HardwareNo Limit on App ModelUser Must Implement Scalability and FailoverConstraints in the App Model More ConstrainedLess ConstrainedAutomated Management ServicesMore AutomationLess Automation5Windows Azure Compute and Storage ServicesCompute Services

Web Roles provide web service access to the app by the users. Web roles generate tasks for worker rolesWorker Roles do heavy lifting and manage data in tables/blobsCommunication is through queues. Roles are replicated as needed to provide massive scalability.

Storage Services

Blobs: large, unstructured data (blobs in containers) Tables: massive amounts of simply structured data (entity with properties) Queues: messages or requests, allowing web-roles and worker-roles to interact Drives: durable NTFS file system volumes sharable across instances SQL Azure for relational dataConsists of a (large) group of machines all managed by software called the fabric controllerThe fabric controller is replicated across a group of five to seven machines and owns all of the resources in the fabric Because it can communicate with a fabric agent on every computer, the controller is aware of every Windows Azure application running on the fabricIt manages Windows Azure services and hosted applications (load balancing, backup, replication, failover, scale up/down...)

The Azure Fabric

6ImmediateA credit card vs an e-mail to get started. No waiting.PersistentMost applications run continuouslyOther applications are just fineScalableMeet demands for compute and storage paying as you goScale according to needsThousands concurrent users served by one application.One application using thousands of cores.7Characteristics of a Windows Azure Application8Emergence of a Fourth Research ParadigmThousand years ago Experimental ScienceDescription of natural phenomena

Last few hundred years Theoretical ScienceNewtons Laws, Maxwells Equations

Last few decades Computational ScienceSimulation of complex phenomena

Today Data-Intensive ScienceScientists overwhelmed with data setsfrom many different sources Data captured by instrumentsData generated by simulationsData generated by sensor networkseScience is the set of tools and technologiesto support data federation and collaborationFor analysis and data miningFor data visualization and explorationFor scholarly communication and dissemination

Science must move from data to information to knowledgeWith thanks to Tony HeyWith thanks to Jim GrayResearch programming models for writing distributed data-parallel applications that scale from a small cluster to a large data-center.

9Dryad and DryadLINQ

A DryadLINQ programmer can use thousands of machines, each of them with multiple processors or cores, without prior knowledge in parallel programming.

Download available at http://connect.microsoft.com/DryadLINQ

DryadContinuously deployed since 2006Running on >> 104 machinesSifting through > 10Pb data dailyRuns on clusters > 3000 machinesHandles jobs with > 105 processes eachPlatform for rich software ecosystemUsed by >> 100 developers

Written at Microsoft Research, Silicon Valley10

Dryad Job Structuregrepsedsortawkperlgrepgrepsedsortsortawk

InputfilesVertices (processes)OutputfilesChannelsStage

Channel is a finite streams of items NTFS files (temporary) TCP pipes (inter-machine) Memory FIFOs (intra-machine)

11

Dryad System ArchitectureFiles, TCP, FIFO, Networkjob scheduledata planecontrol planeNSPDPDPDVVVJob managercluster1213Dryad PropertiesProvides a general, clean execution layerDataflow graph as the computation modelHigher language layer supplies graph, vertex code, channel types, hints for data locality, Automatically handles executionDistributes code, routes dataSchedules processes on machines near dataMasks failures in cluster and networkBut programming Dryad is not easy

DryadLINQ ExperienceSequential, single machine programming abstractionSame program runs on single-core, multi-core, or clusterFamiliar programming languagesFamiliar development environment

14LINQMicrosofts Language INtegrated QueryReleased with .NET Framework 3.5, Visual Studio optionalA set of operators to manipulate datasets in .NETSupport traditional relational operatorsSelect, Join, GroupBy, Aggregate, etc.Integrated into .NET programming languagesPrograms can call operatorsOperators can invoke arbitrary .NET functionsData modelData elements are strongly typed .NET objectsMuch more expressive than SQL tablesExtremely extensibleAdd new custom operatorsAdd new execution providers1516Example of LINQ QueryIEnumerable logs = GetLogLines();var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\ulfar") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount("ulfar", pages.Key, pages.Count());var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending select access;

16DryadLINQ Data Model17

PartitionPartitionedTable.Net objectsPartitionedTable implements IQueryable and IEnumerable17A complete DryadLINQ programpublic class LogEntry { public string user; public string ip; public string page; public LogEntry(string line) { string[] fields = line.Split(' '); this.user = fields[8]; this.ip = fields[9]; this.page = fields[5]; }}

public class UserPageCount{ public string user; public string page; public int count; public UserPageCount( string user, string page, int count) { this.user = user; this.page = page; this.count = count; }}PartitionedTable logs = PartitionedTable.Get( @file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\logfile.pt);var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\ulfar") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount("ulfar", pages.Key, pages.Count());var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending select access; htmAccesses.ToPartitionedTable( @file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\results.pt);18181920

Collection collection;bool IsLegal(Key k);string Hash(Key);

var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};

21DryadLINQ ProviderC#collectionresults

C#C#C#VertexcodeQueryplan(Dryad job)Data21Map-Reduce in DryadLINQ22public static IQueryable MapReduce( this IQueryable input,Func mapper,Func keySelector,Func reducer) { var map = input.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.Select(reducer); return result;}Map-Reduce Plan23MRGMQG1RDMSG2RstaticdynamicXXMQG1RDMSG2RXMQG1RDMSG2RXMQG1RDMQG1RDMSG2RXMQG1RDMSG2RXMQG1RDMSG2RMSG2Rmapsortgroupbyreducedistributemergesortgroupbyreducemergesortgroupbyreduceconsumermappartial aggregationreduceSSSSAAASSTdynamic24Project Natal: Learning From Data

Motion Capture(ground truth)

ClassifierTraining examplesMachine learningRasterize

Project Natal is using DryadLINQ to train enormous decision trees from millions of images across hundreds of cores.Recognize players from depth map at frame rate using fraction of Xbox CPU.25Scalable clustering algorithm for N-body simulations in a shared-nothing cluster

YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, and Sarah Loebman. UW Tech Report. UW-CSE-09-06-01. June 2009.Large-scale spatial clustering916M particles in 3D clustered under 70 minutes with 8 nodes.Re-implemented using DryadLINQPartition > Local cluster > Merge cluster > RelabelFaster development and good scalability25CAP3 - DNA Sequence Assembly Program [1]IQueryable inputFiles=PartitionedTable.Get(uri);IQueryable = inputFiles.Select(x=>ExecuteCAP3(x.line));[1] X. Huang, A. Madan, CAP3: A DNA Sequence Assembly Program, Genome Research, vol. 9, no. 9, 1999.

EST (Expressed Sequence Tag) corresponds to messenger RNAs (mRNAs) transcribed from the genes residing on chromosomes. Each individual EST sequence represents a fragment of mRNA, and the EST assembly aims to re-construct full-length mRNA sequences for each expressed gene. V

V

Input files (FASTA)Output files\\GCB-K18-N01\DryadData\cap3\cluster34442.fsa\\GCB-K18-N01\DryadData\cap3\cluster34443.fsa...\\GCB-K18-N01\DryadData\cap3\cluster34467.fsa\DryadData\cap3\cap3data100,344,CGB-K18-N011,344,CGB-K18-N019,344,CGB-K18-N01

Cap3data.00000000

Input files (FASTA)

Cap3data.pfCAP3 - PerformanceDryadLINQ for Scientific Analyses, Jaliya Ekanayake, Thilina Gunarathnea, Geoffrey Fox, Atilla Soner Balkir, Christophe Poulain, Nelson Araujo, Roger Barga (IEEE eScience 09)

High Energy Physics Data AnalysisHistogramming of events from a large (up to 1TB) data setData analysis requires ROOT framework (ROOT Interpreted Scripts)Performance depends on disk access speedsHadoop implementation uses a shared parallel file system (Lustre)ROOT scripts cannot access data from HDFSOn demand data movement has significant overheadDryad stores data in local disks giving better performance over Hadoop

Windows Azure is Microsofts cloud platformhttp://www.microsoft.com/windowsazure/

To Condor, this looks like one more resource to harness29Parting Thought #1Parting Thought #2DryadLINQ provides a powerful, elegant programming environment for large-scale data-parallel computing.

Is this of interest to the Condor community?

30AcknowledgementsMSR Silicon Valley Dryad & DryadLINQ teams Andrew Birrell, Mihai Budiu, Jon Currey, Ulfar Erlingsson, Dennis Fetterly, Michael Isard, Pradeep Kunda, Mark Manasse, Chandu Thekkath and Yuan Yu .

http://research.microsoft.com/en-us/projects/dryadhttp://research.microsoft.com/en-us/projects/dryadlinq

MSR External Research Advanced Research Tools and Services Teamhttp://research.microsoft.com/en-us/collaboration/tools/dryad.aspx

MS Product Groups: HPC, Parallel Computing Platform.

Academic CollaboratorsJaliya Ekanayake, Geoffrey Fox, Thilina Gunarathne, Scott Beason, Xiaohong Qiu (Indiana University Bloomington). YongChul Kwon, Magdalena Balazinska (University of Washington). Atilla Soner Balkir, Ian Foster (University of Chicago).

31Dryad/DryadLINQ PapersDryad: Distributed Data-Parallel Programs from Sequential Building Blocks (EuroSys07)DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (OSDI08)Hunting for prolems with Artemis (Usenix WASL, San Diego 08)Distributed Data-Parallel Computing Using a High-Level Programming Language (SIGMOD09)Quincy: Fair scheduling for distributed computing clusters (SOSP09)Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations (SOSP09)DryadInc: Reusing work in large scale computation (HotCloud 09).3233Extra SlidesImageProcessingCosmos DFSSQL ServersSoftware Stack34Windows ServerCluster Services (Azure, HPC, or Cosmos)Azure StorageDryadDryadLINQWindows ServerWindows ServerWindows ServerOther LanguagesCIFS/NTFSMachineLearningGraphAnalysisDataMiningApplicationsOther Applications34Design Space35ThroughputLatencyInternetPrivatedatacenterData-parallelSharedmemoryDryadSearchHPCGridTransaction3536Debugging and Diagnostics

Artemis