
Page 1: The Dryad ecosystem

The Dryad ecosystem

Rebecca Isaacs
Microsoft Research Silicon Valley

Page 2: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 3: The Dryad ecosystem

Data-parallel programming

• Partition large data sets and process the pieces in parallel

• Programming frameworks have made this easy
  – The execution environment (e.g. Dryad, Hadoop) deals with scheduling of tasks, movement of data, and fault tolerance
  – A high-level language (e.g. DryadLINQ, Pig Latin) allows the programmer to express the parallelism in a declarative fashion

Page 4: The Dryad ecosystem

Dryad (Isard et al, EuroSys 07)

• Generalized MapReduce

• Programs are dataflow graphs (DAGs)
  – Vertices (nodes) connected by channels (edges)
  – Channels are implemented as shared-memory FIFOs, TCP streams, or files

• The scheduler dispatches vertices onto machines to run the program
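To make the dataflow model concrete, the following is a minimal sketch of a job graph as vertices connected by typed channels. The type and member names here are hypothetical illustrations, not Dryad's actual API.

using System;
using System.Collections.Generic;

// Hypothetical illustration of the Dryad job-graph abstraction: a job is a
// DAG of vertices connected by channels, and each channel is realised as a
// shared-memory FIFO, a TCP stream, or a file.
enum ChannelKind { SharedMemoryFifo, TcpStream, File }

class Vertex
{
    public string Name;
    public List<Channel> Inputs = new List<Channel>();
    public List<Channel> Outputs = new List<Channel>();
    public Vertex(string name) { Name = name; }
}

class Channel
{
    public Vertex Source, Destination;
    public ChannelKind Kind;
}

class JobGraph
{
    public List<Vertex> Vertices = new List<Vertex>();

    public Vertex AddVertex(string name)
    {
        var v = new Vertex(name);
        Vertices.Add(v);
        return v;
    }

    // Connect two vertices with a channel of the given kind.
    public void Connect(Vertex from, Vertex to, ChannelKind kind)
    {
        var c = new Channel { Source = from, Destination = to, Kind = kind };
        from.Outputs.Add(c);
        to.Inputs.Add(c);
    }

    static void Main()
    {
        // A tiny map/reduce-style graph: two map vertices feeding one reduce vertex.
        var g = new JobGraph();
        var m0 = g.AddVertex("Map[0]");
        var m1 = g.AddVertex("Map[1]");
        var r  = g.AddVertex("Reduce[0]");
        g.Connect(m0, r, ChannelKind.File);
        g.Connect(m1, r, ChannelKind.File);
        Console.WriteLine($"Vertices: {g.Vertices.Count}");
    }
}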

Page 5: The Dryad ecosystem

Dryad components

[Diagram: the job manager (JM) communicates with daemons (D) on the cluster machines over the control plane, driven by the job schedule; vertices (V) run on the machines and exchange data over files, FIFOs, or the network (the data plane).]

Page 6: The Dryad ecosystem

Dryad computations

[Diagram: an example Dryad job. Input files feed stages of vertices (processes), e.g. M, X, and R vertices, connected by channels; the final stage writes the output files.]

Page 7: The Dryad ecosystem

DryadLINQ (Yu et al, OSDI 08)

• LINQ is a set of .NET constructs for programming with datasets
  – Relational databases, XML, ...
  – Supported by new language features in C#, Visual Basic, F#
  – Lazy evaluation on the data source

• DryadLINQ extends LINQ with
  – Partitioned datasets
  – Some additional operators
  – Compilation of LINQ expressions into data-parallel operations expressed as a Dryad dataflow graph

Page 8: The Dryad ecosystem

DryadLINQ example

• Join: find the lines in a file that start with one of the keywords in a second file

DryadTable<LineRecord> table = ddc.GetPartitionedTable<LineRecord>(mydata);
DryadTable<LineRecord> keywords = ddc.GetTable<LineRecord>(keys);
IQueryable<LineRecord> matches =
    table.Join(keywords,
               l1 => l1.line.Split(' ').First(), /* first key */
               l2 => l2.line,                    /* second key */
               (l1, l2) => l1);                  /* keep first line */

Page 9: The Dryad ecosystem

Dryad execution graph for join

Work is distributed: each word is sent to a machine based on its hash.

[Graph annotations: the data file has 2 partitions, the keys file has 1 partition, and the output file has 2 partitions.]

Page 10: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Joint work with Paul Barham and Richard Black (MSR Cambridge / Silicon Valley) and Simon Peter and Timothy Roscoe (ETH Zurich)

Page 11: The Dryad ecosystem

How do vertices behave?

• Use the Performance Analyzer tool from Microsoft (search for xperf on MSDN)

• Detailed view of one vertex
  – A "Select" operation
  – Reads and writes 1 million 1 KB records on the local disk

Page 12: The Dryad ecosystem

Select vertex, version 1

• Hardware:
  – 2 quad-core processors, 2.66 GHz
  – 2 disks configured as a striped volume

Page 13: The Dryad ecosystem

[Trace for version 1: CPU utilization and the utilization of Disk1 and Disk2 over time, with reads shown in red and writes in blue.]

Page 14: The Dryad ecosystem

Select vertex, version 2

• Hardware:
  – 1 quad-core processor, 2 GHz
  – 1 disk

Page 15: The Dryad ecosystem

[Trace for version 2: CPU utilization and disk utilization over time, with reads shown in red and writes in blue.]

Page 16: The Dryad ecosystem

View of thread activity:

• Data is read and then written in batches
• A reader thread issues the reads; other threads pick up the I/O completions, sometimes issuing writes
• NB: the processors are 95% idle during the execution of this vertex

Page 17: The Dryad ecosystem

Observations

• The bottleneck resource changes every few seconds
  – And may not be 100% utilized

• Vertices are multi-threaded, consuming multiple resources simultaneously

• Dryad is engineered for throughput
  – Sequential I/O
  – Batched in 256 KB chunks
  – Requests are pipelined, typically 4+ deep

• Most DryadLINQ vertices are standard data-processing operators with predictable behaviour

Page 18: The Dryad ecosystem

Factors affecting vertex execution times

• Hardware:
  – CPU speed
  – Number of CPUs
  – Disk transfer rate
  – Network transfer rate

• Workload:
  – I/O size
  – We assume file access patterns stay the same

• Placement relative to parent(s):
  – Channels can be local (read from or write to local disk)
  – Or remote (read from a remote or local file via SMB)

Page 19: The Dryad ecosystem

Key idea: identify vertex phases

• Trace a reference execution of the vertex

• Identify phases within which resource demands are consistent

• Phase boundaries are where the resource demands change
  – E.g. start reading, stop reading, etc.

• Similar phases, in terms of resource consumption, are grouped together (a sketch of the idea follows below)
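The real traces come from xperf; purely to illustrate the segmentation idea, here is a hypothetical sketch that splits a per-interval trace of CPU and disk demand into phases wherever the set of busy resources changes. The names and the 5% "busy" threshold are illustrative assumptions, not the actual analyser.

using System.Collections.Generic;

// Hypothetical sketch: split a per-interval resource trace into phases.
// A phase boundary is any sample at which the set of busy resources changes
// (e.g. the vertex starts or stops reading).
class PhaseDetector
{
    // One trace sample: the fraction of the interval spent on CPU and on disk I/O.
    public record Sample(double Cpu, double Disk);

    public static List<(int Start, int End)> FindPhases(IReadOnlyList<Sample> trace, double busy = 0.05)
    {
        var phases = new List<(int Start, int End)>();
        int start = 0;
        for (int i = 1; i < trace.Count; i++)
        {
            bool cpuChanged  = (trace[i].Cpu  > busy) != (trace[i - 1].Cpu  > busy);
            bool diskChanged = (trace[i].Disk > busy) != (trace[i - 1].Disk > busy);
            if (cpuChanged || diskChanged)          // resource demands changed: new phase
            {
                phases.Add((start, i - 1));
                start = i;
            }
        }
        if (trace.Count > 0) phases.Add((start, trace.Count - 1));
        return phases;
    }
}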

Page 20: The Dryad ecosystem

Phases in the Select vertex

Page 21: The Dryad ecosystem

Phases in the Select vertex

[Trace annotated with per-phase resource demands, e.g. Dcpu = 70 ms, Ddisk = 30 ms; Dcpu = 20 ms, Ddisk = 40 ms; Dcpu = 40 ms.]

Page 22: The Dryad ecosystem

Predicting phase runtimes

• Each phase has the attributes:
  – Type: read, write, both, compute, overhead
  – "Concurrency histogram"
  – File being read/written
  – Number of bytes read/written
  i.e. the demands on each resource

• Simple operational laws can be applied to each phase individually
  – Can predict its runtime on different hardware (see the sketch below)
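As a minimal sketch of the kind of prediction this enables (not the paper's exact model), assume a phase is summarized by its CPU and disk demands: scale each demand by the ratio of reference to target hardware speed, then bound the phase time below by the largest scaled demand (fully overlapped resources) and above by their sum (no overlap). The type names and the specific bounds are illustrative assumptions; the example numbers come from the slides.

using System;

// Hypothetical sketch: scale a phase's resource demands from the reference
// machine to a target machine and bound its runtime with operational laws.
class PhasePrediction
{
    public record Hardware(double CpuGHz, double DiskMBps);
    public record Phase(double CpuSeconds, double DiskSeconds);

    public static (double Lower, double Upper) Predict(Phase p, Hardware reference, Hardware target)
    {
        // Scale each demand by how much slower or faster the target resource is.
        double cpu  = p.CpuSeconds  * (reference.CpuGHz  / target.CpuGHz);
        double disk = p.DiskSeconds * (reference.DiskMBps / target.DiskMBps);

        // Fully overlapped resources give the lower bound, no overlap the upper bound.
        return (Math.Max(cpu, disk), cpu + disk);
    }

    static void Main()
    {
        var reference = new Hardware(CpuGHz: 2.66, DiskMBps: 140);
        var laptop    = new Hardware(CpuGHz: 1.83, DiskMBps: 20);
        var phase     = new Phase(CpuSeconds: 0.07, DiskSeconds: 0.03);  // 70 ms CPU, 30 ms disk

        var (lo, hi) = Predict(phase, reference, laptop);
        Console.WriteLine($"Predicted phase runtime on the laptop: {lo:F3}-{hi:F3} s");
    }
}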

Page 23: The Dryad ecosystem

Expectations of accuracy

• Inherent variability in running times:
  – Layout of the file on disk
    • Inner or outer track
    • File fragmentation
  – Background processes
    • Logging and scanning services
  – Unanticipated network effects
  – Model deficiencies
    • Memory contention
    • Caching
    • Garbage collection

• Prediction within 30% of actual would be good...

Page 24: The Dryad ecosystem

Prediction accuracy evaluation

Label            Read (MB/s)  Write (MB/s)  CPU (GHz x cores)  I/O           Pred (s)  Avg (s)  Error (%)
Reference        140          128           2.66 * 8           1.0           20.7      18.7      9.9
½ size I/O       140          128           2.66 * 8           0.5           10.3      10.1      1.9
Desktop          42           42            2.39 * 2           1.0           51.6      48.8      5.7
Remote           11.5         128           2.66 * 8           1.0, remote   18.1      29.6     38.9
File server      210          180           2.0 * 4            1.0           18.8      16.7     12.6
Laptop, remote   20           20            1.83               1.0, remote   79.9      91.9     13.1

Merge vertex with 1 input and 1 output; predicted vs actual running time on different hardware, averaged over 10 runs.

Page 25: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 26: The Dryad ecosystem

The parallelism spectrum

[Diagram: the parallelism spectrum, from a shared-memory multiprocessor (cores sharing memory and disk), through homogeneous clusters and data centres, to small, heterogeneous clusters.]

Page 27: The Dryad ecosystem

Ad-hoc clusters

• Small, heterogeneous clusters are everywhere
  – In the workplace
  – In my house... and yours?

• They could be pretty useful for data-parallel programming
  – Data mining
  – Video editing
  – Scientific applications
  – ...

Page 28: The Dryad ecosystem

A data-parallel programming framework for ad-hoc clusters?

• Why?
  – Exploit unused machines, with no hardcoded assumptions about hardware and availability
  – "Easy" to write and run the code

• Why not?
  – Heterogeneity: the wrong schedule can make it go badly wrong
  – Built-in assumptions about failure don't apply

• Our solution:
  1. Construct vertex performance models
  2. Apply a constraint-based search procedure to find a good assignment of vertices to the physical computers

Page 29: The Dryad ecosystem

Default scheduling in Dryad

• The DryadLINQ compiler creates an XML description of the vertices and how they are connected

• The job manager places the vertices on available nodes according to constraints specified in the XML file
  – Greedy scheduling approach
  – The programmer and/or the DryadLINQ compiler can provide hints

Page 30: The Dryad ecosystem

Heterogeneity can cause problems for greedy scheduling

Page 31: The Dryad ecosystem

Add a performance-aware planner to the end-to-end picture

[Diagram: a logging service on each node produces a CPU and I/O log; the vertex phase analyser turns the logs into vertex phase summaries; the performance planner combines these summaries with the job's XML graph to produce an updated XML graph.]

Page 32: The Dryad ecosystem

Planning algorithm

• Implemented with a constraint logic programming system (ECLiPSe)

• Constraints prune the search space

• Heuristics reduce search time
  – E.g. decide where to place the longest-running vertices first

• The greedy schedule gives an upper bound for the search (a sketch of the idea follows below)
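The planner itself is written in ECLiPSe; purely as an illustration of the underlying idea, here is a hypothetical C# sketch that seeds a placement search with a greedy plan and prunes any partial assignment that is already worse. The runtime matrix stands in for the vertex performance model, and precedence constraints and contention between vertices are ignored in this simplified cost model.

using System;
using System.Linq;

// Hypothetical sketch of the planning idea: search over vertex-to-machine
// assignments, seeding the bound with a greedy plan and pruning any partial
// plan that cannot beat the best complete plan found so far.
// runtime[v, m] = predicted running time of vertex v on machine m (seconds).
class Planner
{
    public static (int[] Plan, double Makespan) Search(double[,] runtime)
    {
        int vertices = runtime.GetLength(0), machines = runtime.GetLength(1);

        // Greedy upper bound: put each vertex on the machine that finishes it earliest.
        var load = new double[machines];
        var best = new int[vertices];
        for (int v = 0; v < vertices; v++)
        {
            int m = Enumerable.Range(0, machines).OrderBy(i => load[i] + runtime[v, i]).First();
            best[v] = m;
            load[m] += runtime[v, m];
        }
        double bound = load.Max();

        // Branch and bound over all assignments, pruning with the current bound.
        Array.Clear(load, 0, machines);
        var plan = new int[vertices];
        void Assign(int v)
        {
            if (v == vertices)
            {
                double makespan = load.Max();
                if (makespan < bound) { bound = makespan; best = (int[])plan.Clone(); }
                return;
            }
            for (int m = 0; m < machines; m++)
            {
                if (load[m] + runtime[v, m] >= bound) continue;   // prune this subtree
                load[m] += runtime[v, m]; plan[v] = m;
                Assign(v + 1);
                load[m] -= runtime[v, m];
            }
        }
        Assign(0);
        return (best, bound);
    }
}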

Page 33: The Dryad ecosystem

Contention between vertices

[Gantt charts: predicted placements of the Hash(0/1), Merge(0/1) and Join(0/1) vertices over a 0-200 s timeline, without and with the contention model.]

Page 34: The Dryad ecosystem

Workloads for experimental eval

Page 35: The Dryad ecosystem

Physical config of cluster

Machine   Num CPUs  CPU (GHz)  RAM (GB)  Disk read (MBps)  Disk write (MBps)  Network (Mbps)
Laptop    1         1.8        2         20                20                 1000
Desktop   2         2.4        2         42                42                 1000
Server    4         2.0        4         210               180                1000

3 machines, quite heterogeneous.

Page 36: The Dryad ecosystem

Overall speed-up vs greedy

Workload  Greedy (s) min/med/max  Exhaustive (s)  Achieved (s) min/med/max  Speedup (%)
Algebra   114/120/155             73              71/73/87                  39
Join      165/171/206             115             114/123/144               28
Terasort  155                     127             123/143/204                8

Page 37: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 38: The Dryad ecosystem

Edison

• New project
  – Position paper to appear at HotOS 11
  – Joint work with Moises Goldszmidt

• Performance problems in Dryad clusters
  – Resource contention
  – Data or computation skew
  – Hardware issues
  – Often transient

• Use active intervention
  – Re-run the vertex in a sandbox on the cluster
  – Construct experiments using its causal model
  – Systematically probe behavior: fix some variables while altering others

Page 39: The Dryad ecosystem

Circuit blueprint

[Diagram: a small logic circuit with gates G1-G4 and inputs A and B, used as an analogy for a causal model.]

• Given partial observations, the blueprint lets us make inferences about the state of the circuit

• If we intervene and fix some inputs, it also lets us make inferences about the state of the circuit

Page 40: The Dryad ecosystem

Blueprint of a vertex

[Diagram: a causal model of a vertex relating its running time to per-phase reading and computing times, which in turn depend on CPU, data size, input data rate, and disk and network congestion.]

• A blueprint for inferring state from both observations and interventions
• Answer "what-if" questions
• Root-cause analysis

Page 41: The Dryad ecosystem

Overview

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 42: The Dryad ecosystem

Quincy (Isard et al, SOSP 09)

• Need to share the cluster between multiple, concurrently executing jobs

• Goals are fairness and data locality
  – If job x takes t seconds when run exclusively on the cluster, then x should take no more than J × t seconds when the cluster has J jobs
  – Very large datasets are stored on the cluster itself, so unnecessary data movement is costly

• These goals conflict
  – Optimal data locality => delay a job until resources close to its data become available
  – Fairness => allocate resources as soon as they are available

Page 43: The Dryad ecosystem

Quincy (cont)

• Strategies for fairness:
  – Sacrifice locality
  – Kill already-running jobs
  – Admission control

• Fairness is achieved at a cost to throughput

Page 44: The Dryad ecosystem

Quincy (cont)

• Quantify every scheduling decision
  – Data transfer cost
  – Cost in wasted time if a task is killed

• Express the scheduling problem as a flow network
  – Represent all worker tasks that are ready to run, with their preferred locations, and all currently running tasks
  – Edge weights and capacities encode the scheduling policy
  – Produces a set of scheduling assignments for all jobs at once that satisfy a global criterion

• Solve online with a standard min-cost flow algorithm (a simplified sketch of the graph construction follows)
  – The graph is updated whenever anything changes
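Purely as an illustration of the flow-network encoding (the node and cost structure here is heavily simplified relative to the paper), a hypothetical sketch of the graph construction: each ready task gets an edge to every machine, weighted by its data transfer cost, plus an escape edge to a per-job "unscheduled" node; a min-cost flow solver over this graph would then yield the assignments. All names are illustrative assumptions.

using System;
using System.Collections.Generic;

// Hypothetical sketch of Quincy-style graph construction (heavily simplified):
// each ready task contributes one unit of flow that must reach the sink,
// either through a machine node (cost = data transfer cost to that machine)
// or through a per-job "unscheduled" node (the cost of leaving it waiting).
// A min-cost flow solver over this graph yields one assignment for all jobs.
class SchedulingGraph
{
    public record Edge(string From, string To, int Capacity, double Cost);
    public readonly List<Edge> Edges = new List<Edge>();

    public SchedulingGraph(IReadOnlyList<string> machines)
    {
        foreach (var m in machines)
            Edges.Add(new Edge(m, "sink", 1, 0.0));   // one running task per machine slot
    }

    public void AddJob(string job, IEnumerable<string> readyTasks,
                       IReadOnlyList<string> machines,
                       Func<string, string, double> transferCost, double waitCost)
    {
        foreach (var task in readyTasks)
        {
            foreach (var m in machines)
                Edges.Add(new Edge(task, m, 1, transferCost(task, m)));
            Edges.Add(new Edge(task, "unscheduled:" + job, 1, waitCost));  // escape edge
        }
        Edges.Add(new Edge("unscheduled:" + job, "sink", int.MaxValue, 0.0));
    }
}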

Page 45: The Dryad ecosystem

TidyFS (Fetterly et al, USENIX 11)

• A simple distributed file system
  – Like HDFS or GFS

• Highly optimized to perform well for data-parallel computations:
  – Data streams are striped across cluster nodes
  – Stream parts are read or written in parallel, each by a single process
  – I/O is sequential, for high throughput
  – Streams are replaced rather than modified
  – In case of failure, missing parts of output streams can easily be regenerated

Page 46: The Dryad ecosystem

TidyFS (cont)

• Data streams contain parts
  – Parts are replicated lazily
    • A failure before replication completes is handled by Dryad regenerating the missing part(s)
  – Parts are "native" files, e.g. NTFS files or SQL Server databases
    • Read and written using native APIs

• Centralized metadata server
  – Replicated for fault tolerance
  – Replicas synchronize using Paxos

Page 47: The Dryad ecosystem

Artemis (Cretu-Ciocarlie et al, WASL 08)

• Management and analysis of Dryad logs
  – Each vertex produces around 1 MB/s of log data per process
  – A single Dryad job can easily produce >1 TB of log data

• Runs and logs continuously on the cluster
  – To locate and collate the log data for a particular job, Artemis itself runs a DryadLINQ computation on the cluster

• Combines job manager and vertex logs with over 80 Windows performance counters

• Sophisticated GUI for post-processing and visualization
  – Histograms and time series are especially helpful for performance debugging

Page 48: The Dryad ecosystem

Nectar (Gunda et al, OSDI 10)

• Key idea: data and the computation that generates it are interchangeable
  – Datasets are uniquely identified by the programs that produce them (a sketch of such a key follows below)

• Automatic data management
  – Cluster-wide caching service
    • Re-use of common datasets to save computation and space
  – Garbage collection of obsolete datasets
    • Data can be transparently regenerated

• The Nectar client service interposes on the DryadLINQ compiler
  – Consults the Nectar cache server and rewrites the program appropriately
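To illustrate the "computation identifies its data" idea, here is a minimal hypothetical sketch of deriving a cache key from the program text and the identities of its inputs; Nectar's real fingerprinting scheme is more involved, and the names below are illustrative assumptions.

using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch: a dataset is identified by a fingerprint of the
// program that produces it together with the identities of its inputs, so
// a cluster-wide cache can map the key to an already-materialized result.
static class DatasetCache
{
    public static string CacheKey(string programText, params string[] inputDatasetIds)
    {
        string canonical = programText + "\n" + string.Join("\n", inputDatasetIds);
        using var sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(canonical));
        return BitConverter.ToString(digest).Replace("-", "");  // same program + same inputs => same key
    }
}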

Page 49: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 50: The Dryad ecosystem

Current status

• Imminent release of Dryad and DryadLINQ on Windows HPC
  – Uses the HPC scheduler and other cluster services
  – Includes TidyFS as DSC (Distributed Storage Catalog)
  – A Technology Preview is available to download and try

• Ongoing research...
  – Naiad: allowing cycles in Dryad graphs
    • Strongly connected components: a more general programming model
    • Loops and convergence tests without the need for driver programs
    • Continuous queries on streaming data

Page 51: The Dryad ecosystem

Conclusion

• 256-server cluster at MSR-SV
  – In continuous use by researchers and interns
  – Used for information retrieval, machine learning, vision, algorithms, planning, network trace analysis, ...
  – The mature components are in active use (Dryad, DryadLINQ, Quincy, TidyFS, Artemis)

• "Real users" drive the continuous enrichment of the Dryad ecosystem
  – But many good research problems remain!