
Page 1: The Dryad ecosystem

The Dryad ecosystem

Rebecca Isaacs
Microsoft Research Silicon Valley

Page 2: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 3: The Dryad ecosystem

Data-parallel programming

• Partition large data sets and process the pieces in parallel

• Programming frameworks have made this easy
  – The execution environment (e.g. Dryad, Hadoop) deals with scheduling of tasks, movement of data, and fault tolerance
  – A high-level language (e.g. DryadLINQ, Pig Latin) allows the programmer to express the parallelism in a declarative fashion

Page 4: The Dryad ecosystem

Dryad (Isard et al, EuroSys 07)

• Generalized MapReduce

• Programs are dataflow graphs (DAGs)
  – Vertices (nodes) connected by channels (edges)
  – Channels are implemented as shared-memory FIFOs, TCP streams, or files

• The scheduler dispatches vertices onto machines to run the program
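To make the dataflow model concrete, the following is a minimal sketch of a job graph as vertices connected by typed channels. The type and member names here are hypothetical illustrations, not Dryad's actual API.

using System;
using System.Collections.Generic;

// Hypothetical illustration of the Dryad job-graph abstraction: a job is a
// DAG of vertices connected by channels, and each channel is realised as a
// shared-memory FIFO, a TCP stream, or a file.
enum ChannelKind { SharedMemoryFifo, TcpStream, File }

class Vertex
{
    public string Name;
    public List<Channel> Inputs = new List<Channel>();
    public List<Channel> Outputs = new List<Channel>();
    public Vertex(string name) { Name = name; }
}

class Channel
{
    public Vertex Source, Destination;
    public ChannelKind Kind;
}

class JobGraph
{
    public List<Vertex> Vertices = new List<Vertex>();

    public Vertex AddVertex(string name)
    {
        var v = new Vertex(name);
        Vertices.Add(v);
        return v;
    }

    // Connect two vertices with a channel of the given kind.
    public void Connect(Vertex from, Vertex to, ChannelKind kind)
    {
        var c = new Channel { Source = from, Destination = to, Kind = kind };
        from.Outputs.Add(c);
        to.Inputs.Add(c);
    }

    static void Main()
    {
        // A tiny map/reduce-style graph: two map vertices feeding one reduce vertex.
        var g = new JobGraph();
        var m0 = g.AddVertex("Map[0]");
        var m1 = g.AddVertex("Map[1]");
        var r  = g.AddVertex("Reduce[0]");
        g.Connect(m0, r, ChannelKind.File);
        g.Connect(m1, r, ChannelKind.File);
        Console.WriteLine($"Vertices: {g.Vertices.Count}");
    }
}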

Page 5: The Dryad ecosystem

Dryad components

[Diagram: the job manager (JM) communicates with daemons (D) on the cluster machines over the control plane, driven by the job schedule; vertices (V) run on the machines and exchange data over files, FIFOs, or the network (the data plane).]

Page 6: The Dryad ecosystem

Dryad computations

[Diagram: an example Dryad job. Input files feed stages of vertices (processes), e.g. M, X, and R vertices, connected by channels; the final stage writes the output files.]

Page 7: The Dryad ecosystem

DryadLINQ (Yu et al, OSDI 08)

• LINQ is a set of .NET constructs for programming with datasets
  – Relational databases, XML, ...
  – Supported by new language features in C#, Visual Basic, F#
  – Lazy evaluation on the data source

• DryadLINQ extends LINQ with
  – Partitioned datasets
  – Some additional operators
  – Compilation of LINQ expressions into data-parallel operations expressed as a Dryad dataflow graph

Page 8: The Dryad ecosystem

DryadLINQ example

• Join: find the lines in a file that start with one of the keywords in a second file

DryadTable<LineRecord> table = ddc.GetPartitionedTable<LineRecord>(mydata);
DryadTable<LineRecord> keywords = ddc.GetTable<LineRecord>(keys);
IQueryable<LineRecord> matches =
    table.Join(keywords,
               l1 => l1.line.Split(' ').First(), /* first key */
               l2 => l2.line,                    /* second key */
               (l1, l2) => l1);                  /* keep first line */

Page 9: The Dryad ecosystem

Dryad execution graph for join

Work is distributed: each word is sent to a machine based on its hash.

[Graph annotations: the data file has 2 partitions, the keys file has 1 partition, and the output file has 2 partitions.]

Page 10: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Joint work with Paul Barham and Richard Black (MSR Cambridge / Silicon Valley) and Simon Peter and Timothy Roscoe (ETH Zurich)

Page 11: The Dryad ecosystem

How do vertices behave?

• Use the Performance Analyzer tool from Microsoft (search for xperf on MSDN)

• Detailed view of one vertex
  – A "Select" operation
  – Reads and writes 1 million 1 KB records on the local disk

Page 12: The Dryad ecosystem

Select vertex, version 1

• Hardware:
  – 2 quad-core processors, 2.66 GHz
  – 2 disks configured as a striped volume

Page 13: The Dryad ecosystem

[Trace for version 1: CPU utilization and the utilization of Disk1 and Disk2 over time, with reads shown in red and writes in blue.]

Page 14: The Dryad ecosystem

Select vertex, version 2

• Hardware:
  – 1 quad-core processor, 2 GHz
  – 1 disk

Page 15: The Dryad ecosystem

[Trace for version 2: CPU utilization and disk utilization over time, with reads shown in red and writes in blue.]

Page 16: The Dryad ecosystem

View of thread activity:

• Data is read and then written in batches
• A reader thread issues the reads; other threads pick up the I/O completions, sometimes issuing writes
• NB: the processors are 95% idle during the execution of this vertex

Page 17: The Dryad ecosystem

Observations

• The bottleneck resource changes every few seconds
  – And may not be 100% utilized

• Vertices are multi-threaded, consuming multiple resources simultaneously

• Dryad is engineered for throughput
  – Sequential I/O
  – Batched in 256 KB chunks
  – Requests are pipelined, typically 4+ deep

• Most DryadLINQ vertices are standard data-processing operators with predictable behaviour

Page 18: The Dryad ecosystem

Factors affecting vertex execution times

• Hardware:
  – CPU speed
  – Number of CPUs
  – Disk transfer rate
  – Network transfer rate

• Workload:
  – I/O size
  – We assume file access patterns stay the same

• Placement relative to parent(s):
  – Channels can be local (read from or write to local disk)
  – Or remote (read from a remote or local file via SMB)

Page 19: The Dryad ecosystem

Key idea: identify vertex phases

• Trace a reference execution of the vertex

• Identify phases within which resource demands are consistent

• Phase boundaries are where the resource demands change
  – E.g. start reading, stop reading, etc.

• Similar phases, in terms of resource consumption, are grouped together (a sketch of the idea follows below)
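The real traces come from xperf; purely to illustrate the segmentation idea, here is a hypothetical sketch that splits a per-interval trace of CPU and disk demand into phases wherever the set of busy resources changes. The names and the 5% "busy" threshold are illustrative assumptions, not the actual analyser.

using System.Collections.Generic;

// Hypothetical sketch: split a per-interval resource trace into phases.
// A phase boundary is any sample at which the set of busy resources changes
// (e.g. the vertex starts or stops reading).
class PhaseDetector
{
    // One trace sample: the fraction of the interval spent on CPU and on disk I/O.
    public record Sample(double Cpu, double Disk);

    public static List<(int Start, int End)> FindPhases(IReadOnlyList<Sample> trace, double busy = 0.05)
    {
        var phases = new List<(int Start, int End)>();
        int start = 0;
        for (int i = 1; i < trace.Count; i++)
        {
            bool cpuChanged  = (trace[i].Cpu  > busy) != (trace[i - 1].Cpu  > busy);
            bool diskChanged = (trace[i].Disk > busy) != (trace[i - 1].Disk > busy);
            if (cpuChanged || diskChanged)          // resource demands changed: new phase
            {
                phases.Add((start, i - 1));
                start = i;
            }
        }
        if (trace.Count > 0) phases.Add((start, trace.Count - 1));
        return phases;
    }
}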

Page 20: The Dryad ecosystem

Phases in the Select vertex

Page 21: The Dryad ecosystem

Phases in the Select vertex

[Trace annotated with per-phase resource demands, e.g. Dcpu = 70 ms, Ddisk = 30 ms; Dcpu = 20 ms, Ddisk = 40 ms; Dcpu = 40 ms.]

Page 22: The Dryad ecosystem

Predicting phase runtimes

• Each phase has the attributes:
  – Type: read, write, both, compute, overhead
  – "Concurrency histogram"
  – File being read/written
  – Number of bytes read/written
  i.e. the demands on each resource

• Simple operational laws can be applied to each phase individually
  – Can predict its runtime on different hardware (see the sketch below)
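As a minimal sketch of the kind of prediction this enables (not the paper's exact model), assume a phase is summarized by its CPU and disk demands: scale each demand by the ratio of reference to target hardware speed, then bound the phase time below by the largest scaled demand (fully overlapped resources) and above by their sum (no overlap). The type names and the specific bounds are illustrative assumptions; the example numbers come from the slides.

using System;

// Hypothetical sketch: scale a phase's resource demands from the reference
// machine to a target machine and bound its runtime with operational laws.
class PhasePrediction
{
    public record Hardware(double CpuGHz, double DiskMBps);
    public record Phase(double CpuSeconds, double DiskSeconds);

    public static (double Lower, double Upper) Predict(Phase p, Hardware reference, Hardware target)
    {
        // Scale each demand by how much slower or faster the target resource is.
        double cpu  = p.CpuSeconds  * (reference.CpuGHz  / target.CpuGHz);
        double disk = p.DiskSeconds * (reference.DiskMBps / target.DiskMBps);

        // Fully overlapped resources give the lower bound, no overlap the upper bound.
        return (Math.Max(cpu, disk), cpu + disk);
    }

    static void Main()
    {
        var reference = new Hardware(CpuGHz: 2.66, DiskMBps: 140);
        var laptop    = new Hardware(CpuGHz: 1.83, DiskMBps: 20);
        var phase     = new Phase(CpuSeconds: 0.07, DiskSeconds: 0.03);  // 70 ms CPU, 30 ms disk

        var (lo, hi) = Predict(phase, reference, laptop);
        Console.WriteLine($"Predicted phase runtime on the laptop: {lo:F3}-{hi:F3} s");
    }
}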

Page 23: The Dryad ecosystem

Expectations of accuracy

• Inherent variability in running times:
  – Layout of the file on disk
    • Inner or outer track
    • File fragmentation
  – Background processes
    • Logging and scanning services
  – Unanticipated network effects
  – Model deficiencies
    • Memory contention
    • Caching
    • Garbage collection

• Prediction within 30% of actual would be good...

Page 24: The Dryad ecosystem

Prediction accuracy evaluation

Label            Read (MB/s)  Write (MB/s)  CPU (GHz x cores)  I/O           Pred (s)  Avg (s)  Error (%)
Reference        140          128           2.66 * 8           1.0           20.7      18.7      9.9
½ size I/O       140          128           2.66 * 8           0.5           10.3      10.1      1.9
Desktop          42           42            2.39 * 2           1.0           51.6      48.8      5.7
Remote           11.5         128           2.66 * 8           1.0, remote   18.1      29.6     38.9
File server      210          180           2.0 * 4            1.0           18.8      16.7     12.6
Laptop, remote   20           20            1.83               1.0, remote   79.9      91.9     13.1

Merge vertex with 1 input and 1 output; predicted vs actual running time on different hardware, averaged over 10 runs.

Page 25: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 26: The Dryad ecosystem

The parallelism spectrum

[Diagram: the parallelism spectrum, from a shared-memory multiprocessor (cores sharing memory and disk), through homogeneous clusters and data centres, to small, heterogeneous clusters.]

Page 27: The Dryad ecosystem

Ad-hoc clusters

• Small, heterogeneous clusters are everywhere
  – In the workplace
  – In my house... and yours?

• They could be pretty useful for data-parallel programming
  – Data mining
  – Video editing
  – Scientific applications
  – ...

Page 28: The Dryad ecosystem

A data-parallel programming framework for ad-hoc clusters?

• Why?
  – Exploit unused machines, with no hardcoded assumptions about hardware and availability
  – "Easy" to write and run the code

• Why not?
  – Heterogeneity: the wrong schedule can make it go badly wrong
  – Built-in assumptions about failure don't apply

• Our solution:
  1. Construct vertex performance models
  2. Apply a constraint-based search procedure to find a good assignment of vertices to the physical computers

Page 29: The Dryad ecosystem

Default scheduling in Dryad

• The DryadLINQ compiler creates an XML description of the vertices and how they are connected

• The job manager places the vertices on available nodes according to constraints specified in the XML file
  – Greedy scheduling approach
  – The programmer and/or the DryadLINQ compiler can provide hints

Page 30: The Dryad ecosystem

Heterogeneity can cause problems for greedy scheduling

Page 31: The Dryad ecosystem

Add a performance-aware planner to the end-to-end picture

[Diagram: a logging service on each node produces a CPU and I/O log; the vertex phase analyser turns the logs into vertex phase summaries; the performance planner combines these summaries with the job's XML graph to produce an updated XML graph.]

Page 32: The Dryad ecosystem

Planning algorithm

• Implemented with a constraint logic programming system (ECLiPSe)

• Constraints prune the search space

• Heuristics reduce search time
  – E.g. decide where to place the longest-running vertices first

• The greedy schedule gives an upper bound for the search (a sketch of the idea follows below)
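The planner itself is written in ECLiPSe; purely as an illustration of the underlying idea, here is a hypothetical C# sketch that seeds a placement search with a greedy plan and prunes any partial assignment that is already worse. The runtime matrix stands in for the vertex performance model, and precedence constraints and contention between vertices are ignored in this simplified cost model.

using System;
using System.Linq;

// Hypothetical sketch of the planning idea: search over vertex-to-machine
// assignments, seeding the bound with a greedy plan and pruning any partial
// plan that cannot beat the best complete plan found so far.
// runtime[v, m] = predicted running time of vertex v on machine m (seconds).
class Planner
{
    public static (int[] Plan, double Makespan) Search(double[,] runtime)
    {
        int vertices = runtime.GetLength(0), machines = runtime.GetLength(1);

        // Greedy upper bound: put each vertex on the machine that finishes it earliest.
        var load = new double[machines];
        var best = new int[vertices];
        for (int v = 0; v < vertices; v++)
        {
            int m = Enumerable.Range(0, machines).OrderBy(i => load[i] + runtime[v, i]).First();
            best[v] = m;
            load[m] += runtime[v, m];
        }
        double bound = load.Max();

        // Branch and bound over all assignments, pruning with the current bound.
        Array.Clear(load, 0, machines);
        var plan = new int[vertices];
        void Assign(int v)
        {
            if (v == vertices)
            {
                double makespan = load.Max();
                if (makespan < bound) { bound = makespan; best = (int[])plan.Clone(); }
                return;
            }
            for (int m = 0; m < machines; m++)
            {
                if (load[m] + runtime[v, m] >= bound) continue;   // prune this subtree
                load[m] += runtime[v, m]; plan[v] = m;
                Assign(v + 1);
                load[m] -= runtime[v, m];
            }
        }
        Assign(0);
        return (best, bound);
    }
}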

Page 33: The Dryad ecosystem

Contention between vertices

[Gantt charts: predicted placements of the Hash(0/1), Merge(0/1) and Join(0/1) vertices over a 0-200 s timeline, without and with the contention model.]

Page 34: The Dryad ecosystem

Workloads for experimental eval

Page 35: The Dryad ecosystem

Physical config of cluster

Machine   Num CPUs  CPU (GHz)  RAM (GB)  Disk read (MBps)  Disk write (MBps)  Network (Mbps)
Laptop    1         1.8        2         20                20                 1000
Desktop   2         2.4        2         42                42                 1000
Server    4         2.0        4         210               180                1000

3 machines, quite heterogeneous.

Page 36: The Dryad ecosystem

Overall speed-up vs greedy

Workload  Greedy (s) min/med/max  Exhaustive (s)  Achieved (s) min/med/max  Speedup (%)
Algebra   114/120/155             73              71/73/87                  39
Join      165/171/206             115             114/123/144               28
Terasort  155                     127             123/143/204                8

Page 37: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 38: The Dryad ecosystem

Edison

• New project
  – Position paper to appear at HotOS 11
  – Joint work with Moises Goldszmidt

• Performance problems in Dryad clusters
  – Resource contention
  – Data or computation skew
  – Hardware issues
  – Often transient

• Use active intervention
  – Re-run the vertex in a sandbox on the cluster
  – Construct experiments using its causal model
  – Systematically probe behavior: fix some variables while altering others

Page 39: The Dryad ecosystem

Circuit blueprint

[Diagram: a small logic circuit with gates G1-G4 and inputs A and B, used as an analogy for a causal model.]

• Given partial observations, the blueprint lets us make inferences about the state of the circuit

• If we intervene and fix some inputs, it also lets us make inferences about the state of the circuit

Page 40: The Dryad ecosystem

Blueprint of a vertex

[Diagram: a causal model of a vertex relating its running time to per-phase reading and computing times, which in turn depend on CPU, data size, input data rate, and disk and network congestion.]

• A blueprint for inferring state from both observations and interventions
• Answer "what-if" questions
• Root-cause analysis

Page 41: The Dryad ecosystem

Overview

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 42: The Dryad ecosystem

Quincy (Isard et al, SOSP 09)

• Need to share the cluster between multiple, concurrently executing jobs

• Goals are fairness and data locality
  – If job x takes t seconds when run exclusively on the cluster, then x should take no more than J × t seconds when the cluster has J jobs
  – Very large datasets are stored on the cluster itself, so unnecessary data movement is costly

• These goals conflict
  – Optimal data locality => delay a job until resources close to its data become available
  – Fairness => allocate resources as soon as they are available

Page 43: The Dryad ecosystem

Quincy (cont)

• Strategies for fairness:
  – Sacrifice locality
  – Kill already-running jobs
  – Admission control

• Fairness is achieved at a cost to throughput

Page 44: The Dryad ecosystem

Quincy (cont)

• Quantify every scheduling decision
  – Data transfer cost
  – Cost in wasted time if a task is killed

• Express the scheduling problem as a flow network
  – Represent all worker tasks that are ready to run, with their preferred locations, and all currently running tasks
  – Edge weights and capacities encode the scheduling policy
  – Produces a set of scheduling assignments for all jobs at once that satisfy a global criterion

• Solve online with a standard min-cost flow algorithm (a simplified sketch of the graph construction follows)
  – The graph is updated whenever anything changes
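Purely as an illustration of the flow-network encoding (the node and cost structure here is heavily simplified relative to the paper), a hypothetical sketch of the graph construction: each ready task gets an edge to every machine, weighted by its data transfer cost, plus an escape edge to a per-job "unscheduled" node; a min-cost flow solver over this graph would then yield the assignments. All names are illustrative assumptions.

using System;
using System.Collections.Generic;

// Hypothetical sketch of Quincy-style graph construction (heavily simplified):
// each ready task contributes one unit of flow that must reach the sink,
// either through a machine node (cost = data transfer cost to that machine)
// or through a per-job "unscheduled" node (the cost of leaving it waiting).
// A min-cost flow solver over this graph yields one assignment for all jobs.
class SchedulingGraph
{
    public record Edge(string From, string To, int Capacity, double Cost);
    public readonly List<Edge> Edges = new List<Edge>();

    public SchedulingGraph(IReadOnlyList<string> machines)
    {
        foreach (var m in machines)
            Edges.Add(new Edge(m, "sink", 1, 0.0));   // one running task per machine slot
    }

    public void AddJob(string job, IEnumerable<string> readyTasks,
                       IReadOnlyList<string> machines,
                       Func<string, string, double> transferCost, double waitCost)
    {
        foreach (var task in readyTasks)
        {
            foreach (var m in machines)
                Edges.Add(new Edge(task, m, 1, transferCost(task, m)));
            Edges.Add(new Edge(task, "unscheduled:" + job, 1, waitCost));  // escape edge
        }
        Edges.Add(new Edge("unscheduled:" + job, "sink", int.MaxValue, 0.0));
    }
}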

Page 45: The Dryad ecosystem

TidyFS (Fetterly et al, USENIX 11)

• A simple distributed file system
  – Like HDFS or GFS

• Highly optimized to perform well for data-parallel computations:
  – Data streams are striped across cluster nodes
  – Stream parts are read or written in parallel, each by a single process
  – I/O is sequential, for high throughput
  – Streams are replaced rather than modified
  – In case of failure, missing parts of output streams can easily be regenerated

Page 46: The Dryad ecosystem

TidyFS (cont)

• Data streams contain parts
  – Parts are replicated lazily
    • A failure before replication completes is handled by Dryad regenerating the missing part(s)
  – Parts are "native" files, e.g. NTFS files or SQL Server databases
    • Read and written using native APIs

• Centralized metadata server
  – Replicated for fault tolerance
  – Replicas synchronize using Paxos

Page 47: The Dryad ecosystem

Artemis (Cretu-Ciocarlie et al, WASL 08)

• Management and analysis of Dryad logs
  – Each vertex produces around 1 MB/s of log data per process
  – A single Dryad job can easily produce >1 TB of log data

• Runs and logs continuously on the cluster
  – To locate and collate the log data for a particular job, Artemis itself runs a DryadLINQ computation on the cluster

• Combines job manager and vertex logs with over 80 Windows performance counters

• Sophisticated GUI for post-processing and visualization
  – Histograms and time series are especially helpful for performance debugging

Page 48: The Dryad ecosystem

Nectar (Gunda et al, OSDI 10)

• Key idea: data and the computation that generates it are interchangeable
  – Datasets are uniquely identified by the programs that produce them (a sketch of such a key follows below)

• Automatic data management
  – Cluster-wide caching service
    • Re-use of common datasets to save computation and space
  – Garbage collection of obsolete datasets
    • Data can be transparently regenerated

• The Nectar client service interposes on the DryadLINQ compiler
  – Consults the Nectar cache server and rewrites the program appropriately
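To illustrate the "computation identifies its data" idea, here is a minimal hypothetical sketch of deriving a cache key from the program text and the identities of its inputs; Nectar's real fingerprinting scheme is more involved, and the names below are illustrative assumptions.

using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch: a dataset is identified by a fingerprint of the
// program that produces it together with the identities of its inputs, so
// a cluster-wide cache can map the key to an already-materialized result.
static class DatasetCache
{
    public static string CacheKey(string programText, params string[] inputDatasetIds)
    {
        string canonical = programText + "\n" + string.Join("\n", inputDatasetIds);
        using var sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(canonical));
        return BitConverter.ToString(digest).Replace("-", "");  // same program + same inputs => same key
    }
}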

Page 49: The Dryad ecosystem

Outline

• Introduction
  – Dryad
  – DryadLINQ

• Vertex performance prediction
  – Scheduling for heterogeneous clusters
  – Causal models for performance debugging

• Support software
  – Quincy scheduler
  – TidyFS distributed filesystem
  – Artemis monitoring system
  – Nectar data management

• Current status

Page 50: The Dryad ecosystem

Current status

• Imminent release of Dryad and DryadLINQ on Windows HPC
  – Uses the HPC scheduler and other cluster services
  – Includes TidyFS as DSC (Distributed Storage Catalog)
  – A Technology Preview is available to download and try

• Ongoing research...
  – Naiad: allowing cycles in Dryad graphs
    • Strongly connected components: a more general programming model
    • Loops and convergence tests without the need for driver programs
    • Continuous queries on streaming data

Page 51: The Dryad ecosystem

Conclusion

• 256-server cluster at MSR-SV
  – In continuous use by researchers and interns
  – Used for information retrieval, machine learning, vision, algorithms, planning, network trace analysis, ...
  – The mature components are in active use (Dryad, DryadLINQ, Quincy, TidyFS, Artemis)

• "Real users" drive the continuous enrichment of the Dryad ecosystem
  – But many good research problems remain!