Monitoring and Debugging Dryad(LINQ) Applications with Daphne Vilas Jagannath, Zuoning Yin, Mihai Budiu University of Illinois, Microsoft Research SVC International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) 2011


Monitoring and Debugging Dryad(LINQ) Applications with Daphne

Vilas Jagannath, Zuoning Yin, Mihai Budiu
University of Illinois, Microsoft Research SVC

International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) 2011

Page 2

Programming Clusters: Marketing

Map-Reduce

Page 3

Programming Clusters: Reality

Page 4

Complexity Exposed

Correctness or performance bugs break the single-system abstraction

Page 5

Outline

• Motivation
• Job structure
• The Job Object Model
• Tools for job understanding
• Conclusions

Page 6

Data-Parallel Computation

• Application
• Language: Sawzall, FlumeJava (Sawzall, Java) | DryadLINQ, Scope (LINQ, SQL) | Pig, Hive (≈SQL)
• Execution: Map-Reduce | Dryad | Hadoop
• Storage: GFS, BigTable | Cosmos, Azure, HPC | HDFS, S3

Page 7

2-D Piping

• Unix Pipes: 1-D
  grep | sed | sort | awk | perl
• Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
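The second dimension can be made concrete with a small Python sketch. It assumes that the number attached to each command is the count of parallel copies of that stage; the `Stage` class and the helper functions are illustrative names, not part of Dryad.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline stage and how many parallel copies of it run (hypothetical model)."""
    program: str
    replicas: int


# The 2-D pipeline from the slide: every stage is replicated across the cluster.
PIPELINE = [
    Stage("grep", 1000),
    Stage("sed", 500),
    Stage("sort", 1000),
    Stage("awk", 500),
    Stage("perl", 50),
]


def total_vertices(pipeline):
    """Total number of processes needed across all stages."""
    return sum(stage.replicas for stage in pipeline)


def describe(pipeline):
    """Render the pipeline as 'program^replicas' joined by pipes."""
    return " | ".join(f"{s.program}^{s.replicas}" for s in pipeline)


if __name__ == "__main__":
    print(describe(PIPELINE))
    print(total_vertices(PIPELINE))
```

A 1-D Unix pipeline is the special case in which every stage has exactly one replica.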

Page 8

Dryad Job Structure

[Figure: job graph. Input files feed vertices (processes) running grep, sed, sort, awk, and perl; the vertices are connected by channels and grouped into stages, and the job writes output files.]

Page 9

Dryad System Architecture

[Figure: Dryad system architecture. Labels: Job manager, job schedule, control plane, data plane, Network, cluster, NS, Sched, Exec (×3), V (×3).]

Page 10

How does it work in detail?

[Figure: architecture diagram. Localhost: Application, IDE, Compiler, Job Submission. Firewall. Cluster/Cloud: Cluster Scheduler, Job Manager (JM), and Vertex processes, each with Exec and Storage. Legend: L: Logs, IO: Input/Output, R: Resources.]

Page 11

Logs – lots of them

• Job-related
  – Plan (xml), status, resources
• Job-manager
  – stdout.txt, stderr.txt, *.log
• Vertex
  – stdout.txt, *.log, *.xml, *.cmd

Page 12

Monitoring Tools Structure

[Figure: layered diagram. Cluster platforms (Cosmos, Scope, HPC v2, HPC v3), then the Cluster abstraction, then the Job Object Model, then Monitoring, Profiling, Debugging, and GUIs.]

Page 13

Job Object Model

[Figure: Logs → JOM → Views → Tools, with the labels Job, Vertices, and Plan.]
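As a rough illustration of what a tool-facing object model could look like, here is a minimal Python sketch. All class, field, and method names are assumptions made for this example. Only the idea of a job exposing its vertices, and the "Failed" state, come from the talk. The sample data is invented.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """One process in the job (field names are assumptions)."""
    name: str
    stage: str
    state: str  # "Failed" appears in the talk; other values are made up


@dataclass
class Job:
    """Tool-facing view of a job, independent of which cluster produced the logs."""
    name: str
    vertices: list = field(default_factory=list)

    def stages(self):
        """Stage names in the order they first appear among the vertices."""
        return list(dict.fromkeys(v.stage for v in self.vertices))

    def state_counts(self, stage=None):
        """Count vertices per state, optionally restricted to one stage."""
        return Counter(
            v.state for v in self.vertices if stage is None or v.stage == stage
        )


# Hypothetical job; in a real tool these objects would be built from the logs.
job = Job("example", [
    Vertex("grep.0", "grep", "Completed"),
    Vertex("grep.1", "grep", "Failed"),
    Vertex("sort.0", "sort", "Completed"),
])
```

Tools written against objects like these never need to know which log files or cluster platform the data came from.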

Page 14

Outline

• Motivation
• Job structure
• The Job Object Model
• Tools for job understanding
• Conclusions

Page 15

The Job Browser

[Figure: Job Browser screenshot with areas labeled Job, Stage, and Vertex.]

Page 16

Job Schedule

Page 17

Failure diagnosis

Page 18

Diagnosis decision tree

• “Hand-made”
• Least portable tool
• Incomplete
• High-coverage
• Bug types:
  – User-level
  – System-level
  – Cluster malfunction

Page 19

Powershell = Interactive Queries

$cluster = get-cluster X
$job = $cluster | select-AllJobs | sort-object Date | select-object -last 1 | select-DryadJob
$failed = $job.Vertices | where-object { $_.State -eq "Failed" }
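For readers less familiar with PowerShell pipelines, the same query can be sketched in Python: pick the most recent job, then keep its failed vertices. The `Job` and `Vertex` classes and the sample data are hypothetical stand-ins; only the `Date`, `Vertices`, and `State` properties and the "Failed" value come from the slide.

```python
import datetime
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    state: str


@dataclass
class Job:
    name: str
    date: datetime.date
    vertices: list = field(default_factory=list)


def latest_job(jobs):
    """Most recent job; matches 'sort by Date, take the last' when dates are distinct."""
    return max(jobs, key=lambda j: j.date)


def failed_vertices(job):
    """Vertices whose state is "Failed", like the where-object filter."""
    return [v for v in job.vertices if v.state == "Failed"]


# Invented sample data standing in for what the cluster would return.
jobs = [
    Job("old", datetime.date(2011, 5, 1), [Vertex("v0", "Completed")]),
    Job("new", datetime.date(2011, 5, 20),
        [Vertex("v0", "Completed"), Vertex("v1", "Failed")]),
]

job = latest_job(jobs)
failed = failed_vertices(job)
```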

Page 20

Vertex Debugging on Client

Page 21

Vertex Profiling on Client

Page 22

Debugging on Cluster

Collection<T> collection;
var results = from c in collection
              where c.name.Length > 10
              orderby c.age
              select c.name;

[Figure: the Program and the corresponding Job, with a Breakpoint on the clause "where c.name.Length > 10".]
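To spell out what the query computes, here is a plain Python equivalent that runs on a single machine. It separates the filter, sort, and projection steps. The `Person` class and the sample records are invented for illustration, and nothing here reflects how DryadLINQ actually partitions the work.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


def long_names_by_age(collection):
    """Names longer than 10 characters, ordered by the owner's age."""
    filtered = (c for c in collection if len(c.name) > 10)  # where
    ordered = sorted(filtered, key=lambda c: c.age)         # orderby
    return [c.name for c in ordered]                        # select


# Invented sample records.
people = [
    Person("Christopher", 41),
    Person("Ann", 25),
    Person("Maximiliana", 30),
]

results = long_names_by_age(people)
```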

Page 23

Remote debugging

[Figure: architecture diagram. Localhost: Application, DryadLINQ, Visual Studio, Job Submission. Firewall. Cluster/Cloud: Cluster Scheduler, Job Manager (JM), Vertex 1 and Vertex 2, Exec, Storage. Annotations: Breakpoint, "Breakpoint hit…", attach. Legend: L: Logs, IO: Input/Output, R: Resources.]

Page 24

Notifications: Our Implementation

[Figure: architecture diagram. Localhost: Application, DryadLINQ, Visual Studio, Daphne, Job Submission. Firewall. Cluster/Cloud: Cluster Scheduler, Job Manager (JM), Vertex 1 and Vertex 2, Exec, Storage. Annotation: attach. Legend: L: Logs, IO: Input/Output, R: Resources.]

Page 25

Remote debugging

Page 26

Open Problems

• What happens when 100,000 processes hit a breakpoint?

• How to evaluate expressions in the debugger when state is distributed?

• How to do large-scale performance debugging?

• How to preserve the mapping between distributed state and original program state?

• How much can the illusion of a single system be preserved?

Page 27

Conclusions

• Single-machine abstractions break down in the presence of (performance/correctness) bugs

• Job Object Model insulates tools from messy details

• Design the cluster runtime to make it easy to build a JOM

• Rich interactive tools easily built on top of JOM

• Much more work needed for debugging at scale