Monitoring and Debugging Dryad(LINQ) Applications with Daphne
Vilas Jagannath, Zuoning Yin, Mihai Budiu
University of Illinois, Microsoft Research SVC
International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) 2011
Programming Clusters: Marketing
Map-Reduce
Programming Clusters: Reality
Complexity Exposed
Correctness or performance bugs break the single-system abstraction
Outline
• Motivation
• Job structure
• The Job Object Model
• Tools for job understanding
• Conclusions
Data-Parallel Computation
Application: Sawzall, Java | ≈SQL | LINQ, SQL
Language: Sawzall, FlumeJava | Pig, Hive | DryadLINQ, Scope
Execution: Map-Reduce | Hadoop | Dryad
Storage: GFS, BigTable | HDFS, S3 | Cosmos, Azure, HPC
2-D Piping
• Unix Pipes: 1-D
  grep | sed | sort | awk | perl
• Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
  (the exponent is the number of parallel instances of each stage)
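The 2-D pipe idea can be sketched in a few lines of Python. This is only an illustration, not how Dryad is implemented: the function names `partition_of` and `run_stage`, the CRC-based partitioning, and the toy stage widths (3, 2, 1 instead of the 1000/500/1000 style counts attached to each command on the slide) are all assumptions made for the example.

```python
import zlib

def partition_of(record: str, width: int) -> int:
    # Deterministic hash so the example behaves the same on every run.
    return zlib.crc32(record.encode("utf-8")) % width

def run_stage(inputs, vertex_fn, next_width):
    """Run vertex_fn once per input partition (one 'vertex' each) and
    redistribute its output records over next_width partitions."""
    outputs = [[] for _ in range(next_width)]
    for part in inputs:
        for record in vertex_fn(part):
            outputs[partition_of(record, next_width)].append(record)
    return outputs

def grep(lines):
    # Keep only lines that mention "error".
    return [line for line in lines if "error" in line]

def sed(lines):
    # Rewrite "error" to "ERR".
    return [line.replace("error", "ERR") for line in lines]

def sort(lines):
    return sorted(lines)

lines = ["error disk", "ok", "error net", "ok", "error cpu", "warn"]
inputs = [lines[0:2], lines[2:4], lines[4:6]]  # 3 grep vertices
after_grep = run_stage(inputs, grep, 2)        # feeds 2 sed vertices
after_sed = run_stage(after_grep, sed, 1)      # feeds 1 sort vertex
result = run_stage(after_sed, sort, 1)[0]
print(result)
```

Each call to `run_stage` is one column of the 2-D pipeline: the first dimension is the sequence of stages, the second is the set of vertices running the same code on different partitions.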
Dryad Job Structure
[Figure: a job graph. Input files feed vertices (processes) running grep, sed, sort, awk and perl; vertices are connected by channels and grouped into stages; the last stage writes the output files.]
Dryad System Architecture
[Figure: the job manager holds the job schedule and drives the cluster over the control plane (NS, Sched, and an Exec on each machine); vertices (V) exchange data over the data plane.]
How does it work in detail?
[Figure: on the localhost, the IDE, application and compiler perform job submission through the firewall to the cluster scheduler in the cluster/cloud. The job manager (JM) and each vertex run on an Exec node with storage. Legend: L: Logs, IO: Input/Output, R: Resources.]
Logs – lots of them
• Job-related – Plan (xml), status, resources
• Job-manager – stdout.txt, stderr.txt, *.log
• Vertex – stdout.txt, *.log, *.xml, *.cmd
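As a rough illustration of what a tool has to do with these files, the sketch below groups log paths by the component that produced them, using the file patterns listed on the slide. The directory layout (`job-manager/...`, `vertex/<id>/...`) and the names `PATTERNS` and `group_logs` are hypothetical; the real on-disk layout depends on the cluster.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# File patterns as listed on the slide; the directory names are assumptions.
PATTERNS = {
    "job-manager": ["stdout.txt", "stderr.txt", "*.log"],
    "vertex": ["stdout.txt", "*.log", "*.xml", "*.cmd"],
}

def group_logs(paths):
    """Group paths by their top-level directory when the file name matches
    a pattern expected for that component; everything else is 'unknown'."""
    groups = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        kind, name = parts[0], parts[-1]
        if any(fnmatchcase(name, pat) for pat in PATTERNS.get(kind, [])):
            groups[kind].append(path)
        else:
            groups["unknown"].append(path)
    return dict(groups)

paths = [
    "job-manager/stdout.txt",
    "job-manager/jm.log",
    "vertex/17/stdout.txt",
    "vertex/17/vertex.cmd",
    "vertex/17/core.dmp",
]
groups = group_logs(paths)
print(groups)
```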
Monitoring Tools Structure
[Figure: layers, from the cluster up: Cosmos, Scope, HPC v2, HPC v3; cluster abstraction; Job Object Model; monitoring, profiling and debugging; GUIs.]
Job Object Model
[Figure: the JOM sits between the logs and the views and tools built on it. Labels: Logs; JOM; Views; Job, Vertices, Plan; Tools.]
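A minimal sketch of the idea, assuming a made-up log format: the tool-facing objects (`Job`, `Vertex`) are plain data, and only `build_job` knows how the logs look. The class names, fields and the `vertex=... stage=... state=...` line format are all hypothetical and are not Daphne's actual JOM.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    stage: str
    state: str

@dataclass
class Job:
    name: str
    vertices: list = field(default_factory=list)

    def stages(self):
        """Group the job's vertices by stage name."""
        grouped = {}
        for vertex in self.vertices:
            grouped.setdefault(vertex.stage, []).append(vertex)
        return grouped

def build_job(name, log_lines):
    """Build a Job from hypothetical 'key=value' log lines.
    A later line for the same vertex replaces its earlier state."""
    latest = {}
    for line in log_lines:
        fields = dict(token.split("=", 1) for token in line.split())
        latest[fields["vertex"]] = Vertex(
            fields["vertex"], fields["stage"], fields["state"]
        )
    return Job(name, list(latest.values()))

log = [
    "vertex=v1 stage=grep state=Running",
    "vertex=v2 stage=grep state=Running",
    "vertex=v1 stage=grep state=Completed",
    "vertex=v2 stage=grep state=Failed",
    "vertex=v3 stage=sort state=Running",
]
job = build_job("demo", log)
failed = [v.name for v in job.vertices if v.state == "Failed"]
print(failed, sorted(job.stages()))
```

Tools written against `Job` and `Vertex` never touch the log format, so supporting a different cluster would only mean writing another `build_job`.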
Outline
• Motivation
• Job structure
• The Job Object Model
• Tools for job understanding
• Conclusions
The Job Browser
[Screenshot labels: Job, Stage, Vertex]
Job Schedule
Failure diagnosis
Diagnosis decision tree
• “Hand-made”
• Least portable tool
• Incomplete
• High-coverage
• Bug types:
  – User-level
  – System-level
  – Cluster malfunction
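The shape of such a hand-made decision tree might look like the sketch below. The three outcomes come from the bug types on the slide; every signal it tests (`machine_unreachable`, `exception_in_user_code`, `exit_code`) and the order of the tests are invented for illustration and say nothing about the rules Daphne actually uses.

```python
def diagnose(vertex: dict) -> str:
    """Classify a failed vertex using hypothetical signals.
    Falls through to 'undiagnosed' because the tree is incomplete."""
    if vertex.get("machine_unreachable"):
        return "cluster malfunction"
    if vertex.get("exception_in_user_code"):
        return "user-level"
    if vertex.get("exit_code", 0) != 0:
        return "system-level"
    return "undiagnosed"

examples = [
    {"machine_unreachable": True},
    {"exception_in_user_code": True, "exit_code": 1},
    {"exit_code": 137},
    {},
]
verdicts = [diagnose(v) for v in examples]
print(verdicts)
```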
Powershell = Interactive Queries
$cluster = get-cluster X
$job = $cluster | select-AllJobs | sort-object Date | select-object -last 1 | select-DryadJob
$failed = $job.Vertices | where-object { $_.State -eq "Failed" }
Vertex Debugging on Client
Vertex Profiling on Client
Debugging on Cluster
Collection<T> collection;
var results = from c in collection
              where c.name.Length > 10
              orderby c.age
              select c.name;
[Figure labels: Program, Job, Breakpoint. The breakpoint is set on the line "where c.name.Length > 10".]
Remote debugging
[Figure: Visual Studio on the localhost (DryadLINQ application, job submission) attaches through the firewall to Vertex 1 and Vertex 2 in the cluster/cloud when a breakpoint is hit. The cluster side shows the cluster scheduler, the job manager (JM) and Exec nodes with storage. Legend: L: Logs, IO: Input/Output, R: Resources.]
Notifications: Our Implementation
[Figure: the remote debugging layout with Daphne added on the localhost alongside Visual Studio and the DryadLINQ application; Visual Studio attaches through the firewall to Vertex 1 and Vertex 2 for remote debugging. The cluster side shows the cluster scheduler, the job manager (JM) and Exec nodes with storage. Legend: L: Logs, IO: Input/Output, R: Resources.]
Open Problems
• What happens when 100,000 processes hit a breakpoint?
• How to evaluate expressions in the debugger when state is distributed?
• How to do large-scale performance debugging?
• How to preserve the mapping between distributed state and original program state?
• How much can the illusion of a single system be preserved?
Conclusions
• Single-machine abstractions break down in the presence of (performance/correctness) bugs
• Job Object Model insulates tools from messy details
• Design the cluster runtime to make it easy to build a JOM
• Rich interactive tools easily built on top of JOM
• Much more work needed for debugging at scale