Mumak: Using Simulation for Large-scale Distributed System Verification and Debugging
Hong Tang
2009.10 - Hadoop User Group
Outline
- Motivations
- Overview and Status
- Architecture
- Demo
- Lessons and Experiences
- Conclusions and Future Work
Motivations
Large-scale distributed systems are hard to verify and debug
- Cannot afford a 2000-node cluster for every developer, feature enhancement, and bug fix
- Time-consuming to run benchmarks
- Hard to reproduce production workloads
- Hard to reproduce corner-case conditions
Motivations (cont.)
The JobTracker is a fertile area for experimentation
- Scheduling policies – we have four schedulers already
- Synergy with HDFS block placement policies
- Speculative execution policies
- We want more people to help us innovate!
But the JobTracker is too complex to modify correctly
- Many factors to consider: fairness, capacity/SLA guarantees, data locality, load balance, failure handling and recovery, etc.
- Many control knobs in the current implementation, with subtle interactions
Mumak
Discrete-event simulation
- Can simulate a cluster with thousands of nodes in one process
- Does not perform actual IO or computation
- Virtual clock “spins” faster than the wall clock
- Can reproduce behavior/performance with a degree of confidence
Plugs in the real JobTracker and Scheduler
- No need to reimplement the scheduling policies
- Inherits both the features and the bugs of the JT and Scheduler
Simulates all conditions of a production cluster
- Workload and cluster configuration generated by Rumen
- Job submission, inter-arrival times, dependencies, high-RAM jobs, task execution
- All kinds of failures and failure-recovery logic
- Resource contention
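The discrete-event core described above can be sketched as a small priority-queue loop with a virtual clock; the class and method names below are hypothetical illustrations, not Mumak's actual API:

```java
import java.util.PriorityQueue;

// Minimal sketch of a discrete-event simulator with a virtual clock.
// Because no real IO or computation happens, the virtual clock jumps
// straight from event to event, far faster than the wall clock.
class SimSketch {
    // An event carries the virtual time at which it fires.
    interface Event {
        long time();
        void fire(SimSketch engine);
    }

    private final PriorityQueue<Event> queue =
        new PriorityQueue<>((a, b) -> Long.compare(a.time(), b.time()));
    private long virtualClock = 0;

    void schedule(Event e) { queue.add(e); }

    long now() { return virtualClock; }

    // Main loop: pop the earliest event, jump the clock to it, fire it.
    void run() {
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            virtualClock = e.time();
            e.fire(this);
        }
    }

    public static void main(String[] args) {
        SimSketch sim = new SimSketch();
        // Events may be scheduled out of order; the queue sorts them.
        sim.schedule(new Event() {
            public long time() { return 200; }
            public void fire(SimSketch s) {
                System.out.println("t=" + s.now() + " heartbeat");
            }
        });
        sim.schedule(new Event() {
            public long time() { return 100; }
            public void fire(SimSketch s) {
                System.out.println("t=" + s.now() + " job submitted");
            }
        });
        sim.run();
        // prints "t=100 job submitted" then "t=200 heartbeat"
    }
}
```

A single process iterating such a queue can stand in for thousands of nodes, which is what makes the one-process simulation above feasible.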
Project Status
Work in progress; first-cut version committed to Hadoop 0.21
- Basic framework
- Simplified task execution
- No modeling of resource utilization or contention
- Only individual task failures; no node failures or failure correlations
- No job dependencies or speculative execution
The Team
- Core devs: Arun Murthy, Anirban Dasgupta, Tamas Sarlos, Guanying Wang, Hong Tang
- Collaborators: Dick King, Chris Douglas, Owen O’Malley
Architecture
[Architecture diagram: a Simulation Engine drives an Event Queue of JobSubmissionEvents, HeartBeatEvents, TaskAttemptCompletionEvents, and JobCompletionEvents. The Simulated Job Tracker embeds the real Job Tracker and Scheduler; Simulated Task Trackers talk to it over InterTrackerProtocol, and a Simulated Job Client (standing in for the real Job Clients) talks to it over ClientProtocol. Rumen supplies the Cluster Story and the Job Story Trace, which feeds a Job Story Cache; job finalization closes the loop.]
DEMO
Build hadoop-mapreduce:
% ant package
Run with the checked-in traces:
% cd build/hadoop-0.22.0-dev
% contrib/mumak/bin/mumak.sh \
    src/contrib/mumak/src/test/data/19-jobs.trace.json.gz \
    src/contrib/mumak/src/test/data/19-jobs.topology.json.gz
Implementation Experience
The JobTracker is reasonably modular and amenable to a simulated environment
- RPC, Clock, and DNS-switch mapping are all interfaces
- No sleep() in the main JT code
Usage of threads is localized and easy to factor out
- Asynchronous job initialization: make it synchronous (AspectJ)
Inheritance is necessary to extend/alter behavior
- JobTracker, JobInProgress, LaunchTaskAction, TaskTrackerStatus
- Convey extra information: virtual time, task execution time, etc.
- Keeping up with changes to the base classes may be hard
  • Example: a new variable added to JobTracker
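The "Clock is an interface" point above is what makes virtual time injectable. A sketch of the pattern (illustrative names; the real Hadoop classes differ in detail):

```java
// Sketch of the clock-as-interface pattern that lets a simulator swap
// wall-clock time for virtual time (illustrative, not the exact Hadoop API).
interface Clock {
    long getTime();
}

// Production implementation: real wall-clock time.
class WallClock implements Clock {
    public long getTime() { return System.currentTimeMillis(); }
}

// Simulation implementation: time only moves when the engine advances it,
// which is how the virtual clock can "spin" faster than the wall clock.
class SimulatedClock implements Clock {
    private long now;
    public long getTime() { return now; }
    public void advanceTo(long t) { now = Math.max(now, t); }
}

// Any component written against Clock runs unchanged in both modes,
// e.g. a tracker-expiry check inside the JobTracker.
class ExpiryTracker {
    private final Clock clock;
    private final long timeoutMs;
    ExpiryTracker(Clock clock, long timeoutMs) {
        this.clock = clock;
        this.timeoutMs = timeoutMs;
    }
    boolean expired(long lastHeartbeat) {
        return clock.getTime() - lastHeartbeat > timeoutMs;
    }
}
```

Code that instead calls System.currentTimeMillis() directly (or sleeps) cannot be driven this way, which is why "no sleep() in the main JT code" mattered.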
- Make the dependency between map and reduce tasks explicit
Mumak as a System Behavior Verifier
Mumak as a JobTracker Debugger
MAPREDUCE-995: “JobHistory should handle cases where task completion events are generated after job completion event”
- Discovered when testing the Mumak patch for the 0.21 submission
- Introduced by MAPREDUCE-157, committed one day earlier
- Manifested as a JobTracker crash due to an IOException
Root cause analysis
- The developer made a wrong assumption about the timing of events
  • Assumed that once a job is marked as finished, no more heartbeat events related to the job would follow
- Led to a Closeable object being used after it was closed
- To reproduce through benchmarking: need to inject a failed job and encounter “good” timing, where an outstanding task completes after the job is marked as failed
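The failure mode can be reconstructed with a toy version of the pattern (hypothetical classes, not the actual JobHistory code): a per-job writer is closed on job completion, then a straggling task-completion event writes to it.

```java
import java.io.Closeable;
import java.io.IOException;

// Toy reconstruction of the MAPREDUCE-995 failure mode
// (hypothetical classes, not the actual JobHistory code).
class HistoryWriter implements Closeable {
    private boolean closed = false;

    void logEvent(String event) throws IOException {
        if (closed) {
            throw new IOException("history writer used after close: " + event);
        }
        // ... append the event to the job history file ...
    }

    public void close() { closed = true; }
}

class Repro {
    // Returns true if the late event triggered the IOException.
    static boolean lateEventAfterJobCompletion() {
        HistoryWriter w = new HistoryWriter();
        try {
            w.logEvent("TASK_STARTED");
            w.close();                    // job marked finished: writer closed
            w.logEvent("TASK_FINISHED");  // straggler heartbeat delivers this
            return false;
        } catch (IOException crash) {
            return true;                  // the crash the simulation exposed
        }
    }
}
```

On a real cluster this ordering needs unlucky timing; in a simulator the event queue can deliver the two events in exactly this order, deterministically, on every run.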
Mumak as a JobTracker Profiling Benchmark
Memory allocation pattern is similar to the real JobTracker, but at a much faster rate
Mumak overhead is less than 20-30%
Limitations: cannot detect synchronization hotspots or sub-optimal IO or network operations
Findings through YourKit profiling
- Wasteful String concatenations in Log.debug() statements in mapred.ResourceEstimator.getEstimatedTotalMapOutputSize
- Repetitive parsing of TaskTracker names to extract hostnames
- Unnecessary exceptions from counter localization due to a removed properties file (regression introduced by H-5717)
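The first finding above is the classic eager-argument logging pitfall: the message String is concatenated even when debug logging is off. A minimal stand-in logger (the real code uses Apache Commons Logging) shows the problem and the guard:

```java
// Minimal stand-in for a logger (the real code uses Apache Commons Logging).
class Log {
    private final boolean debugEnabled;
    Log(boolean debugEnabled) { this.debugEnabled = debugEnabled; }
    boolean isDebugEnabled() { return debugEnabled; }
    void debug(Object msg) { if (debugEnabled) System.out.println(msg); }
}

class EstimatorSketch {
    static int buildCount = 0;  // counts how often the message is built

    static String buildMessage(long size) {
        buildCount++;  // stands in for the cost of the String concatenation
        return "estimated total map output size = " + size;
    }

    // Wasteful: the argument is evaluated even when debug is off.
    static void wasteful(Log log, long size) {
        log.debug(buildMessage(size));
    }

    // Guarded: skip building the message entirely when debug is off.
    static void guarded(Log log, long size) {
        if (log.isDebugEnabled()) {
            log.debug(buildMessage(size));
        }
    }
}
```

On a hot path such as an output-size estimator called on every heartbeat, the wasted allocations show up clearly in an allocation profiler, which is exactly how this was found.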
Conclusions
Mumak: a lightweight, versatile tool for MapReduce verification and debugging
- Verification of overall system behavior
- A debugger for the JobTracker / scheduler
- A micro-benchmark to stress CPU and memory allocation
- Strengths:
  • Easy to set up and run
  • Faster than running a real benchmark: 1 min ≈ 2 hrs on a 2000-node cluster
  • Realistically reproduces conditions and tests actual code
  • Can easily generate variants of the ordering of distributed events
- Limitations: no simulation of system services or threads
  • Cannot debug synchronization problems among threads
  • Cannot reproduce OS-induced failures
What Next?
Simulate more conditions
- Speculative execution
- Resource contention
- Node failures
- Job dependencies
Debug issues that do not result in hard-stop failures
- Fairness violations, starvation, utilization problems
Patch validation: before-and-after comparison
- Making sure a patch does what it is supposed to do, and does not introduce negative side effects
Use Mumak to stage unit tests
- Construct test cases by building synthetic job stories
QUESTIONS?