Mumak: Using Simulation for Large-scale Distributed System Verification and Debugging
Hong Tang
2009.10 - Hadoop User Group
Outline
- Motivations
- Overview and Status
- Architecture
- Demo
- Lessons and Experiences
- Conclusions and Future Work
Motivations
Large-scale distributed systems are hard to verify and debug
- Cannot afford a 2000-node cluster for every developer, feature enhancement, and bug fix
- Time-consuming to run benchmarks
- Hard to reproduce production workloads
- Hard to reproduce corner-case conditions
Motivations (cont.)
The JobTracker is a fertile area for experimentation
- Scheduling policies – we have four schedulers already
- Synergy with HDFS block placement policies
- Speculative execution policies
- We want more people to help us innovate!
But the JobTracker is too complex to modify correctly
- Many factors to consider: fairness, capacity/SLA guarantees, data locality, load balance, failure handling and recovery, etc.
- Many control knobs in the current implementation, with subtle interactions
Mumak
Discrete-event simulation
- Can simulate a cluster with thousands of nodes in one process
- Does not perform actual IO or computation
- Virtual clock “spins” faster than the wall clock
- Can reproduce behavior/performance with a degree of confidence
Plugs in the real JobTracker and Scheduler
- No need to reimplement the scheduling policies
- Inherits both the features and the bugs of the JT and Scheduler
Simulates all conditions of a production cluster
- Workload and cluster configuration generated by Rumen
- Job submission, inter-arrival times, dependencies, high-RAM jobs, task execution
- All kinds of failures and failure-recovery logic
- Resource contention
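The discrete-event core described above can be sketched as a small priority-queue loop with a virtual clock; the class and method names below are hypothetical illustrations, not Mumak's actual API:

```java
import java.util.PriorityQueue;

// Minimal sketch of a discrete-event simulator with a virtual clock.
// Because no real IO or computation happens, the virtual clock jumps
// straight from event to event, far faster than the wall clock.
class SimSketch {
    // An event carries the virtual time at which it fires.
    interface Event {
        long time();
        void fire(SimSketch engine);
    }

    private final PriorityQueue<Event> queue =
        new PriorityQueue<>((a, b) -> Long.compare(a.time(), b.time()));
    private long virtualClock = 0;

    void schedule(Event e) { queue.add(e); }

    long now() { return virtualClock; }

    // Main loop: pop the earliest event, jump the clock to it, fire it.
    void run() {
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            virtualClock = e.time();
            e.fire(this);
        }
    }

    public static void main(String[] args) {
        SimSketch sim = new SimSketch();
        // Events may be scheduled out of order; the queue sorts them.
        sim.schedule(new Event() {
            public long time() { return 200; }
            public void fire(SimSketch s) {
                System.out.println("t=" + s.now() + " heartbeat");
            }
        });
        sim.schedule(new Event() {
            public long time() { return 100; }
            public void fire(SimSketch s) {
                System.out.println("t=" + s.now() + " job submitted");
            }
        });
        sim.run();
        // prints "t=100 job submitted" then "t=200 heartbeat"
    }
}
```

A single process iterating such a queue can stand in for thousands of nodes, which is what makes the one-process simulation above feasible.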
Project Status
Work in progress; first-cut version committed to Hadoop 0.21
- Basic framework
- Simplified task execution
- No modeling of resource utilization or contention
- Only individual task failures; no node failures or failure correlations
- No job dependencies or speculative execution
The Team
- Core devs: Arun Murthy, Anirban Dasgupta, Tamas Sarlos, Guanying Wang, Hong Tang
- Collaborators: Dick King, Chris Douglas, Owen O’Malley
Architecture
[Architecture diagram: a Simulation Engine drives an Event Queue of JobSubmissionEvents, HeartBeatEvents, TaskAttemptCompletionEvents, and JobCompletionEvents. The Simulated Job Tracker embeds the real Job Tracker and Scheduler; Simulated Task Trackers talk to it over InterTrackerProtocol, and a Simulated Job Client (standing in for the real Job Clients) talks to it over ClientProtocol. Rumen supplies the Cluster Story and the Job Story Trace, which feeds a Job Story Cache; job finalization closes the loop.]
DEMO
Build hadoop-mapreduce:
% ant package
Run with the checked-in traces:
% cd build/hadoop-0.22.0-dev
% contrib/mumak/bin/mumak.sh \
    src/contrib/mumak/src/test/data/19-jobs.trace.json.gz \
    src/contrib/mumak/src/test/data/19-jobs.topology.json.gz
Implementation Experience
The JobTracker is reasonably modular and amenable to a simulated environment
- RPC, Clock, and DNS-switch mapping are all interfaces
- No sleep() in the main JT code
Usage of threads is localized and easy to factor out
- Asynchronous job initialization: make it synchronous (AspectJ)
Inheritance is necessary to extend/alter behavior
- JobTracker, JobInProgress, LaunchTaskAction, TaskTrackerStatus
- Convey extra information: virtual time, task execution time, etc.
- Keeping up with changes to the base classes may be hard
  • Example: a new variable added to JobTracker
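The "Clock is an interface" point above is what makes virtual time injectable. A sketch of the pattern (illustrative names; the real Hadoop classes differ in detail):

```java
// Sketch of the clock-as-interface pattern that lets a simulator swap
// wall-clock time for virtual time (illustrative, not the exact Hadoop API).
interface Clock {
    long getTime();
}

// Production implementation: real wall-clock time.
class WallClock implements Clock {
    public long getTime() { return System.currentTimeMillis(); }
}

// Simulation implementation: time only moves when the engine advances it,
// which is how the virtual clock can "spin" faster than the wall clock.
class SimulatedClock implements Clock {
    private long now;
    public long getTime() { return now; }
    public void advanceTo(long t) { now = Math.max(now, t); }
}

// Any component written against Clock runs unchanged in both modes,
// e.g. a tracker-expiry check inside the JobTracker.
class ExpiryTracker {
    private final Clock clock;
    private final long timeoutMs;
    ExpiryTracker(Clock clock, long timeoutMs) {
        this.clock = clock;
        this.timeoutMs = timeoutMs;
    }
    boolean expired(long lastHeartbeat) {
        return clock.getTime() - lastHeartbeat > timeoutMs;
    }
}
```

Code that instead calls System.currentTimeMillis() directly (or sleeps) cannot be driven this way, which is why "no sleep() in the main JT code" mattered.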
- Make the dependency between map and reduce tasks explicit
Mumak as a System Behavior Verifier
Mumak as a JobTracker Debugger
MAPREDUCE-995: “JobHistory should handle cases where task completion events are generated after job completion event”
- Discovered when testing the Mumak patch for the 0.21 submission
- Introduced by MAPREDUCE-157, committed one day earlier
- Manifested as a JobTracker crash due to an IOException
Root cause analysis
- The developer made a wrong assumption about the timing of events
  • Assumed that once a job is marked as finished, no more heartbeat events related to the job would follow
- Led to a Closeable object being used after it was closed
- To reproduce through benchmarking: need to inject a failed job and encounter “good” timing, where an outstanding task completes after the job is marked as failed
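The failure mode can be reconstructed with a toy version of the pattern (hypothetical classes, not the actual JobHistory code): a per-job writer is closed on job completion, then a straggling task-completion event writes to it.

```java
import java.io.Closeable;
import java.io.IOException;

// Toy reconstruction of the MAPREDUCE-995 failure mode
// (hypothetical classes, not the actual JobHistory code).
class HistoryWriter implements Closeable {
    private boolean closed = false;

    void logEvent(String event) throws IOException {
        if (closed) {
            throw new IOException("history writer used after close: " + event);
        }
        // ... append the event to the job history file ...
    }

    public void close() { closed = true; }
}

class Repro {
    // Returns true if the late event triggered the IOException.
    static boolean lateEventAfterJobCompletion() {
        HistoryWriter w = new HistoryWriter();
        try {
            w.logEvent("TASK_STARTED");
            w.close();                    // job marked finished: writer closed
            w.logEvent("TASK_FINISHED");  // straggler heartbeat delivers this
            return false;
        } catch (IOException crash) {
            return true;                  // the crash the simulation exposed
        }
    }
}
```

On a real cluster this ordering needs unlucky timing; in a simulator the event queue can deliver the two events in exactly this order, deterministically, on every run.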
Mumak as a JobTracker Profiling Benchmark
Memory allocation pattern is similar to the real JobTracker, but at a much faster rate
Mumak overhead is less than 20-30%
Limitations: cannot detect synchronization hotspots or sub-optimal IO or network operations
Findings through YourKit profiling
- Wasteful String concatenations in Log.debug() statements in mapred.ResourceEstimator.getEstimatedTotalMapOutputSize
- Repetitive parsing of TaskTracker names to extract hostnames
- Unnecessary exceptions from counter localization due to a removed properties file (regression introduced by H-5717)
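The first finding above is the classic eager-argument logging pitfall: the message String is concatenated even when debug logging is off. A minimal stand-in logger (the real code uses Apache Commons Logging) shows the problem and the guard:

```java
// Minimal stand-in for a logger (the real code uses Apache Commons Logging).
class Log {
    private final boolean debugEnabled;
    Log(boolean debugEnabled) { this.debugEnabled = debugEnabled; }
    boolean isDebugEnabled() { return debugEnabled; }
    void debug(Object msg) { if (debugEnabled) System.out.println(msg); }
}

class EstimatorSketch {
    static int buildCount = 0;  // counts how often the message is built

    static String buildMessage(long size) {
        buildCount++;  // stands in for the cost of the String concatenation
        return "estimated total map output size = " + size;
    }

    // Wasteful: the argument is evaluated even when debug is off.
    static void wasteful(Log log, long size) {
        log.debug(buildMessage(size));
    }

    // Guarded: skip building the message entirely when debug is off.
    static void guarded(Log log, long size) {
        if (log.isDebugEnabled()) {
            log.debug(buildMessage(size));
        }
    }
}
```

On a hot path such as an output-size estimator called on every heartbeat, the wasted allocations show up clearly in an allocation profiler, which is exactly how this was found.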
Conclusions
Mumak: a lightweight, versatile tool for MapReduce verification and debugging
- Verification of overall system behavior
- A debugger for the JobTracker / scheduler
- A micro-benchmark to stress CPU and memory allocation
- Strengths:
  • Easy to set up and run
  • Faster than running a real benchmark: 1 min ≈ 2 hrs on a 2000-node cluster
  • Realistically reproduces conditions and tests actual code
  • Can easily generate variants of the ordering of distributed events
- Limitations: no simulation of system services or threads
  • Cannot debug synchronization problems among threads
  • Cannot reproduce OS-induced failures
What Next?
Simulate more conditions
- Speculative execution
- Resource contention
- Node failures
- Job dependencies
Debug issues that do not result in hard-stop failures
- Fairness violations, starvation, utilization problems
Patch validation: before-and-after comparison
- Making sure a patch does what it is supposed to do, and does not introduce negative side effects
Use Mumak to stage unit tests
- Construct test cases by building synthetic job stories
QUESTIONS?