real-life experiences with grids: it’s not as easy as it looks
Post on 14-Jan-2016
Grid Experiences 1
Real-life experiences with grids: It’s not as easy as it looks
Alain Roy
roy@cs.wisc.edu
University of Wisconsin-Madison
Condor Team
Grid Experiences 2
Who Am I?
• Member of Condor Team
– Experience with Condor
– Experience with grid deployment
• Developer of Virtual Data Toolkit
– Used by GriPhyN, EDG, LCG…
– Packaging of Globus, Condor, etc.
• Collaborator with INFN
– Working with Paolo Mazzanti
– In Bologna for four weeks
Grid Experiences 3
Italy
• Italy is beautiful
• The food is wonderful
• The people are friendly
Grid Experiences 4
Background
• Condor’s environment is a little like a grid
– Not all computers (grid sites) are under Condor’s control
– Computers (grid sites) disappear at the owner’s whim
– Everything changes constantly
• Condor was built to deal with this dynamic environment
• Grid software needs to do the same
Grid Experiences 5
Background
• Late 1980s until today
– Condor developed and deployed on hundreds of sites
– Condor built to deal with failures
• Recently
– Condor-G: your window to the grid
– Condor team has helped deploy grid technology for real use, not just experiments
Grid Experiences 6
Background: Condor
• Condor is a batch job system
• Goal: High throughput computing
– Different than high-performance
• Goal: High reliability
• Goal: Support distributed ownership
Grid Experiences 7
High-Throughput Computing
• Worry about FLOPS/year, not FLOPS/second
• Use all resources effectively
– Dedicated clusters
– Non-dedicated computers (desktop)
Grid Experiences 8
Effective Resource Use
• Requires high reliability
– Computers come and go, your jobs shouldn’t.
– Checkpointing
– Be prepared for everything breaking
• Requires distributed ownership
Grid Experiences 9
Condor-G
• Condor-G submits Globus jobs
• Jobs are in persistent queue
– Unlike globus-job-run
• Jobs are retried on system failures
• Jobs are held on some failures
• Condor-G makes it easy to submit grid jobs
Grid Experiences 10
Background: USCMS
• CMS:
– Detector online in 2007
– Needs to simulate & reconstruct millions of events
• USCMS testbed
– Joint PPDG/GriPhyN effort
– Integrate CMS tools with grid tools
• Globus
• Condor-G
– Contribute real work to CMS
Grid Experiences 11
Background: USCMS
• 7 sites, 250+ CPUs
• Spring 2002: Deploy & test
• Fall 2002
– Last minute production
– 150,000 events in two weeks
– Successful, but lots of work
• Today:
– Wider deployment & use
Grid Experiences 12
Background: DØ
• Experiment at Fermilab
• Already doing real production, real analysis
• Deploying on grid sites today
– Condor-G
– Globus
– SAM
Grid Experiences 13
DØ: Condor-G
• They liked Condor-G
• Condor-G missing a feature:
– Deciding which grid-site to use
• SAM (data handling software) knows where data is located
• SAMGrid:
– Condor-G asks SAM for advice
– Condor-G decides where to run jobs
Grid Experiences 14
DØ: deployment
• Spring: Beginning of deployment
• Late summer: production
• Early results:
– It looks good
– We have more work to do
• Better error reporting
• Better matchmaking
• What will we learn later?
Grid Experiences 15
Problems & Lessons
• During our experiences, we’ve:
– Encountered many problems
– Developed solutions to these problems
– Learned many lessons about grids
• This talk:
– Shares some interesting problems
– Gives some advice & solutions
Grid Experiences 16
Taking a taxi
• How do you take a taxi in Paestum, Italy?
– We don’t need to: walk 4 km there
– The ruins were lovely
– The ruins were outside
– It was about 35°C
– Wife is pregnant
Grid Experiences 17
Use all your resources
• Walk up to storekeeper
• Ask: Dovay Ooon Taxi? (Dove un taxi?)
• Be patient: Wait ten minutes
• Take taxi
• I assumed my resources (local knowledge, Italian) were insufficient, but they saved me time when I used them
Grid Experiences 18
Use all your resources
• Condor:
– Uses dedicated machines (I can walk)
– Uses non-dedicated machines (I can sometimes ask for help)
• Grids:
– Connect your machine rooms
– Can you take advantage of other resources?
– Avoid mentality “I must control all resources”, and you will prosper
Grid Experiences 19
Grid: distributed machine room?
• You can have good control
• You can pre-install applications
• You know how everything works
BUT…
• You lose flexibility
– How quickly can you upgrade sites?
– Did they install everything correctly?
– Can you use new grid sites easily?
Grid Experiences 20
Grid: Use all resources
• Assume: basic grid software is installed
• Assume: nothing else is installed
– Bring your software with you
• Submit one job: install software
• Submit N jobs: use software
– You control software
– You ensure correct installation
• Easy to use any grid site
Grid Experiences 21
Long-running programs
• Long-running programs crash
– Condor has daemons on each machine:
• User (job) agent
• Machine agent
• Matchmaker
– They crash:
• Programming errors
• Network failures
• Disk failures
• …
Grid Experiences 22
Watch programs
• Condor master
– Small program, rarely changed
– Runs Condor daemons
– When daemon crashes:
• Restart daemon, send email
• If it crashes again, restart after backoff
• Result:
– Many errors are silently fixed
– Yet we don’t just ignore crashes
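The restart policy on this slide can be sketched in a few lines of Python. This is an illustration only, not the condor_master implementation: the function name `supervise`, its parameters, and the doubling delay are assumptions, and `notify` stands in for sending email.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, base_delay=1.0,
              run=subprocess.call, notify=print, sleep=time.sleep):
    """Run cmd and restart it whenever it exits with a non-zero status.

    Returns the number of restarts that were needed. run, notify and sleep
    are injectable so the policy can be tried without real processes,
    real email, or real waiting. All names here are hypothetical.
    """
    restarts = 0
    while True:
        status = run(cmd)
        if status == 0:
            return restarts              # clean exit: nothing to repair
        notify(f"{cmd[0]} exited with status {status}")  # never hide a crash
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up on {cmd[0]} after {restarts} restarts")
        if restarts > 0:
            # Repeated crash: wait before restarting, doubling the wait each time.
            sleep(base_delay * 2 ** (restarts - 1))
        restarts += 1
```

The first crash triggers an immediate restart; only repeated crashes wait. Every crash is reported, so errors get fixed quietly without being ignored.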
Grid Experiences 23
Short-running programs
• Short-running programs crash/hang
• Example: globus-url-copy
– USCMS testbed: staging data
– Some fraction of copies hang or fail
– Programming error + delicate network
– Hard to reproduce and fix
Grid Experiences 24
Watch programs
• When copy exceeds timeout, kill and retry
• Possible to do in shell scripting languages, but not easy
• Use Fault Tolerant Shell to watch programs
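For comparison, the same kill-and-retry idea can be sketched in Python instead of a shell language. This is not ftsh and not what the USCMS testbed ran; the function name `run_with_timeout` and its parameters are made up for illustration.

```python
import subprocess


def run_with_timeout(cmd, timeout_s, attempts):
    """Run cmd; kill it if it exceeds timeout_s, and retry up to attempts times.

    Returns True as soon as one attempt exits with status 0, else False.
    """
    for _ in range(attempts):
        try:
            # subprocess.run kills the child process when the timeout expires.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue                     # hung copy: already killed, try again
        if result.returncode == 0:
            return True
    return False
```

A copy command such as globus-url-copy would be passed as the `cmd` list, with a timeout chosen from how long a healthy transfer normally takes.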
Grid Experiences 25
Fault Tolerant Shell
• Shell language built for coping with errors
try for 30 minutes
    wget http://www.example.com/file.tar.gz
    gunzip file.tar.gz
    tar xf file.tar
end
Exponential backoff on failure: Wait {1, 2, 4…} seconds * rand in [1,2]
Grid Experiences 26
FTSH: exponential backoff
• Why exponential backoff?
– What if 100 ftsh scripts are executing?
– Avoid synchronization, reduce load, increase chance of success
– Similar to Ethernet
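The wait rule quoted earlier, {1, 2, 4…} seconds times a random factor in [1,2], can be written out as a small Python generator. This is a sketch of the formula only, not ftsh source; the name `backoff_delays` and its parameters are assumptions.

```python
import random


def backoff_delays(attempts, base=1.0, rng=random.random):
    """Yield the wait before each retry: base * 2**n seconds times a factor in [1, 2]."""
    for n in range(attempts):
        # rng() is in [0, 1), so the random factor spreads retries apart
        # and keeps many clients from retrying in lockstep.
        yield base * (2 ** n) * (1.0 + rng())
```

With 100 scripts failing at the same moment, the random factor keeps their retries from all landing on the server at once, and the doubling reduces load while the fault lasts.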
Grid Experiences 27
Fault Tolerant Shell
• Easier to cope with failures:
try 5 times
    wget http://www.example.com/file.tar.gz
catch
    rm -f file.tar.gz
    failure
end
Clean up partially downloaded file, if it exists
Grid Experiences 28
Fault Tolerant Shell
• Flexible
try for 30 minutes
    try for 5 minutes
        wget http://example.com/file.tar.gz
    end
    try for 1 minute or 3 times
        gunzip file.tar.gz
        tar xf file.tar
    catch
        rm -rf file.tar
    end
end
Inner wget block: cope with network failure
Inner gunzip/tar block: cope with disk failure
Grid Experiences 29
FTSH: More information
• Work of Doug Thain
– thain@cs.wisc.edu
• Excellent paper:
– The Ethernet Approach to Grid Computing, by Doug Thain
– Available from: http://www.cs.wisc.edu/~thain
• Even if you don’t use FTSH, read this paper!
Grid Experiences 30
Whose error is it?
• The source of an error is not always obvious
• The source of an error influences how you react to the error
• Example: Java universe in Condor
Grid Experiences 31
Java Universe
• Users submit Java jobs to Condor
• Whose error is it? Check result code:
– 1: Program dereferenced NULL pointer (job shouldn’t run again)
– 1: Job’s image is corrupt (job shouldn’t run again)
– 1: VM doesn’t have enough memory to run program (try another machine with more memory)
– 1: Java installation is misconfigured (don’t use this machine for Java)
Grid Experiences 32
Don’t trust configuration
• User tells Condor: “Java is installed”
– This is just a hint!
• Condor verifies Java configuration
– Run simple job, verify output
• If Java works, Condor advertises that Java can be used
• If Java fails, error is reported, Java can’t be used
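A minimal Python sketch of "treat configuration as a hint, then verify it": run a trivial probe job and advertise a runtime only if the probe prints the expected output. This is not Condor's code; `probe_works`, `advertise`, and the shape of the configuration dictionary are assumptions for illustration.

```python
import subprocess


def probe_works(cmd, expected_output, timeout_s=30):
    """Run a trivial probe job and check that it prints exactly what we expect."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False                     # missing binary or hung runtime
    return result.returncode == 0 and result.stdout.strip() == expected_output


def advertise(configured_runtimes):
    """Return only the runtimes whose probe succeeded.

    configured_runtimes maps a name to (probe command, expected output).
    What the administrator configured is only a claim until the probe passes.
    """
    return {name for name, (cmd, expected) in configured_runtimes.items()
            if probe_works(cmd, expected)}
```

For Java, the probe command would be something like a java invocation of a tiny known-good class, with its known output as the expected string.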
Grid Experiences 33
Look for error scope
• Add Java wrapper to all Java jobs
– Run program
– Examine return code/exception
– Write all details to file
• Examine output of wrapper, or exception from JVM
– We know if job is bad
– We know if JVM is insufficient for job
– We know if JVM is bad
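The wrapper idea can be illustrated with a Python analogue: run the job inside a wrapper and map what went wrong to a scope, since the scope decides the reaction. The real wrapper is written for Java; the names `Scope`, `ACTIONS`, and `classify`, and the choice of Python exceptions as stand-ins for JVM failures, are assumptions.

```python
from enum import Enum


class Scope(Enum):
    JOB = "job"              # the program itself is at fault
    RESOURCE = "resource"    # this machine cannot satisfy this job
    MACHINE = "machine"      # the runtime on this machine is broken


# Reaction per scope, following the Java Universe slide.
ACTIONS = {
    Scope.JOB: "job should not run again",
    Scope.RESOURCE: "try another machine with more memory",
    Scope.MACHINE: "do not use this machine for Java",
}


def classify(run_job):
    """Run the job inside a wrapper and map any exception to an error scope."""
    try:
        run_job()
    except MemoryError:
        return Scope.RESOURCE
    except ImportError:
        return Scope.MACHINE     # stand-in for a misconfigured installation
    except Exception:
        return Scope.JOB         # for example, the program's own bug
    return None                  # no error
```

A bare exit code of 1 cannot distinguish these cases. The wrapper can, because it sees the exception itself.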
Grid Experiences 34
Error Scope
• We could have an entire talk on error scope
• Excellent paper: Error Scope on a Computational Grid: Theory and Practice, by Doug Thain
• Useful paper even if you don’t use Condor or Java
Grid Experiences 35
Many layers in a grid
[Figure: layers involved in running a job. Condor pool: condor_submit, Condor job agent, Condor matchmaker, execution computer. Grid submission: condor_submit, Condor-G job agent, Globus GRAM, Globus gatekeeper, Globus jobmanager, inetd.]
Grid Experiences 36
We forgot inetd
• We submitted 300 jobs at once
• Inetd noticed many connections per second
• Inetd presumed there was a denial of service attack and refused connections for five minutes
• Lots of debugging!
Grid Experiences 37
There are more layers!
[Figure: USCMS Testbed Architecture (A bit dated). Master site: Impala, MOP, DAGMan, Condor-G. Worker: Globus, batch system (Condor, PBS), real work.]
Grid Experiences 38
More layers than that!
1. MCRunJob
2. Impala
3. MOP
4. condor_schedd
5. DAGMan
6. Condor-G condor_schedd
7. condor_gridmanager
8. gahp_server
9. globus-gatekeeper
10. globus-job-manager
11. globus-job-manager-script.pl
12. local batch system submit
13. local batch system execute
14. MOP wrapper
15. Impala wrapper
16. actual job
This disregards inetd, network, file servers, file transfers…
USCMS Testbed Architecture (A bit dated)
Grid Experiences 39
Recovery at multiple levels
• Fault-tolerance and recovery is built in at many levels:
– Condor_master: restart daemons
– Condor_schedd: job queue
– DAGMan: checkpoint DAG of jobs
– Gahp_server: isolate Globus libraries
– And others…
Grid Experiences 40
Allocate debugging time
• Allocate lots of debugging time
• It is very hard to propagate errors
• How does a user find a remote error?
– Call system administrator
– Admin looks through log files for each layer (not accessible to user)
• We need better debugging methods
Grid Experiences 41
Everything will fail (Everything)
• In the USCMS testbed production:
– Power outage for several hours
– Network outages: a few minutes to 11 hours
– Failed configuration change
– Site upgraded
– Jobs accidentally removed
– Software bugs everywhere
Grid Experiences 42
How do you cope?
• Condor-G:
– Error: job cannot run. This is not good enough
– Resubmit jobs that can be resubmitted, perhaps after a delay
– Put jobs on hold in queue:
• User examines hold reason (proxy is expired)
• User fixes error
• User restarts job
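The resubmit-or-hold policy can be sketched as a small state machine in Python. This is not Condor-G's API: the `Job` class, `handle_failure`, `release`, and the `transient` flag (who decides whether an error is retryable) are all assumptions for illustration.

```python
class Job:
    """A queued job; state is 'idle' (waiting to run) or 'held'."""

    def __init__(self, name):
        self.name = name
        self.state = "idle"
        self.hold_reason = ""
        self.attempts = 0


def handle_failure(job, error, transient, max_attempts=3):
    """Requeue on transient failures; hold the job when a person must step in."""
    job.attempts += 1
    if transient and job.attempts < max_attempts:
        job.state = "idle"               # stays in the queue and is retried
    else:
        job.state = "held"
        job.hold_reason = error          # visible to the user inspecting the queue


def release(job):
    """Called by the user after fixing the cause of the hold."""
    job.state = "idle"
    job.hold_reason = ""
```

Either way the job never leaves the queue: it is retried automatically, or it waits with a stated reason until the user fixes the problem and releases it.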
Grid Experiences 43
Everything will fail (Even the little things)
• Condor Matchmaker:
– Collects descriptions of machines & jobs
– Soft state in matchmaker (push smarts to edge, like Internet)
• UDP packets to advertise machines
– Less overhead than many TCP connections
– Works great in a LAN
• But…
Grid Experiences 44
Everything will fail: UDP
• But you lose some UDP packets
– Send packets every five minutes
– Keep stale information for 15 minutes
– Be prepared to cope with stale information
– This has worked for years in Condor
• DØ: matchmaking on grid
– UDP packets from Korea to Chicago were completely lost on weekdays
– Added TCP option
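The soft-state scheme (advertise every five minutes, keep entries for 15) can be sketched in Python as a table with a time-to-live. This is not the Condor matchmaker's code; `SoftStateTable` and its methods are hypothetical names, and the clock is injectable so expiry can be tried without waiting.

```python
import time


class SoftStateTable:
    """Keep the latest ad per machine; forget machines that stop advertising."""

    def __init__(self, ttl_s=15 * 60, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._ads = {}                   # machine name -> (time received, ad)

    def receive(self, machine, ad):
        """Record an ad; a newer ad simply replaces the older one."""
        self._ads[machine] = (self.clock(), ad)

    def live(self):
        """Drop entries older than the TTL and return the remaining ads."""
        now = self.clock()
        self._ads = {m: (t, ad) for m, (t, ad) in self._ads.items()
                     if now - t <= self.ttl_s}
        return {m: ad for m, (t, ad) in self._ads.items()}
```

With ads every five minutes and a 15-minute lifetime, a machine survives two consecutive lost packets. If every packet is lost, as on the Korea to Chicago link, no lifetime helps, which is why a TCP option was added.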
Grid Experiences 45
Be prepared
• Assume everything will fail
– Have recovery at multiple levels
– Understand scope of errors
– Don’t trust configuration:
• Verify it
• Install & configure software “on the fly”
• Assume bugs are everywhere
• Build software to cope with errors