Real-life experiences with grids: It’s not as easy as it looks

Post on 14-Jan-2016



TRANSCRIPT

Grid Experiences 1

Real-life experiences with grids: It’s not as easy as it looks

Alain Roy
roy@cs.wisc.edu

University of Wisconsin-Madison
Condor Team

Grid Experiences 2

Who Am I?

• Member of Condor Team
  – Experience with Condor
  – Experience with grid deployment
• Developer of Virtual Data Toolkit
  – Used by GriPhyN, EDG, LCG…
  – Packaging of Globus, Condor, etc.
• Collaborator with INFN
  – Working with Paolo Mazzanti
  – In Bologna for four weeks

Grid Experiences 3

Italy

• Italy is beautiful
• The food is wonderful
• The people are friendly

Grid Experiences 4

Background

• Condor’s environment is a little like a grid

– Not all computers (grid sites) are under Condor’s control

– Computers (grid sites) disappear at the owner’s whim

– Everything changes constantly

• Condor was built to deal with this dynamic environment

• Grid software needs to do the same

Grid Experiences 5

Background

• Late 1980s until today
  – Condor developed and deployed on hundreds of sites
  – Condor built to deal with failures
• Recently
  – Condor-G: your window to the grid
  – Condor team has helped deploy grid technology for real use, not just experiments

Grid Experiences 6

Background: Condor

• Condor is a batch job system
• Goal: High-throughput computing
  – Different from high-performance computing
• Goal: High reliability
• Goal: Support distributed ownership

Grid Experiences 7

High-Throughput Computing

• Worry about FLOPS/year, not FLOPS/second

• Use all resources effectively
  – Dedicated clusters
  – Non-dedicated computers (desktop)

Grid Experiences 8

Effective Resource Use

• Requires high reliability: computers come and go, but your jobs shouldn’t
  – Checkpointing
  – Be prepared for everything breaking

• Requires distributed ownership

Grid Experiences 9

Condor-G

• Condor-G submits Globus jobs
• Jobs are in a persistent queue
  – Unlike globus-job-run
• Jobs are retried on system failures
• Jobs are held on some failures
• Condor-G makes it easy to submit grid jobs

Grid Experiences 10

Background: USCMS

• CMS:
  – Detector online in 2007
  – Needs to simulate & reconstruct millions of events
• USCMS testbed
  – Joint PPDG/GriPhyN effort
  – Integrate CMS tools with grid tools
    • Globus
    • Condor-G
  – Contribute real work to CMS

Grid Experiences 11

Background: USCMS

• 7 sites, 250+ CPUs
• Spring 2002: Deploy & test
• Fall 2002
  – Last-minute production
  – 150,000 events in two weeks
  – Successful, but lots of work
• Today:
  – Wider deployment & use

Grid Experiences 12

Background: DØ

• Experiment at Fermilab
• Already doing real production, real analysis
• Deploying on grid sites today
  – Condor-G
  – Globus
  – SAM

Grid Experiences 13

DØ: Condor-G

• They liked Condor-G
• Condor-G was missing a feature:
  – Deciding which grid site to use
• SAM (data handling software) knows where data is located
• SAMGrid:
  – Condor-G asks SAM for advice
  – Condor-G decides where to run jobs

Grid Experiences 14

DØ: deployment

• Spring: Beginning of deployment
• Late summer: production
• Early results:
  – It looks good
  – We have more work to do
    • Better error reporting
    • Better matchmaking

• What will we learn later?

Grid Experiences 15

Problems & Lessons

• During our experiences, we’ve:
  – Encountered many problems
  – Developed solutions to these problems
  – Learned many lessons about grids
• This talk:
  – Shares some interesting problems
  – Gives some advice & solutions

Grid Experiences 16

Taking a taxi

• How do you take a taxi in Paestum, Italy?
  – We don’t need to: walk 4 km there
  – The ruins were lovely
  – The ruins were outside
  – It was about 35°C
  – Wife is pregnant

Grid Experiences 17

Use all your resources

• Walk up to storekeeper
• Ask: “Dovay Ooon Taxi?” (Dov’è un taxi?)
• Be patient: Wait ten minutes
• Take taxi
• I assumed my resources (local knowledge, Italian) were insufficient, but they saved me time when I used them

Grid Experiences 18

Use all your resources

• Condor:
  – Uses dedicated machines (I can walk)
  – Uses non-dedicated machines (I can sometimes ask for help)
• Grids:
  – Connect your machine rooms
  – Can you take advantage of other resources?
  – Avoid the mentality “I must control all resources”, and you will prosper

Grid Experiences 19

Grid: distributed machine room?

• You can have good control
• You can pre-install applications
• You know how everything works

BUT…

• You lose flexibility
  – How quickly can you upgrade sites?
  – Did they install everything correctly?
  – Can you use new grid sites easily?

Grid Experiences 20

Grid: Use all resources

• Assume: basic grid software is installed

• Assume: nothing else is installed
  – Bring your software with you
• Submit one job: install software
• Submit N jobs: use software
  – You control software
  – You ensure correct installation

• Easy to use any grid site

Grid Experiences 21

Long-running programs

• Long-running programs crash
  – Condor has daemons on each machine:
    • User (job) agent
    • Machine agent
    • Matchmaker
  – They crash:
    • Programming errors
    • Network failures
    • Disk failures
    • …

Grid Experiences 22

Watch programs

• Condor master
  – Small program, rarely changed
  – Runs Condor daemons
  – When daemon crashes:
    • Restart daemon, send email
    • If it crashes again, restart after backoff
• Result:
  – Many errors are silently fixed
  – Yet we don’t just ignore crashes
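The restart policy on this slide can be sketched in Python. This is only an illustration, not the condor_master implementation: the function `supervise`, its parameters, and the doubling delay are assumptions, and `notify` stands in for sending email.

```python
import subprocess
import time


def supervise(cmd, max_restarts=5, base_delay=1.0,
              run=subprocess.call, sleep=time.sleep, notify=print):
    """Run cmd and restart it whenever it exits with a non-zero status.

    The first restart is immediate; later restarts wait base_delay,
    2 * base_delay, 4 * base_delay, ... seconds. Every crash is reported
    through notify, so failures are repaired but never ignored.
    Returns the number of crashes seen before a clean exit.
    """
    crashes = 0
    while True:
        code = run(cmd)
        if code == 0:
            return crashes
        crashes += 1
        notify(f"{cmd[0]} exited with status {code} (crash {crashes})")
        if crashes > max_restarts:
            raise RuntimeError(f"{cmd[0]} keeps crashing; giving up")
        if crashes > 1:
            # Back off only on repeated crashes.
            sleep(base_delay * 2 ** (crashes - 2))
```

The `run` and `sleep` parameters exist so the policy can be exercised without starting real processes or waiting.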

Grid Experiences 23

Short-running programs

• Short-running programs crash/hang
• Example: globus-url-copy
  – USCMS testbed: staging data
  – Some fraction of copies hang or fail
  – Programming error + delicate network
  – Hard to reproduce and fix

Grid Experiences 24

Watch programs

• When copy exceeds timeout, kill and retry

• Possible to do in shell scripting languages, but not easy

• Use Fault Tolerant Shell to watch programs
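For comparison, the kill-and-retry pattern can also be sketched in Python using the standard `subprocess` module. The function name `run_with_timeout` and its parameters are assumptions made for this sketch; it is not how the USCMS testbed did it.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, attempts=3):
    """Run cmd, kill it if it exceeds timeout_s, and retry on failure.

    Returns the 1-based number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child process when the timeout expires.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: timed out after {timeout_s}s", file=sys.stderr)
            continue
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}: exit status {result.returncode}", file=sys.stderr)
    raise RuntimeError(f"{cmd[0]} failed {attempts} times")
```

A staging step would pass the copy command as a list, for example `["globus-url-copy", source_url, destination_url]` with placeholder URLs, and pick a timeout well above the normal copy time.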

Grid Experiences 25

Fault Tolerant Shell

• Shell language built for coping with errors

try for 30 minutes
    wget http://www.example.com/file.tar.gz
    gunzip file.tar.gz
    tar xf file.tar
end

• Exponential backoff on failure: wait {1, 2, 4, …} seconds × rand in [1, 2]

Grid Experiences 26

FTSH: exponential backoff

• Why exponential backoff?
  – What if 100 ftsh scripts are executing?
  – Avoid synchronization, reduce load, increase chance of success
  – Similar to Ethernet
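The delay rule quoted for ftsh (1, 2, 4, … seconds, multiplied by a random factor between 1 and 2) can be written as a small Python function. The name `backoff_delay` and the `rng` parameter are assumptions of this sketch, not part of ftsh.

```python
import random


def backoff_delay(attempt, base=1.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles with each attempt and is multiplied by a random
    factor in [1, 2), so scripts that failed at the same moment do not
    all retry at the same moment.
    """
    return base * (2 ** attempt) * (1.0 + rng())


if __name__ == "__main__":
    # Five retries from each of three independent scripts.
    for script in range(3):
        print([round(backoff_delay(n), 2) for n in range(5)])
```

The random factor is what breaks the synchronization: without it, 100 scripts that fail together would all come back together after each wait.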

Grid Experiences 27

Fault Tolerant Shell

• Easier to cope with failures:

try 5 times
    wget http://www.example.com/file.tar.gz
catch
    rm -f file.tar.gz
    failure
end

• The catch block cleans up the partially downloaded file, if it exists

Grid Experiences 28

Fault Tolerant Shell

• Flexible

try for 30 minutes
    try for 5 minutes
        wget http://example.com/file.tar.gz
    end
    try for 1 minute or 3 times
        gunzip file.tar.gz
        tar xf file.tar
    catch
        rm -rf file.tar
    end
end

• First inner try: cope with network failure
• Second inner try: cope with disk failure

Grid Experiences 29

FTSH: More information

• Work of Doug Thain
  – thain@cs.wisc.edu
• Excellent paper:
  – The Ethernet Approach to Grid Computing, by Doug Thain
  – Available from: http://www.cs.wisc.edu/~thain

• Even if you don’t use FTSH, read this paper!

Grid Experiences 30

Whose error is it?

• The source of an error is not always obvious

• The source of an error influences how you react to the error

• Example: Java universe in Condor

Grid Experiences 31

Java Universe

• Users submit Java jobs to Condor
• Whose error is it? Check result code (it is 1 in every case):
  – 1: Program dereferenced NULL pointer → Job shouldn’t run again
  – 1: Job’s image is corrupt → Job shouldn’t run again
  – 1: VM doesn’t have enough memory to run program → Try another machine with more memory
  – 1: Java installation is misconfigured → Don’t use this machine for Java

Grid Experiences 32

Don’t trust configuration

• User tells Condor: “Java is installed”
  – This is just a hint!
• Condor verifies Java configuration
  – Run simple job, verify output

• If Java works, Condor advertises that Java can be used

• If Java fails, error is reported, Java can’t be used
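The idea of treating configuration as a hint and verifying it before advertising can be sketched in Python. The function `probe`, the attribute name `HasInterpreter`, and the use of the Python interpreter as a stand-in for a JVM are all assumptions of this sketch, not Condor's actual check.

```python
import subprocess
import sys


def probe(cmd, expected_output, timeout_s=30):
    """Run a trivial job and report whether it produced the expected output."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung: the hint was wrong.
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_output


# Advertise the capability only if the probe job really works.
advertised = {
    "HasInterpreter": probe([sys.executable, "-c", "print('hello')"], "hello"),
}
```

For Java, the probe command would run a small known class with the configured JVM and compare its output the same way.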

Grid Experiences 33

Look for error scope

• Add Java wrapper to all Java jobs
  – Run program
  – Examine return code/exception
  – Write all details to file
• Examine output of wrapper, or exception from JVM
  – We know if job is bad
  – We know if JVM is insufficient for job
  – We know if JVM is bad
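One way to picture what the wrapper's report makes possible is a small classifier that maps the recorded exception to an error scope and then to a reaction. This Python sketch is hypothetical: the scope names, the `classify` function, and the exception-to-scope table are illustrative assumptions, not Condor's actual rules. The reactions are the ones listed in the talk.

```python
from enum import Enum


class Scope(Enum):
    PROGRAM = "program"            # bug in the user's code
    JOB = "job"                    # the job's image is corrupt
    VM_TOO_SMALL = "vm too small"  # this JVM cannot hold the job
    INSTALLATION = "installation"  # Java is broken on this machine


ACTION = {
    Scope.PROGRAM: "job shouldn't run again",
    Scope.JOB: "job shouldn't run again",
    Scope.VM_TOO_SMALL: "try another machine with more memory",
    Scope.INSTALLATION: "don't use this machine for Java",
}

# Illustrative mapping from Java exception class to scope (an assumption).
EXCEPTION_SCOPE = {
    "java.lang.NullPointerException": Scope.PROGRAM,
    "java.lang.ClassFormatError": Scope.JOB,
    "java.lang.OutOfMemoryError": Scope.VM_TOO_SMALL,
}


def classify(wrapper_wrote_report, exception_name):
    """Return the error scope, or None if the job finished normally."""
    if not wrapper_wrote_report:
        # The wrapper never ran, so the JVM itself could not start.
        return Scope.INSTALLATION
    if exception_name is None:
        return None
    # Unknown exceptions are blamed on the program in this sketch.
    return EXCEPTION_SCOPE.get(exception_name, Scope.PROGRAM)
```

The point is that the exit code alone cannot drive this decision, while the wrapper's report can.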

Grid Experiences 34

Error Scope

• We could have an entire talk on error scope

• Excellent paper: Error Scope on a Computational Grid: Theory and Practice, by Doug Thain

• Useful paper even if you don’t use Condor or Java

Grid Experiences 35

Many layers in a grid

[Figure: layers involved in running a job. Condor path: condor_submit, Condor job agent, Condor matchmaker, execution computer. Condor-G path: condor_submit, Condor-G job agent, Globus GRAM, Globus gatekeeper, Globus jobmanager, inetd.]

Grid Experiences 36

We forgot inetd

• We submitted 300 jobs at once
• Inetd noticed many connections per second
• Inetd presumed there was a denial of service attack and refused connections for five minutes
• Lots of debugging!

Grid Experiences 37

There are more layers!

[Figure: USCMS Testbed Architecture (a bit dated). Master site: Impala, MOP, DAGMan, Condor-G. Worker: Globus, batch system (Condor, PBS), real work.]

Grid Experiences 38

More layers than that!

1. MCRunJob
2. Impala
3. MOP
4. condor_schedd
5. DAGMan
6. Condor-G condor_schedd
7. condor_gridmanager
8. gahp_server
9. globus-gatekeeper
10. globus-job-manager
11. globus-job-manager-script.pl
12. local batch system submit
13. local batch system execute
14. MOP wrapper
15. Impala wrapper
16. actual job

This disregards inetd, network, file servers, file transfers…

USCMS Testbed Architecture (A bit dated)

Grid Experiences 39

Recovery at multiple levels

• Fault tolerance and recovery are built in at many levels:
  – condor_master: restart daemons
  – condor_schedd: job queue
  – DAGMan: checkpoint DAG of jobs
  – gahp_server: isolate Globus libraries
  – And others…

Grid Experiences 40

Allocate debugging time

• Allocate lots of debugging time
• It is very hard to propagate errors
• How does a user find a remote error?
  – Call system administrator
  – Admin looks through log files for each layer (not accessible to user)

• We need better debugging methods

Grid Experiences 41

Everything will fail (Everything)

• In the USCMS testbed production:
  – Power outage for several hours
  – Network outages: from a few minutes to 11 hours
  – Failed configuration change
  – Site upgraded
  – Jobs accidentally removed
  – Software bugs everywhere

Grid Experiences 42

How do you cope?

• Condor-G:
  – Reporting “Error: job cannot run” is not good enough
  – Resubmit jobs that can be resubmitted, perhaps after a delay
  – Put jobs on hold in queue:
    • User examines hold reason (proxy is expired)
    • User fixes error
    • User restarts job
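The resubmit-or-hold policy can be sketched as a tiny state machine in Python. Everything here is hypothetical: the `Job` fields, the `TRANSIENT` error strings, and the attempt limit are assumptions for illustration and do not reflect how Condor-G classifies failures.

```python
from dataclasses import dataclass

# Hypothetical examples of errors worth retrying without bothering the user.
TRANSIENT = {"gatekeeper unreachable", "network timeout"}


@dataclass
class Job:
    name: str
    state: str = "idle"   # "idle" means queued and eligible to be submitted
    hold_reason: str = ""
    attempts: int = 0


def handle_failure(job, error, max_attempts=3):
    """Requeue the job for a transient error; otherwise hold it with a reason."""
    job.attempts += 1
    if error in TRANSIENT and job.attempts < max_attempts:
        job.state = "idle"
    else:
        job.state = "held"
        job.hold_reason = error
    return job


def release(job):
    """Called after the user has fixed the problem named in hold_reason."""
    job.state = "idle"
    job.hold_reason = ""
    job.attempts = 0
    return job
```

Holding keeps the job in the queue with an explanation, so the user can fix the cause (for example, renew an expired proxy) and release it instead of resubmitting from scratch.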

Grid Experiences 43

Everything will fail (Even the little things)

• Condor Matchmaker:
  – Collects descriptions of machines & jobs
  – Soft state in matchmaker (push smarts to edge, like Internet)
• UDP packets to advertise machines
  – Less overhead than many TCP connections
  – Works great in a LAN

• But…

Grid Experiences 44

Everything will fail: UDP

• But you lose some UDP packets
  – Send packets every five minutes
  – Keep stale information for 15 minutes
  – Be prepared to cope with stale information
  – This has worked for years in Condor
• DØ: matchmaking on grid
  – UDP packets from Korea to Chicago were completely lost on weekdays
  – Added TCP option
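The soft-state scheme (advertise every five minutes, keep information for 15) can be sketched as a table whose entries expire unless refreshed. The class `AdTable` and its methods are assumptions of this Python sketch, not the Condor matchmaker's data structures.

```python
import time


class AdTable:
    """Soft-state table of machine advertisements, keyed by machine name."""

    def __init__(self, max_age_s=15 * 60, clock=time.monotonic):
        self._max_age_s = max_age_s
        self._clock = clock
        self._ads = {}  # machine name -> (time received, ad)

    def update(self, machine, ad):
        """Record a fresh advertisement, replacing any older one."""
        self._ads[machine] = (self._clock(), ad)

    def live(self):
        """Drop ads older than max_age_s and return the remaining ones."""
        now = self._clock()
        self._ads = {m: (t, ad) for m, (t, ad) in self._ads.items()
                     if now - t <= self._max_age_s}
        return {m: ad for m, (t, ad) in self._ads.items()}
```

With ads sent every five minutes and a 15-minute lifetime, a machine stays listed even if two consecutive packets are lost. If every packet is lost, as on the Korea to Chicago link, the machine simply disappears from the table, which is why a TCP option was needed.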

Grid Experiences 45

Be prepared

• Assume everything will fail
  – Have recovery at multiple levels
  – Understand scope of errors
  – Don’t trust configuration:
    • Verify it
    • Install & configure software “on the fly”
• Assume bugs are everywhere
• Build software to cope with errors
