Condor by Example

Douglas Thain Computer Sciences Department University of Wisconsin-Madison October 2000 [email protected] http://www.cs.wisc.edu/condor Condor by Example

Page 1: Condor by Example

Douglas Thain
Computer Sciences Department
University of Wisconsin-Madison

October 2000

[email protected]
http://www.cs.wisc.edu/condor

Condor by Example

Page 2: Condor by Example

www.cs.wisc.edu/condor

Lecture Format:

› In each lecture:
• Lecture to whole group.
• Workshop and examples at computer.

› Oops! Some items are filled in at the last minute. Please fill the _______ with notes.

Page 3: Condor by Example

Outline

› Overview

› Submitting Jobs, Getting Feedback

› Setting Requirements with ClassAds

› Which Universe?

› Move to Workshop

Page 4: Condor by Example

What is Condor?

› Condor converts a collection of unrelated workstations into a high-throughput computing facility.

› Condor uses matchmaking to make sure that everyone is happy.

Page 5: Condor by Example

What is High-Throughput Computing?

› High-performance: CPU cycles/second under ideal circumstances.
• “How fast can I run simulation X on this machine?”

› High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances.
• “How many times can I run simulation X in the next week using all available machines?”

Page 6: Condor by Example

What is High-Throughput Computing?

› Condor does whatever it takes to run your jobs, even if some machines…
• Crash!
• Are disconnected
• Run out of disk space
• Are removed from or added to the pool
• Are put to other uses

Page 7: Condor by Example

What is Matchmaking?

› Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners.

› Users (jobs) have constraints:
• “I need an Alpha with 256 MB RAM”

› Owners (machines) have constraints:
• “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
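
The two-sided nature of matchmaking can be sketched in a few lines of Python. This is only an illustration of the idea on this slide, not Condor's matchmaker: the dictionaries and every key in them (`wants_arch`, `owner_away`, `banned_users`, and so on) are hypothetical names chosen for the example.

```python
# Hypothetical job and machine descriptions (plain dicts, not real ClassAds).
job = {"owner": "alice", "wants_arch": "ALPHA", "wants_memory_mb": 256}
machine = {
    "arch": "ALPHA",
    "memory_mb": 256,
    "owner_away": True,        # the owner is away from the desk
    "banned_users": {"bob"},   # never run jobs owned by Bob
}


def job_accepts(job, machine):
    """User constraint: 'I need an Alpha with 256 MB RAM'."""
    return (machine["arch"] == job["wants_arch"]
            and machine["memory_mb"] >= job["wants_memory_mb"])


def machine_accepts(machine, job):
    """Owner constraint: only when away, and never for banned users."""
    return machine["owner_away"] and job["owner"] not in machine["banned_users"]


def is_match(job, machine):
    """A match needs BOTH sides to be satisfied."""
    return job_accepts(job, machine) and machine_accepts(machine, job)


print(is_match(job, machine))
```

Changing the job owner to "bob", or setting `owner_away` to False, makes the match fail even though the hardware still fits.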

Page 8: Condor by Example

Who uses Condor?

› Hundreds of universities and companies around the world!

› University of Wisconsin, USA
• 682 CPUs in one building
• Computer architecture simulations

› National Institute of Physics, Italy
• 200 CPUs in many cities
• Reconstruction of collider events

› And many others!

Page 9: Condor by Example

What can Condor do for me?

Condor can…

› …increase your throughput.

› …do your housekeeping.

› …improve reliability.

› …give performance feedback.

Page 10: Condor by Example

Cluster Overview

[Diagram: one server (512 MB, 800 MHz, 20 GB disk) and five clients (each 128 MB, 666 MHz, 10 GB disk), connected by a 100 Mb/s network.]

Page 11: Condor by Example

How many machines now?

› The map is out of date!

› The system is always changing.

› First example: What machines (and of what kind) are in the pool now?

Page 12: Condor by Example

How Many Machines?

% condor_status

Name           OpSys        Arch   State      Activity  LoadAv  Mem
lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle      0.000    30
axpd21.pd.inf  OSF1         ALPHA  Owner      Idle      0.266    96
vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy      0.000   256
. . .

                   Machines  Owner  Claimed  Unclaimed  Matched  Preempting
ALPHA/OSF1              115     67       46          1        0           1
INTEL/LINUX              53     18        0         35        0           0
INTEL/LINUX-GLIBC        16      7        0          9        0           0
SUN4u/SOLARIS251          1      1        0          0        0           0
SUN4u/SOLARIS26           6      2        0          4        0           0
SUN4u/SOLARIS27           1      1        0          0        0           0
SUN4x/SOLARIS26           2      1        0          1        0           0

Total                   194     97       46         50        0           1
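
As a small sanity check on how to read the summary, the Total row is the column-wise sum of the per-platform rows, and each Machines count is the sum of that platform's state columns. The Python sketch below re-adds the numbers copied from the condor_status output on this slide; the names `rows` and `column_totals` are just illustrative.

```python
# Per-platform rows copied from the condor_status summary on this slide:
# (Machines, Owner, Claimed, Unclaimed, Matched, Preempting)
rows = {
    "ALPHA/OSF1":        (115, 67, 46, 1, 0, 1),
    "INTEL/LINUX":       (53, 18, 0, 35, 0, 0),
    "INTEL/LINUX-GLIBC": (16, 7, 0, 9, 0, 0),
    "SUN4u/SOLARIS251":  (1, 1, 0, 0, 0, 0),
    "SUN4u/SOLARIS26":   (6, 2, 0, 4, 0, 0),
    "SUN4u/SOLARIS27":   (1, 1, 0, 0, 0, 0),
    "SUN4x/SOLARIS26":   (2, 1, 0, 1, 0, 0),
}


def column_totals(rows):
    """Sum each column across all platforms."""
    return tuple(sum(column) for column in zip(*rows.values()))


totals = column_totals(rows)
print(totals)  # (194, 97, 46, 50, 0, 1)

# Every machine is in exactly one state, so Machines == sum of the states.
for platform, (machines, *states) in rows.items():
    assert machines == sum(states), platform
```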

Page 13: Condor by Example

Machine States

› Most machines will be:
• Owner: The machine’s owner is busy at the console, so no Condor jobs may run.
• Claimed: Condor has selected the machine to run jobs for other users.

Page 14: Condor by Example

Machine States

› Only a few should be:
• Unclaimed: The owner is gone, but Condor has not yet selected the machine.
• Matched: Between claimed and unclaimed.
• Preempting: Condor is busy removing a job.

Page 15: Condor by Example

More Things to Try

% condor_status -help
% condor_status -avail
% condor_status -run
% condor_status -total
% condor_status -pool condor.cs.wisc.edu

Page 16: Condor by Example

Submitting Jobs

Page 17: Condor by Example

Steps to Running a Job

› Re-link for Condor.

› Submit the job.

› Watch the progress.

› Receive email when done.

Page 18: Condor by Example

Example Job

Integrate sin(x) from 0 to 10, using 10 million slices.

Simple program takes a few seconds.

% ./integrate 10 10000000
2.0445075

Page 19: Condor by Example

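C     Integrate sin(x) from 0 to LIMIT using SLICES steps.
      PROGRAM INTEGRATE
      CHARACTER STR*10
      REAL X, SLICES, LIMIT

C     Read the limit and the number of slices from the command line.
      CALL GETARG(1,STR)
      READ (STR,*) LIMIT
      CALL GETARG(2,STR)
      READ (STR,*) SLICES

      TOTAL=0
      STEP=LIMIT/SLICES

C     Add up the area of one thin rectangle per slice.
      DO X=0, LIMIT, STEP
         TOTAL = TOTAL + SIN(X)*STEP
      END DO

      PRINT *, TOTAL

      END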
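
For readers who want to cross-check the example job without a Fortran compiler, here is a rough Python sketch of the same rectangle-rule idea, compared against the closed form 1 - cos(10), which is roughly 1.839. The function name `integrate_sin` is made up for this sketch, and it uses double precision with far fewer slices so that it runs instantly. The 2.0445075 shown on the earlier slide differs from the closed form; that may be an effect of single-precision REAL arithmetic over ten million steps, but the slides do not say.

```python
import math


def integrate_sin(limit: float, slices: int) -> float:
    """Left rectangle-rule estimate of the integral of sin(x) on [0, limit]."""
    step = limit / slices
    # fsum keeps the running total accurate across many small terms.
    return math.fsum(math.sin(i * step) * step for i in range(slices))


exact = 1.0 - math.cos(10.0)           # closed-form value of the integral
approx = integrate_sin(10.0, 100_000)  # fewer slices than the slide, for speed
print(f"rectangle rule: {approx:.6f}   closed form: {exact:.6f}")
```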

Page 20: Condor by Example

Re-link for Condor

› If you normally compile like this:
• g77 integrate.f -o integrate

› Then compile for Condor like this:
• condor_compile g77 integrate.f -o integrate

Page 21: Condor by Example

Submit the Job

› Create a submit file:
• emacs integrate.submit &

› Submit the job:
• condor_submit integrate.submit

Contents of integrate.submit:

Executable = integrate
Arguments = 10 10000000
Output = integrate.out
Log = integrate.log
queue

Page 22: Condor by Example

Watch the Progress

% condor_q

-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

5.0 thain 6/21 12:40 0+00:00:15 R 0 2.5 fib 40

Callouts:
• ID: Each job gets a unique number.
• ST: Status (Unexpanded, Running or Idle).
• SIZE: Size of program image (MB).

Page 23: Condor by Example

Receive E-mail When Done

This is an automated email from the Condor system
on machine "axpbo8.bo.infn.it". Do not reply.

Your condor job /tmp_mnt/usr/users/ccl/thain/test/fib 40
exited with status 0.

Submitted at: Wed Jun 21 14:24:42 2000
Completed at: Wed Jun 21 14:36:36 2000

Real Time: 0 00:11:54
Run Time: 0 00:06:52
Committed Time: 0 00:01:37
. . .
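
The "Real Time" figure in this email is simply the wall-clock gap between the two timestamps. The short Python sketch below recomputes it from the "Submitted at" and "Completed at" values shown above; the variable names are illustrative only, and the parsing assumes English day and month names.

```python
from datetime import datetime

# Timestamp layout used in the notification email, e.g. "Wed Jun 21 14:24:42 2000".
FMT = "%a %b %d %H:%M:%S %Y"

submitted = datetime.strptime("Wed Jun 21 14:24:42 2000", FMT)
completed = datetime.strptime("Wed Jun 21 14:36:36 2000", FMT)

real_time = completed - submitted
print(real_time)  # 0:11:54, matching "Real Time: 0 00:11:54"
```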

Page 24: Condor by Example

Running Many Processes

› 100 processes are almost as easy as 1.

› Each condor_submit makes one cluster of one or more processes.

› Add the number of processes to run to the Queue statement.

› Use the $(PROCESS) variable to give each process slightly different instructions.

Page 25: Condor by Example

Running Many Processes

› Run the same program on 50 different intervals.

› Process numbers start at 0, so output goes in integrate.out.0, integrate.out.1, and so on…

Executable = integrate

Arguments = $(PROCESS) 10000000

Output = integrate.out.$(PROCESS)

Log = integrate.log

Queue 50
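
To see what "Queue 50" does with $(PROCESS), the following Python sketch mimics the substitution with plain string replacement. It is not how condor_submit is implemented; `expand_process` is a hypothetical helper that only illustrates that each of the 50 processes gets its own number, counting from 0.

```python
def expand_process(template: str, count: int) -> list[str]:
    """Return the template once per process, with $(PROCESS) filled in as 0..count-1."""
    return [template.replace("$(PROCESS)", str(proc)) for proc in range(count)]


arguments = expand_process("$(PROCESS) 10000000", 50)
outputs = expand_process("integrate.out.$(PROCESS)", 50)

# Show the first and last process of the cluster.
print(arguments[0], "->", outputs[0])
print(arguments[-1], "->", outputs[-1])
```

With the submit file above, the first process therefore integrates up to a limit of 0 and the last up to 49.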

Page 26: Condor by Example

Running Many Processes

% condor_q

-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038>

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

9.3 thain 6/23 10:47 0+00:05:40 R 0 2.5 fib 3

9.6 thain 6/23 10:47 0+00:05:11 R 0 2.5 fib 6

9.7 thain 6/23 10:47 0+00:05:09 R 0 2.5 fib 7

. . .

21 jobs; 2 idle, 19 running, 0 held

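Callouts: in an ID such as 9.3, the part before the dot (9) is the cluster number and the part after it (3) is the process number.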

Page 27: Condor by Example

Where Are They Running?

% condor_q -run

-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :

ID OWNER SUBMITTED RUN_TIME HOST(S)

9.47 thain 6/23 10:47 0+00:07:03 ax4bbt.bo.infn.it

9.48 thain 6/23 10:47 0+00:06:51 pewobo1.bo.infn.it

9.49 thain 6/23 10:47 0+00:06:30 osde01.pd.infn.it

Callout: the HOST(S) column shows each job's current location.

Page 28: Condor by Example

Help! I’m buried in Email!

› By default, Condor sends one email for each completed process.

› Add one of these to your submit file:
• notification = error
• notification = never

› To send it to someone else:
• notify_user = [email protected]

Page 29: Condor by Example

Removing Processes

› Remove one process:
• condor_rm 9.47

› Remove a whole cluster:
• condor_rm 9

› Remove everything!
• condor_rm -a

Page 30: Condor by Example

Getting Feedback

Page 31: Condor by Example

What have I done?

› The user log file (fib.log) shows a chronological list of everything important that happened to a job.

001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>

004 (007.035.000) 06/21 17:04:58 Job was evicted.

009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.
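
Because each event line in the user log has a regular shape, it is easy to pick apart with a script. The Python sketch below parses the three lines shown on this slide. It assumes the parenthesised triple is cluster.process.subprocess and that every event starts with a three-digit code; both are readings of this sample rather than a full description of the log format, and `parse_event` is a hypothetical helper.

```python
import re

# One event line: code, (cluster.proc.subproc), MM/DD, HH:MM:SS, free-text message.
EVENT_RE = re.compile(
    r"^(?P<event>\d{3}) "
    r"\((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<subproc>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<message>.*)$"
)


def parse_event(line):
    """Return a dict describing one log event, or None if the line does not fit."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    for key in ("cluster", "proc", "subproc"):
        event[key] = int(event[key])
    return event


sample = [
    "001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>",
    "004 (007.035.000) 06/21 17:04:58 Job was evicted.",
    "009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.",
]
for line in sample:
    event = parse_event(line)
    print(event["time"], f'{event["cluster"]}.{event["proc"]}', event["message"])
```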

Page 32: Condor by Example

What have I done?

% condor_history

ID OWNER SUBMITTED CPU_USAGE ST COMPLETED CMD

9.3 thain 6/23 10:47 0+00:00:00 C 6/23 10:58 fib 3

9.40 thain 6/23 10:47 0+00:00:24 C 6/23 10:59 fib 40

9.10 thain 6/23 10:47 0+00:00:00 C 6/23 11:01 fib 10

9.47 thain 6/23 10:47 0+00:05:45 C 6/23 11:01 fib 47

9.7 thain 6/23 10:47 0+00:00:00 C 6/23 11:01 fib 7

Page 33: Condor by Example

Brief I/O Summary

% condor_q -io

-- Schedd: c01.cs.wisc.edu : <128.105.146.101:2016>

ID      OWNER  READ      WRITE     SEEK  XPUT       BUFSIZE   BLKSIZE
756.15  joe    244.9 KB  379.8 KB    71  1.3 KB/s   512.0 KB  32.0 KB
758.24  joe    198.8 KB  219.5 KB    78  45.0 B /s  512.0 KB  32.0 KB
758.26  joe     44.7 KB   22.1 KB  2727  13.0 B /s  512.0 KB  32.0 KB

3 jobs; 0 idle, 3 running, 0 held

Page 34: Condor by Example

Complete I/O Summary in Email

Your condor job "/usr/joe/records.remote input output" exited with status 0.

Total I/O:
104.2 KB/s effective throughput
5 files opened
104 reads totaling 411.0 KB
316 writes totaling 1.2 MB
102 seeks

I/O by File:

buffered file /usr/joe/input
opened 2 times
100 reads totaling 398.6 KB
311 writes totaling 1.2 MB
101 seeks

(Only since Condor Version 6.1.11)

Page 35: Condor by Example

Complete I/O Summary in Email

› The summary helps identify performance problems.
• Even advanced users don't know exactly how their programs and libraries operate.

Page 36: Condor by Example

Complete I/O Summary in Email

› Example: CMSSIM - collider simulation
• “Why is this job so slow?”
• Data summary: read 250 MB from 20 MB file.
• Very high SEEK total -> random access.
• Solution: Increase buffer to 20 MB.
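
The arithmetic behind this diagnosis is short enough to write down. The Python sketch below uses only the two numbers from the slide (250 MB read, 20 MB file) plus the proposed 20 MB buffer; `reread_factor` and `buffer_holds_file` are hypothetical helper names, and the reasoning that a buffer as large as the file avoids repeated remote reads is the slide's suggestion restated, not a measured result.

```python
def reread_factor(total_read_mb: float, file_size_mb: float) -> float:
    """How many times over the file's contents were read, on average."""
    return total_read_mb / file_size_mb


def buffer_holds_file(buffer_mb: float, file_size_mb: float) -> bool:
    """True when the whole file fits in the buffer at once."""
    return buffer_mb >= file_size_mb


factor = reread_factor(250, 20)
print(f"The file was read about {factor} times over.")  # 12.5
print("20 MB buffer holds the whole file:", buffer_holds_file(20, 20))
```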

Page 37: Condor by Example

Who Uses Condor?

% condor_q -global

-- Schedd: to02xd.to.infn.it : <192.84.137.2:1030>

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

127.0 garzelli 6/21 18:45 1+14:18:16 R 0 17.2 tosti2trisdn

-- Schedd: quark.ts.infn.it : <140.105.6.101:3908>

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

600.0 dellaric 4/10 14:57 55+09:20:31 R 0 9.1 john p2.dat

665.0 dellaric 6/2 11:14 20+03:27:30 R 0 9.2 john p1.dat

788.0 pamela 6/20 09:27 3+04:41:43 R 0 15.4 montepamela

Page 38: Condor by Example

Who uses Condor?

% condor_status -submitters

Name              Machine     Running  IdleJobs  MaxJobsRunning
[email protected]  decux1.pv.       22        34
[email protected]  quark.ts.i        6         1
[email protected]  to05xd.to.       21        49             200
. . .

                  RunningJobs  IdleJobs
[email protected]            0
[email protected]            6
[email protected]           22        34

Total                      59        86

Page 39: Condor by Example

Who Uses Condor?

% condor_userprio

Last Priority Update: 6/23 16:27
                               Effective
User Name                      Priority
------------------------------ ---------
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]                   19.72
------------------------------ ---------
Number of users shown: 8

Page 40: Condor by Example

Who Uses Condor?

› The user priority is computed by Condor to estimate how much of the pool’s CPU resources have been used by each submitter.

› Lighter users receive a numerically lower priority value, which is better: they will be allocated CPUs before heavy users.

› Users consuming the same amount of CPU will be allocated an equal amount.

Page 41: Condor by Example

Measuring Goodput

› Goodput is the amount of time a workstation spends making forward progress on work assigned by Condor.

› This is a big topic all by itself: http://www.cs.wisc.edu/condor/goodput

Page 42: Condor by Example

Measuring Goodput

% condor_q -goodput

-- Submitter: coral.cs.wisc.edu : <128.105.175.116:45697> : coral.cs.wisc.edu

ID OWNER SUBMITTED RUN_TIME GOODPUT CPU_UTIL Mb/s

719.74 thain 6/23 07:35 2+20:47:59 100.0% 87.6% 0.00

719.75 thain 6/23 07:35 2+20:38:45 40.5% 99.8% 0.00

719.76 thain 6/23 07:35 2+20:38:16 96.9% 98.7% 0.00

719.77 thain 6/23 07:35 2+21:10:06 100.0% 99.8% 0.00
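
One way to read this table is to turn each row into an amount of useful time. The Python sketch below assumes, and this is an interpretation rather than something the slides state, that the GOODPUT percentage is the share of RUN_TIME spent making forward progress. The helper names `parse_run_time` and `useful_seconds` are made up for the example.

```python
def parse_run_time(text: str) -> int:
    """Convert a condor_q run time such as '2+20:38:45' (days+HH:MM:SS) to seconds."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


def useful_seconds(run_time: str, goodput_percent: float) -> float:
    """Seconds of forward progress, ASSUMING goodput is a fraction of run time."""
    return parse_run_time(run_time) * goodput_percent / 100.0


# Rows copied from the condor_q -goodput listing: (ID, RUN_TIME, GOODPUT %).
jobs = [
    ("719.74", "2+20:47:59", 100.0),
    ("719.75", "2+20:38:45", 40.5),
    ("719.76", "2+20:38:16", 96.9),
    ("719.77", "2+21:10:06", 100.0),
]
for job_id, run_time, goodput in jobs:
    hours = useful_seconds(run_time, goodput) / 3600
    print(f"{job_id}: about {hours:.1f} useful hours")
```

Under that assumption, job 719.75 stands out: it has run about as long as its neighbours but kept well under half of the work.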

Page 43: Condor by Example

Setting Requirements

› We believe that Condor must allow both users (jobs) and owners (machines) to set requirements.

› This is an absolute necessity in order to convince people to participate in the community.

Page 44: Condor by Example

ClassAds

› ClassAds are a simple language for describing both the properties and the requirements of jobs and machines.

› Condor stores nearly everything in ClassAds. Use the -l option to condor_q and condor_status to get the full details.

Page 45: Condor by Example

ClassAd for a Machine

› condor_status -l axpbo8

MyType = "Machine"
TargetType = "Job"
Name = "axpbo8.bo.infn.it"
START = TRUE
VirtualMemory = 342696
Disk = 28728536
Memory = 160
Cpus = 1
Arch = "ALPHA"
OpSys = "OSF1"

Page 46: Condor by Example

ClassAd for a Job

› condor_q -l 9.49

MyType = "Job"
TargetType = "Machine"
Owner = "thain"
Cmd = "/tmp_mnt/usr/users/ccl/thain/test/fib"
Out = "fib.out.49"
Args = "49"
ImageSize = 2544
DiskUsage = 2544
Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)
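
To make the Requirements expression concrete, the Python sketch below evaluates it by hand against the machine ad for axpbo8 shown on the previous slide. It is a manual translation of this one expression, not a ClassAd interpreter: machine attributes (Arch, OpSys, Disk, VirtualMemory) are looked up in one dictionary and job attributes (DiskUsage, ImageSize) in the other, which is an assumption about how the names resolve.

```python
# Attribute values copied from the two ClassAds shown on the slides.
machine = {
    "Name": "axpbo8.bo.infn.it",
    "Arch": "ALPHA",
    "OpSys": "OSF1",
    "Disk": 28728536,
    "VirtualMemory": 342696,
    "Memory": 160,
    "Cpus": 1,
}
job = {"Owner": "thain", "ImageSize": 2544, "DiskUsage": 2544}


def requirements(job, machine):
    """Hand translation of the job's Requirements expression into Python."""
    return (
        machine["Arch"] == "ALPHA"
        and machine["OpSys"] == "OSF1"
        and machine["Disk"] >= job["DiskUsage"]
        and machine["VirtualMemory"] >= job["ImageSize"]
    )


print(requirements(job, machine))  # True: axpbo8 satisfies the job
```

Swapping the machine's OpSys for "LINUX", or shrinking its Disk below DiskUsage, makes the same expression evaluate to False.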

Page 47: Condor by Example

Default Requirements

› By default, Condor assumes the requirements for your job are: “I need a machine with…”
• The same operating system and architecture as my workstation.
• Enough disk to store the program.
• Enough virtual memory to run the program.

Page 48: Condor by Example

ClassAd Requirements

› Similar to C/C++/Java expressions:
• Symbols: Arch, OpSys, Memory, Mips
• Values: 15, 6.5, "LINUX"
• Operators: ==, <, >, <=, >=, &&, ||, ( )

Page 49: Condor by Example

Adding Requirements

› In the submit file, add a line beginning with "requirements = "

Executable = fib

Arguments = 40

Output = fib.out

Log = fib.log

Requirements = (Memory > 64)

queue

Page 50: Condor by Example

Example Requirements

› (Memory>64)

› (Machine == "axpbo3.bo.infn.it")

› (Mips>100) || (Kflops>10000)

› (Subnet != "131.154.10") && (Disk > 20000000)

Page 51: Condor by Example

Preferences

› Condor assumes that any machines that match your requirements are suitable.

› However, you may prefer some machines over others. (100 Mips is better than 10)

› To indicate a preference, you may provide a ClassAd expression which ranks all matches.

Page 52: Condor by Example

Rank

› The rank expression is evaluated into a number for every potential matching machine.

› A machine with a higher number will be preferred over a machine with a lower number.

Page 53: Condor by Example

Rank Examples

› Prefer machines with more Mips:
• Rank = Mips

› Prefer machines with a high ratio of memory to CPU performance:
• Rank = Memory/Mips

› Prefer more memory, but add 100 to the rank if the machine is Solaris 2.7:
• Rank = Memory + 100*(OpSys=="SOLARIS27")
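
The last rank expression relies on a comparison counting as 1 when true and 0 when false. The Python sketch below shows the effect on three machines; the machine list is invented for illustration, and the `rank` function is a hand translation of that one expression rather than anything Condor provides.

```python
# Hypothetical machines that already satisfy the job's requirements.
machines = [
    {"Name": "a", "Memory": 128, "OpSys": "SOLARIS27"},
    {"Name": "b", "Memory": 200, "OpSys": "LINUX"},
    {"Name": "c", "Memory": 256, "OpSys": "OSF1"},
]


def rank(machine):
    """Memory plus a bonus of 100 when the machine runs Solaris 2.7."""
    # In Python, True counts as 1 and False as 0 in arithmetic.
    return machine["Memory"] + 100 * (machine["OpSys"] == "SOLARIS27")


# Higher rank is preferred, so sort in descending order.
ordered = sorted(machines, key=rank, reverse=True)
for m in ordered:
    print(m["Name"], rank(m))
```

Here the Solaris 2.7 machine with 128 MB (rank 228) overtakes the 200 MB Linux machine (rank 200) but still loses to the 256 MB machine (rank 256).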

Page 54: Condor by Example

Standard or Vanilla?

Page 55: Condor by Example

Which Universe?

› Each Condor universe provides different services to different kinds of programs:
• Standard: Relinked UNIX programs
• Vanilla: Unmodified UNIX programs
• PVM, Scheduler, Globus (not described here)

Page 56: Condor by Example

Standard Universe

› Submit a specially-linked UNIX application to the Condor system.

› Advantages:
• Checkpointing for fault tolerance.
• Remote I/O services:
  • Friendly environment anywhere in the world.
  • Data buffering and staging.
  • I/O performance feedback.
  • User remapping of data sources.

Page 57: Condor by Example

Standard Universe

› Disadvantages:
• Must statically link with Condor library.
• Limited class of applications:
  • Single-process UNIX binaries.
  • Certain system calls prohibited.

Page 58: Condor by Example

System Call Limitations

› Standard universe does not allow:
• Multiple processes:
  • fork(), exec(), system()
• Inter-process communication:
  • semaphores, messages, shared memory
• Complex I/O:
  • mmap(), select(), poll(), non-blocking I/O, …
• Kernel-level threads
  • (User level threads are OK.)

Page 59: Condor by Example

System Call Limitations

› Too restrictive? Use the vanilla universe.

Page 60: Condor by Example

Vanilla Universe

› Submit any sort of UNIX program to the Condor system.

› Advantages:
• No relinking required.
• Any program at all, including
  • Binaries
  • Shell scripts
  • Interpreted programs (java, perl)
  • Multiple processes

Page 61: Condor by Example

Vanilla Universe

› Disadvantages:
• No checkpointing.
• Very limited remote I/O services.
  • Specify input files explicitly.
  • Specify output files explicitly.
• Condor will refuse to start a vanilla job on a machine that is unfriendly.
  • ClassAds: FilesystemDomain and UIDDomain

Page 62: Condor by Example

Which Universe?

› Standard:
• Good for mixed Condor pools, flocked pools, and the Grid at large.

› Vanilla:
• Good for a Condor pool of identical machines.

Page 63: Condor by Example

Conclusion

› Condor expands your reach to many CPUs, even those you cannot log in to.

› Condor makes it easy to run and manage large numbers of jobs.

› Good candidates for the standard universe are single-process CPU-bound jobs with simple I/O.

› Too restrictive? Use the vanilla universe, but expect fewer available machines.

Page 64: Condor by Example

Move to Workshop

Meet again in room ____ at _____.

Bring printouts to follow along.