Post on 05-Jan-2016
Dan Bradley
University of Wisconsin-Madison
Condor and DISUN Teams
dan@hep.wisc.edu
http://www.cs.wisc.edu/condor
Condor Administrator’s How-to
Dan, Condor Week 2008 www.cs.wisc.edu/condor
Where to Find the Online How-to Collection
1. Go to http://www.cs.wisc.edu/condor/
2. Click on “Condor Admin How-to Recipes”
Currently, that takes you here:
http://nmi.cs.wisc.edu/node/1465
Brief Overview of Selected Bits
Question
› How does Condor decide which job gets to run on an execute machine?
The Life of a Condor Job
[Diagram. Labels: condor_submit; schedd (job queue); job ClassAd; machine ClassAd; central manager (collector + negotiator); central manager 2; central manager 3 (collector + negotiator); flocking; startd (job executor); job runs]
First Stop: Authorization
› User must be authorized to submit to schedd
ALLOW_WRITE = allow1, allow2, …
DENY_WRITE = deny1, deny2, …
user@uid_domain/network
› By default, all authenticated users may submit jobs within the trusted network
ALLOW_WRITE = */network
HOSTALLOW_WRITE = network (old style)
Next Stop: The Job Queue
› MAX_JOBS_RUNNING = 200
› Job priority = integer: orders a user’s jobs; higher priority will run sooner
Authorization of the Schedd to Join Pool
› ALLOW_ADVERTISE_SCHEDD
  DENY_ADVERTISE_SCHEDD
  Default: ALLOW/DENY_DAEMON
  • Default: ALLOW/DENY_WRITE
› COLLECTOR_REQUIREMENTS
  Default: true
Next Stop: Negotiator
Fair Share
• User priority: inversely proportional to fair share
• Example: two users, 60 batch slots
  • priority 50 - gets 40 slots
  • priority 100 - gets 20 slots
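The arithmetic in this example can be checked with a small Python sketch. It assumes each user's share is proportional to 1/priority, which is one reading of "inversely proportional"; the function name fair_share and the user names are hypothetical, and this is not the negotiator's actual code.

```python
from fractions import Fraction

def fair_share(priorities, total_slots):
    """Split total_slots so each user's share is proportional to 1/priority.

    priorities: dict of user name -> integer user priority (lower is better).
    Simplified illustration only; not the negotiator's actual algorithm.
    """
    weights = {user: Fraction(1, prio) for user, prio in priorities.items()}
    total_weight = sum(weights.values())
    return {user: total_slots * w / total_weight for user, w in weights.items()}

# Two users competing for 60 batch slots, as on the slide.
shares = fair_share({"user_a": 50, "user_b": 100}, 60)
print(shares["user_a"], shares["user_b"])  # 40 20
```

Exact fractions are used so the result comes out as whole slots without floating-point rounding.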
Fair Share Dynamics
› User priority changes over time: it tends toward the number of slots in use
› Example: a user steadily running 100 jobs has priority 100. After the user stops running jobs:
  • 1 day later: priority 50
  • 2 days later: priority 25
› Configure speed of adjustment:
  PRIORITY_HALFLIFE = 86400
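A rough way to reproduce these numbers is an exponential decay toward the number of slots in use. The sketch below is an assumed simplification of the behavior described on the slide; the function name decayed_priority is hypothetical, and details of the real accountant are ignored.

```python
def decayed_priority(start_priority, slots_in_use, elapsed_seconds, halflife=86400):
    """Move priority toward slots_in_use, halving the gap every halflife seconds.

    Simplified model of the slide's description, not the accountant's code.
    """
    gap = start_priority - slots_in_use
    return slots_in_use + gap * 0.5 ** (elapsed_seconds / halflife)

# User had priority 100 and then stops running jobs (0 slots in use).
print(decayed_priority(100, 0, 1 * 86400))  # 50.0
print(decayed_priority(100, 0, 2 * 86400))  # 25.0
```

A smaller PRIORITY_HALFLIFE makes past usage fade faster; a larger one makes it count for longer.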
Modified Fair Share
› User Priority Factor multiplies the “real user priority”; the result is called “effective user priority”
› Example:
  condor_userprio -setfactor atlas@hep.wisc.edu 4.0
  condor_userprio -setfactor cms@hep.wisc.edu 1.0
  • atlas steadily uses 10 slots - effective priority 40
  • cms steadily uses 20 slots - effective priority 20
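The example numbers follow directly from the multiplication. A minimal Python sketch, assuming the steady-state real priority equals the number of slots in use (the function name effective_priority is hypothetical):

```python
def effective_priority(real_priority, factor):
    """Effective user priority = priority factor * real user priority."""
    return factor * real_priority

# Steady state from the slide: real priority equals slots in use.
atlas = effective_priority(real_priority=10, factor=4.0)
cms = effective_priority(real_priority=20, factor=1.0)
print(atlas, cms)  # 40.0 20.0
```

Since fair share is inversely proportional to priority, cms ends up entitled to twice the share of atlas in this example.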
Reporting Condor Pool Usage
% condor_userprio -usage -allusers
Last Priority Update: 7/30 09:59
                               Accumulated      Usage             Last
User Name                      Usage (hrs)    Start Time       Usage Time
------------------------------ ----------- ---------------- ----------------
…
osg_usatlas1@hep.wisc.edu        599739.09  4/18/2006 14:37  7/30/2007 07:24
jherschleb@lmcg.wisc.edu         799300.91  4/03/2006 12:56  7/30/2007 09:59
szhou@lmcg.wisc.edu             1029384.68  4/03/2006 12:56  7/30/2007 09:59
osg_cmsprod@hep.wisc.edu        2013058.70  4/03/2006 16:54  7/30/2007 09:59
------------------------------ ----------- ---------------- ----------------
Number of users: 271            8517482.95  4/03/2006 12:56  7/29/2007 10:00
› When upgrading Condor, preserve the central manager’s AccountantLog. This happens automatically if you follow the general rule: preserve Condor’s LOCAL_DIR
Matchmaking
› Job requirements and machine requirements must both be met
› Machine requirements are configured via the START expression
START = Owner == "appinstaller"
Adding to Job Requirements
APPEND_REQUIREMENTS = MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True
Adding Attribute to Machine ClassAd
IsAppInstallerMachine = True
STARTD_ATTRS = $(STARTD_ATTRS) IsAppInstallerMachine
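The two-sided match can be sketched in Python using plain dicts as stand-ins for ClassAds. This is an illustration, not Condor's ClassAd evaluator. It assumes that only the installer machine carries the START expression shown earlier and that every other machine accepts any job; the machine names are hypothetical.

```python
def start_expr(job):
    """Machine side: START = Owner == "appinstaller"."""
    return job.get("Owner") == "appinstaller"

def job_requirements(job, machine):
    """Job side: MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True.

    `is True` mimics =?=, which is false (not undefined) when the attribute is missing.
    """
    return job.get("Owner") != "appinstaller" or machine.get("IsAppInstallerMachine") is True

def matches(job, machine):
    """A match needs both the machine's START and the job's requirements."""
    # Assumption: only the installer machine sets the START expression above;
    # every other machine is treated as START = True.
    if machine.get("IsAppInstallerMachine") is True:
        machine_ok = start_expr(job)
    else:
        machine_ok = True
    return machine_ok and job_requirements(job, machine)

installer_machine = {"Name": "install01", "IsAppInstallerMachine": True}  # hypothetical
ordinary_machine = {"Name": "node01"}                                     # hypothetical
installer_job = {"Owner": "appinstaller"}
user_job = {"Owner": "dan"}

print(matches(installer_job, installer_machine))  # True
print(matches(installer_job, ordinary_machine))   # False
print(matches(user_job, installer_machine))       # False
print(matches(user_job, ordinary_machine))        # True
```

The START expression keeps ordinary jobs off the installer machine, and the appended job requirement keeps installer jobs off ordinary machines.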
Choosing Between Matching Machines
1. NEGOTIATOR_PRE_JOB_RANK
2. job rank expression
3. NEGOTIATOR_POST_JOB_RANK
4. PREEMPTION_RANK
Example
NEGOTIATOR_PRE_JOB_RANK = (IsDesktop =!= True && isUndefined(RemoteOwner)) + isUndefined(RemoteOwner)
› Most desirable to least:
  2 unclaimed and not a desktop
  1 unclaimed and desktop
  0 claimed
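The three rank values can be reproduced with a Python stand-in for the expression. The dicts below are hypothetical machine ads, a missing key plays the role of an undefined attribute, and this is not Condor's ClassAd evaluator.

```python
def pre_job_rank(machine):
    """Python stand-in for the NEGOTIATOR_PRE_JOB_RANK expression on the slide."""
    unclaimed = "RemoteOwner" not in machine            # isUndefined(RemoteOwner)
    not_desktop = machine.get("IsDesktop") is not True  # IsDesktop =!= True
    return int(not_desktop and unclaimed) + int(unclaimed)

machines = [  # hypothetical machine ads
    {"Name": "claimed-node", "RemoteOwner": "cms@hep.wisc.edu"},
    {"Name": "idle-desktop", "IsDesktop": True},
    {"Name": "idle-node"},
]

# Higher rank is more desirable, so sort in descending order.
for m in sorted(machines, key=pre_job_rank, reverse=True):
    print(pre_job_rank(m), m["Name"])
# 2 idle-node
# 1 idle-desktop
# 0 claimed-node
```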
Authorizing Schedd to Claim Startd
› ALLOW/DENY_WRITE
› It is the schedd which is authorized by the startd, not the user.
Preemption
Machine Rank
› Numerical expression: higher number preempts lower number
  • user priority is secondary to rank, because a higher-rank job preempts the claim to the machine
› Example: CMS gets 1st prio, CDF gets 2nd, others 3rd
  RANK = 2*(User == "cms@hep.wisc.edu") + 1*(User == "cdf@hep.wisc.edu")
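As a sketch of how this RANK expression orders users, the Python below evaluates it and compares a candidate job against the running one. The helper names and the third user are hypothetical, and the comparison only models the "higher number preempts lower number" rule.

```python
def machine_rank(user):
    """RANK = 2*(User == "cms@hep.wisc.edu") + 1*(User == "cdf@hep.wisc.edu")."""
    return 2 * (user == "cms@hep.wisc.edu") + 1 * (user == "cdf@hep.wisc.edu")

def rank_preempts(new_user, running_user):
    """A candidate job preempts the claim only if the machine ranks it higher."""
    return machine_rank(new_user) > machine_rank(running_user)

print(machine_rank("cms@hep.wisc.edu"))  # 2
print(machine_rank("cdf@hep.wisc.edu"))  # 1
print(machine_rank("dan@hep.wisc.edu"))  # 0
print(rank_preempts("cms@hep.wisc.edu", "cdf@hep.wisc.edu"))  # True
print(rank_preempts("cdf@hep.wisc.edu", "cms@hep.wisc.edu"))  # False
```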
Another Rank Example
Rank = (Group =?= "LMCG") * (1000 + RushJob)
Note on Scope of Condor Policies
› pool-wide scope: example negotiator
  • user priorities, factors, etc.
  • preemption policy related to user priority
  • steering jobs via negotiator job rank
› execute machine/slot scope: startd
  • machine rank, requirements
  • preemption/suspension policy
  • customized machine ClassAd values
› submit machine scope
  • queue policy, automatic additions to job requirements, and insertion of arbitrary ClassAd attributes into job
› personal scope
  • environmental configurations: _CONDOR_<config val>=value
Preemption Policy
› Should Condor jobs yield to non-Condor activity on the machine?
› Should some types of jobs never be interrupted? After 4 days?
› Should some jobs immediately preempt others? After 30 minutes?
› Is suspension more desirable than killing?
› Can need for preemption be decreased by steering jobs towards the right machines?
Example Preemption Policy
When a claim is preempted, do not allow killing of jobs younger than 4 days old.
MaxJobRetirementTime = 3600 * 24 * 4
› Applies to all forms of preemption: user priority, machine rank, machine activity, graceful shutdown
Another Preemption Policy
› Expression can refer to attributes of batch slot and job, so can be highly customized.
MaxJobRetirementTime = 3600 * 24 * 4 * (OSG_VO =?= "uscms")
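To see what this expression evaluates to for different jobs, here is a Python sketch. It assumes the retirement time is counted from the start of the job's run, and the helper may_kill_now is hypothetical; it only illustrates the "do not kill jobs younger than 4 days" rule.

```python
FOUR_DAYS = 3600 * 24 * 4  # seconds

def max_job_retirement_time(job):
    """MaxJobRetirementTime = 3600 * 24 * 4 * (OSG_VO =?= "uscms")."""
    # A missing OSG_VO compares unequal, like =?= with an undefined attribute.
    return FOUR_DAYS * (job.get("OSG_VO") == "uscms")

def may_kill_now(job, seconds_running):
    """After preemption, the job may be killed once it has run past its retirement time."""
    return seconds_running >= max_job_retirement_time(job)

cms_job = {"OSG_VO": "uscms"}
other_job = {}  # OSG_VO undefined

print(max_job_retirement_time(cms_job))    # 345600
print(max_job_retirement_time(other_job))  # 0
print(may_kill_now(cms_job, 3600))         # False
print(may_kill_now(other_job, 3600))       # True
```

Only uscms jobs get the four-day grace period; every other job has a retirement time of zero and can be killed as soon as its claim is preempted.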
More Preemption Controls
› PREEMPTION_REQUIREMENTS controls user-priority based preemption at
the level of the negotiator
› PREEMPT/SUSPEND controls preemption by machine activity
(e.g. keyboard or cpu activity)
› RANK allows preemption by more desirable jobs
Preemption Policy Pitfall
› If you disable all forms of preemption, you probably want to limit lifespan of claims:
PREEMPTION_REQUIREMENTS = False
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = 3600
• Otherwise, reallocation of resources will not happen until a user runs out of matching jobs.
What Happens to Preempted Jobs?
› Back to idle in job queue
  NumJobStarts >= 1
› job policy:
  periodic_hold, periodic_remove
› admin policy:
  SYSTEM_PERIODIC_HOLD
  SYSTEM_PERIODIC_REMOVE
Back to the Negotiator: Group Accounting
Fair Sharing Between Groups
• Useful when:
  • multiple user ids belong to same group
  • group’s share of pool is not tied to specific machines
# Example group settings
GROUP_NAMES = group_physics, group_chemistry
GROUP_QUOTA_group_physics = 200
GROUP_QUOTA_group_chemistry = 100
GROUP_AUTOREGROUP = True
GROUP_PRIO_FACTOR_group_physics = 10
GROUP_PRIO_FACTOR_group_chemistry = 10
DEFAULT_PRIO_FACTOR = 100
Setting Group Identity
• The job advertises its own group identity:
+AccountingGroup = "group_physics.dan"
(group name: group_physics; group user: dan)
• Anyone can declare any identity.
• This is not the unix/windows identity the job runs as.
• It is solely for accounting and prioritization purposes.
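The structure of the AccountingGroup value can be shown with a short Python sketch. It assumes the group name itself contains no dot, so the first dot separates group name from group user; the function name is hypothetical.

```python
def split_accounting_group(accounting_group):
    """Split "group_physics.dan" into ("group_physics", "dan") at the first dot.

    Assumption: the group name contains no dot of its own.
    """
    group_name, _, group_user = accounting_group.partition(".")
    return group_name, group_user

print(split_accounting_group("group_physics.dan"))  # ('group_physics', 'dan')
```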
Monitoring Usage
% condor_userprio -usage -allusers
Last Priority Update: 7/30 09:59
                                  Accumulated      Usage             Last
User Name                         Usage (hrs)    Start Time       Usage Time
------------------------------ ----------- ---------------- ----------------
…
group_physics.atlas@hep.wisc.edu    599739.09  4/18/2006 14:37  7/30/2007 07:24
group_physics.cms@hep.wisc.edu      799300.91  4/03/2006 12:56  7/30/2007 09:59
group_chemistry.han@che.wisc.edu   1029384.68  4/03/2006 12:56  7/30/2007 09:59
group_chemistry.ben@che.wisc.edu   2013058.70  4/03/2006 16:54  7/30/2007 09:59
------------------------------ ----------- ---------------- ----------------
Number of users: 271               8517482.95  4/03/2006 12:56  7/29/2007 10:00
% condor_userprio -all -allusers
How do groups compete?
› Group using least share of its quota gets top priority in matchmaking.
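A minimal Python sketch of this ordering, assuming "share of its quota" means slots in use divided by quota. The quotas are the ones from the example group settings; the usage figures and the function name order_groups are hypothetical.

```python
def order_groups(quotas, usage):
    """Return group names, least fraction of quota used first."""
    return sorted(quotas, key=lambda g: usage.get(g, 0) / quotas[g])

quotas = {"group_physics": 200, "group_chemistry": 100}  # from the example settings
usage = {"group_physics": 150, "group_chemistry": 50}    # hypothetical slots in use

# physics uses 75% of its quota, chemistry 50%, so chemistry is served first.
print(order_groups(quotas, usage))  # ['group_chemistry', 'group_physics']
```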
How do users within a group compete?
› Each group user has its own user priority
› Fair share between group members determined by the usual user priority mechanism
May Group Exceed its Quota?
› Yes, but only if
GROUP_AUTOREGROUP = True
OR, if undefined
GROUP_AUTOREGROUP_<groupname> = True
When Exceeding Quota, How do Users Compete?
› All non-group users plus group users trying to exceed their quota compete for remaining machines.
› The user priority of the group user (e.g. “group_physics.dan”) is used to determine fair share.
  • Can set default priority factor for all members of group:
    GROUP_PRIO_FACTOR_<groupname> = 10
The End of the Story
The Life of a Condor Job
[Diagram. Labels: condor_submit; schedd (job queue); job ClassAd; machine ClassAd; central manager (collector + negotiator); central manager 2; central manager 3 (collector + negotiator); flocking; startd (job executor); job runs]
Extending the Reach
› FLOCK_TO = <remote collector>
  • requires bi-directional connectivity
  • in Linux, can use GCB to connect private networks
› Grid Universe: Globus, Condor-C
  condor_glidein
  JobRouter
Trivia
› What’s the difference?
IsHighPrioUser = Owner == "dan"
1. RANK = IsHighPrioUser
2. RANK = $(IsHighPrioUser)
› case 1 needs:
  STARTD_ATTRS = IsHighPrioUser
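The difference comes down to when the name is resolved. The Python sketch below imitates textual $(NAME) substitution at configuration time; expand_macros is a hypothetical helper, not Condor's configuration parser.

```python
import re

def expand_macros(value, config):
    """Textually replace each $(NAME) with its configured value, as config macros do."""
    return re.sub(
        r"\$\((\w+)\)",
        lambda m: expand_macros(config[m.group(1)], config),
        value,
    )

config = {"IsHighPrioUser": 'Owner == "dan"'}

# Case 1: no macro, so RANK keeps a bare attribute name.
print(expand_macros("IsHighPrioUser", config))     # IsHighPrioUser
# Case 2: the macro is expanded before the expression is ever evaluated.
print(expand_macros("$(IsHighPrioUser)", config))  # Owner == "dan"
```

In case 2 the startd only ever sees the expanded expression. In case 1 the expression still refers to IsHighPrioUser by name, so that attribute has to be published in the machine ClassAd, which is what the STARTD_ATTRS line does.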
Where to Find the Online How-to Collection
1. Go to http://www.cs.wisc.edu/condor/
2. Click on “Condor Admin How-to Recipes”
Currently, that takes you here:
http://nmi.cs.wisc.edu/node/1465