Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
Condor-G: A Case in Distributed Job Delegation
www.cs.wisc.edu/condor
Job Delegation
› Transfer of responsibility to schedule and execute a job
› Multiple delegations can form a chain
Job Delegation in Condor-G Today
Delegation chain: Condor-G -> Globus GRAM -> Batch System Front-end -> Execute Machine
Expanding the Model
› What can we do with new forms of job delegation?
› Some ideas
  • Mirroring
  • Load-balancing
  • Glide-in schedd
  • Multi-hop grid scheduling
Mirroring
› What it does
  • Jobs mirrored on two Condor-Gs
  • If primary Condor-G crashes, secondary one starts running jobs
  • On recovery, primary Condor-G gets job status from secondary one
› Removes Condor-G submit point as single point of failure
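The mirroring behavior above could be sketched roughly as follows. This is a minimal illustration, not Condor-G code: the `MirroredSubmitPoint` class, the function names, and the status strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MirroredSubmitPoint:
    """Hypothetical stand-in for one Condor-G in a mirrored pair."""
    name: str
    alive: bool = True
    job_status: dict = field(default_factory=dict)  # job id -> status string


def submit(job_id, primary, secondary):
    # Mirror the job on both submit points; only the primary runs it.
    primary.job_status[job_id] = "running"
    secondary.job_status[job_id] = "mirrored"


def fail_over(primary, secondary):
    # The primary has crashed: the secondary starts running the mirrored jobs.
    primary.alive = False
    for job_id, status in secondary.job_status.items():
        if status == "mirrored":
            secondary.job_status[job_id] = "running"


def recover(primary, secondary):
    # On recovery, the primary takes its job status from the secondary.
    primary.alive = True
    primary.job_status = dict(secondary.job_status)
```

In this sketch the caller decides when the primary has crashed; how a real secondary would detect that is not covered by the slides.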
Load-Balancing
› What it does
  • Front-end Condor-G distributes all jobs among several back-end Condor-Gs
  • Front-end Condor-G keeps updated job status
› Improves scalability
› Maintains single submit point for users
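One way to picture the front-end's role is the sketch below. The slides do not say how jobs are distributed, so the round-robin policy is an assumption, and the `FrontEnd` class and its method names are hypothetical.

```python
from itertools import cycle


class FrontEnd:
    """Hypothetical front-end that delegates jobs to back-ends round-robin."""

    def __init__(self, backend_names):
        self._next_backend = cycle(backend_names)  # assumed policy: round-robin
        self.assignment = {}  # job id -> back-end name
        self.status = {}      # job id -> last status known to the front-end

    def submit(self, job_id):
        # Users submit here only; the front-end picks the back-end.
        backend = next(self._next_backend)
        self.assignment[job_id] = backend
        self.status[job_id] = "delegated"
        return backend

    def report(self, job_id, status):
        # Called when a back-end reports a status change for a job.
        self.status[job_id] = status
```

Users interact only with the `FrontEnd` object, which mirrors the single submit point described above.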
Load-Balancing Example
Condor-G Front-end
  Condor-G Back-end 1
  Condor-G Back-end 2
  Condor-G Back-end 3
Glide-In Schedd
› What it does
  • Drop a Condor-G onto the front-end machine of a cluster
  • Delegate jobs to the cluster through the glide-in schedd
› Apply cluster-specific policies to jobs
Multi-Hop Grid Scheduling
› Match a job to a Virtual Organization (VO), then to a resource within that VO
› Easier to schedule jobs across multiple VOs and grids
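The two-stage match can be sketched as below. The matching rules (VO membership for the first hop, free CPUs for the second) and all field names are assumptions chosen for illustration; they are not taken from any real resource broker.

```python
def match_vo(job, vos):
    """First hop: pick a VO that admits the job's owner (hypothetical rule)."""
    for vo_name, vo in vos.items():
        if job["owner"] in vo["members"]:
            return vo_name
    return None


def match_resource(job, resources):
    """Second hop: pick a resource in the VO with enough free CPUs."""
    for name, free_cpus in resources.items():
        if free_cpus >= job["cpus"]:
            return name
    return None


def multi_hop_match(job, vos):
    # The first hop knows only about VOs; resource details stay inside the VO.
    vo_name = match_vo(job, vos)
    if vo_name is None:
        return None
    resource = match_resource(job, vos[vo_name]["resources"])
    return (vo_name, resource) if resource is not None else None
```

This sketch gives up if the chosen VO has no suitable resource; a fuller version might go back and try another VO.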
Multi-Hop Grid Scheduling Example
Experiment Condor-G
Experiment Resource Broker
VO Condor-G
VO Resource Broker
Globus GRAM
Batch Scheduler
Endless Possibilities
› These new models can be combined with each other or with other new models
› Resulting system can be arbitrarily sophisticated
Job Delegation Challenges
› New complexity introduces new issues and exacerbates existing ones
› A few…
  • Transparency
  • Representation
  • Scheduling Control
  • Active Job Control
  • Revocation
  • Error Handling and Debugging
Transparency
› Full information about job should be available to user
  • Information from full delegation path
  • No manual tracing across multiple machines
› Users need to know what’s happening with their jobs
Representation
› Job state is a vector
› How best to show this to the user?
  • Summary
    • Current delegation endpoint
    • Job state at endpoint
  • Full information available if desired
    • Series of nested ClassAds?
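A rough sketch of the summary versus full view follows. Plain Python dicts stand in for nested ClassAds here; the field names `system`, `state`, and `delegated` are hypothetical and this is not ClassAd syntax.

```python
def summarize(job_ad):
    """Summary view: follow the nesting to the endpoint and report its state."""
    hop = job_ad
    while hop.get("delegated") is not None:
        hop = hop["delegated"]
    return {"endpoint": hop["system"], "state": hop["state"]}


def state_vector(job_ad):
    """Full view: one (system, state) pair per hop, in delegation order."""
    vector = []
    hop = job_ad
    while hop is not None:
        vector.append((hop["system"], hop["state"]))
        hop = hop.get("delegated")
    return vector


# Example record: each hop nests the record of the hop it delegated to.
job_ad = {
    "system": "Condor-G", "state": "delegated",
    "delegated": {
        "system": "Globus GRAM", "state": "delegated",
        "delegated": {"system": "Batch System Front-end", "state": "idle"},
    },
}
```

`summarize(job_ad)` gives the short answer most users want, while `state_vector(job_ad)` exposes every hop for those who ask for it.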
Scheduling Control
› Avoid loops in delegation path
› Give user control of scheduling
  • Allow limiting of delegation path length?
  • Allow user to specify part or all of delegation path
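A check for loops and path length might look like the sketch below. The function name and the `max_length` parameter are hypothetical, and the path is assumed to be a list of system names that have already held the job.

```python
def may_delegate(path, next_hop, max_length=None):
    """Return True if adding next_hop keeps the path loop-free and within limit.

    path: systems that have already held the job, in order.
    max_length: optional user-set cap on the number of systems in the path.
    """
    if next_hop in path:
        return False  # would revisit a system already in the path (a loop)
    if max_length is not None and len(path) + 1 > max_length:
        return False  # would exceed the user's limit on path length
    return True
```

Each system would run such a check before delegating further, which assumes the path so far travels with the job.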
Active Job Control
› User may request certain actions
  • hold, suspend, vacate, checkpoint
› Actions cannot be completed synchronously for user
  • Must forward along delegation path
  • User checks completion later
Active Job Control (cont)
› Endpoint systems may not support actions
  • If possible, execute them at furthest point that does support them
› Allow user to apply action in middle of delegation path
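Choosing where to execute an action could be sketched as follows. The function name and the shape of `path` are hypothetical, and which actions a given system supports is invented for the example.

```python
def furthest_supporting_hop(path, action):
    """Return the hop closest to the endpoint that supports the action.

    path: list of (system, supported_actions) pairs, ordered from the
    submit point to the endpoint. Returns None if no hop supports it.
    """
    for system, supported in reversed(path):
        if action in supported:
            return system
    return None


# Example path with made-up capabilities for each hop.
path = [
    ("hop-1", {"hold", "suspend", "vacate", "checkpoint"}),
    ("hop-2", {"hold", "suspend"}),
    ("hop-3", {"hold"}),
]
```

With this example path, a `suspend` request would be carried out at `hop-2`, because the endpoint `hop-3` does not support it.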
Revocation
› Leases
  • Lease must be renewed periodically for delegation to remain valid
  • Allows revocation during long-term failures
› What are good values for lease lifetime and update interval?
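The lease idea can be sketched as below. The `Lease` class is hypothetical, times are passed in explicitly as seconds, and the lifetime used in the usage note is an arbitrary value, since the slides leave good values as an open question.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical lease on a delegated job, held by the delegating system."""
    lifetime: float    # seconds a renewal stays valid (value is an assumption)
    renewed_at: float  # time of the last renewal, in seconds

    def renew(self, now):
        # The delegator renews periodically while it still wants the job run.
        self.renewed_at = now

    def is_valid(self, now):
        # Once the lease lapses, the delegation is treated as revoked.
        return now - self.renewed_at < self.lifetime
```

For example, with `Lease(lifetime=600, renewed_at=0)`, the delegation is still valid at time 300 but has lapsed by time 600 unless `renew` was called in between. If the delegator is unreachable for longer than the lifetime, the job is revoked without any message having to get through.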
Error Handling and Debugging
› Many more places for things to go horribly wrong
› Need clear, simple error semantics
› Logs, logs, logs
  • Have them everywhere
Current Status
› Done
  • Mirroring
› In Progress
  • Condor-G -> Condor-G delegation
    • User must specify hops
  • Glide-in schedd
    • Set up by hand