8/14/2019 Dynamic Condor Universe
Exploring Virtual Workspace Concepts in a Dynamic Universe for Condor
Quinn Lewis
ABSTRACT
Virtualization offers a cost-effective and flexible way to use and manage
computing resources. Such an abstraction is appealing in grid computing for better
matching jobs (applications) to computational resources. This paper applies the virtual
workspace concept introduced in the Globus Toolkit to the Condor workload
management system. It allows existing computing resources to be dynamically
provisioned at run-time by users based on application requirements instead of statically at
design-time.
INTRODUCTION
A common goal of computer systems is to minimize cost while maximizing other
criteria, such as performance, reliability, and scalability, to achieve the objectives of the
user(s). In Grid computing, a scalable way to harness large amounts of computing power
across various organizations is to amass several relatively inexpensive computing
resources together. Coordinating these distributed and heterogeneous computing
resources for the purposes of perhaps several users can be difficult. In such an
environment, resource consumers have several varying, specific, and demanding
requirements and preferences for how they would like their applications and services to
leverage the resources made available by resource providers. Resource providers must
ensure the resources meet a certain quality of service (e.g. make resources securely and
consistently available to several concurrent users).
In the past, control over the availability, quantity, and software configurations of
resources has been limited to the resource provider. With virtualization, it becomes
possible for resource providers to offer up more control of the resources to a user without
sacrificing quality of service to other resource consumers. Users (resource consumers)
can more easily create execution environments that meet the needs of their applications
and jobs within the policies defined by the resource providers. Such a relationship,
enabled by virtualization, is both cost-effective and flexible for the resource provider and
consumer. [1]
The virtual workspace term, initially coined in [2] for use with the Globus
Toolkit, "is an abstraction of an execution environment that can be made dynamically
available to authorized clients by using well-defined protocols". This execution
environment can encompass several physical resources. Generically, this concept could
be implemented in various ways; however, virtualization has proven itself to be a
practicable implementation. [3]
Condor, "is a specialized workload management system for compute-intensive
jobs" [4]. Condor currently abstracts the resources of a single physical machine into
virtual machines which can run multiple jobs at the same time [5]. A "universe" is used
to statically describe the execution environment in which the jobs are expected to run.
This approach assumes that all resources (whether real or virtual) are allocated in
advance. While there is support for adding more resources to an existing pool via the
Glide-in mechanism, the user still has to dedicate the use of these other physical
resources.
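For reference, the universe is declared statically in the job's submit description file. A minimal, hypothetical example follows; the executable name and requirement values are illustrative, not taken from the paper's experiments:

```
# Hypothetical submit description; executable and requirements are illustrative.
universe     = vanilla
executable   = meme
requirements = (OpSys == "LINUX") && (Memory >= 128)
queue
```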
The purpose of this paper is to describe how a Condor execution environment
(universe) can be dynamically created at run-time by users to more flexibly and cost-
effectively use and manage existing resources using virtualization. Two of the unique
implementation details described in this paper are the use of Microsoft Windows and
Microsoft Virtual Server 2005 R2 for the virtual machine manager (VMM) on the host
operating system (instead of a Linux-based host using Xen or VMware) and the use of
differencing virtual hard disks. More details about virtual workspaces and similar
attempts to virtualize Condor are described in Related Work. The implementation details
of the work performed for a dynamic Condor universe are provided along with
performance tests results. Future enhancements are included for making this work-in-
progress more robust.
RELATED WORK
While virtualization has a number of applications for business computing and
software development and testing, the work outlined in this paper most directly applies to
technical computing, including Grid computing, clusters, and resource-scavenging
systems.
Grid Computing
The use of virtualization in Grid computing has been proposed before, touting the
benefits of legacy application support, improved security, and the ability to deploy
computation independently of site administration. The challenges of dynamically
creating and managing virtual machines are also described [6]. The virtual workspace
concept [7] extended [6] to present "a unified abstraction" and address additional issues
associated with the complexities of managing such an environment in the Grid. Two key
differences between the Grid-related work mentioned and this paper is the emphasis on
dynamically creating the execution environment at run-time and the (Microsoft)
virtualization software employed.
As mentioned previously, the Condor Glide-in mechanism works in conjunction
with the Globus Toolkit to temporarily make Globus resources available to a user's
Condor pool. This has the advantage of being able to submit Condor jobs using Condor
capabilities (matchmaking and scheduling) on Globus managed resources [8]. However,
it is expected that the user acquire these remote resources before the jobs are executed.
Using virtualization allows the existing local Condor resources to be leveraged as the
jobs require.
Clusters
Many of the motivations behind this work have also been applied to
clusters [9, 10], though that work focuses more on dynamically provisioning homogeneous
execution environments on resources. Although perhaps accommodated in the design of the
Cluster-on-Demand [9], virtualization technology is not used in the implementation of the
system. The resources are assumed to physically exist and the software is deployed by
re-imaging the machine. In [10], virtualization is used to provision the software on the
cluster(s), but the time required to stage in the virtual image(s) is costly. The
differencing virtual hard disk image type used in this work mitigates
this problem [11].
Condor
Additional work with virtualization and Condor focuses on exploiting Condor's
cycle-stealing capability at the University of Nebraska-Lincoln to transform typical
Windows campus machines into Unix-based machines required by researchers [12]. The
solution leveraged coLinux to run a Condor compute node through a Windows device
driver [13]. While some of the same motivation exists for this work, using a
virtualization technology such as Virtual Server 2005 R2 allows other operating systems
and versions to be used and provides more flexible ways to programmatically control the
dynamic environment.
IMPLEMENTATION
We leverage Condor's existing ability to schedule jobs, advertise resource
availability, and match jobs to resources and introduce a flexible extension for
dynamically describing, deploying, and using virtual execution resources in the Condor
universe.
In Condor, one or more machines (resources) along with jobs (resource requests)
are part of a collection, known as a pool. The resources in the pool have one or more of
the following roles: Central Manager, Execute, and/or Submit. The Central Manager
collects information and negotiates how jobs are matched to available resources. Submit
resources allow jobs to be submitted to the Condor pool through a description of the job
and its requirements. Execute resources run jobs submitted by users after having been
matched and negotiated by the Central Manager. [14]
We extend the responsibilities of each of these three different roles to incorporate
virtualization into Condor. Each Execute resource describes the extent to which it can be
virtualized (to the Central Manager) and is responsible for hosting additional (virtual)
resources. The Submit resource(s) takes a workflow of jobs and requirements and
initiates the deployment of the virtual resources plus signals its usage (start/stop) to the
host/execute machine. The Central Manager is responsible for storing virtual machine
metadata used for scheduling. For this implementation, a single machine is used for the
Central Manager, Submit, and Execute roles.
The virtualization capabilities for a particular Execute resource can be published
to the Central Manager via authorized use of condor_advertise. Attributes about the
virtual Execute resources, such as the operating system (and version), available memory
and disk space, and more specific data about the status of the virtual machine are
included. Currently, the host Execute resource invokes condor_advertise for each
guest (virtual) Execute resource it anticipates hosting at start-up. This approach
allows virtual resources to appear almost indistinguishable from real physical resources
and to be included in Condor's resource scheduling. Note that at this point the real
resources are running while the virtual resources are not; they have only been described.
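As a sketch, the advertisement step might look like the following Python fragment. The helper and the IsVirtualResource attribute are illustrative assumptions rather than the attributes used by the actual implementation, and the condor_advertise invocation naturally requires a Condor installation:

```python
import subprocess

def format_startd_ad(name, opsys, memory_mb, disk_kb):
    """Build a minimal ClassAd describing an anticipated virtual Execute
    resource. Attribute names follow common startd ads; IsVirtualResource
    is a hypothetical custom attribute."""
    lines = [
        'MyType = "Machine"',
        f'Name = "{name}"',
        f'OpSys = "{opsys}"',
        f'Memory = {memory_mb}',
        f'Disk = {disk_kb}',
        'IsVirtualResource = True',  # assumption: marks a described-but-not-running guest
    ]
    return "\n".join(lines)

def advertise(ad_text):
    # Push the ad to the Central Manager (requires Condor on the host).
    subprocess.run(["condor_advertise", "UPDATE_STARTD_AD", "-"],
                   input=ad_text, text=True, check=True)
```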
Using the standard Condor tools, such as condor_status, users can view the
resources (real and virtual) available in the pool. Users can then create workflows (using
Windows Workflow Foundation [16]) for one or more jobs intended to run on the
provided resources. Since the virtual resource(s) may not be running when a job is
submitted, the initial scheduling will fail. Fortunately, Condor provides a SOAP-based
API for submitting and querying jobs [15]. Using this API from a workflow, an
unsuccessful job submission can be checked against the attributes of the advertised
machine to determine whether the intended resource is a virtual machine and, if so,
whether it needs to be deployed and/or started.
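The decision logic applied after a failed submission can be sketched as follows; the attribute names (IsVirtualResource, ImageDeployed, VMState) are hypothetical stand-ins for whatever the advertised ads actually carry:

```python
def plan_for_failed_submission(machine_ad):
    """Decide what the workflow should do for the machine a failed job
    submission was targeting. Attribute names are illustrative."""
    actions = []
    if not machine_ad.get("IsVirtualResource", False):
        return actions  # a physical machine: nothing for the workflow to deploy
    if not machine_ad.get("ImageDeployed", False):
        actions.append("deploy")  # stage the virtual machine files to the host
    if machine_ad.get("VMState", "Off") != "Running":
        actions.append("start")   # power the guest on so it can accept the job
    return actions
```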
The user can indicate specific job requirements in the workflow. These
requirements can optionally specify the location of the files required to run the virtual
machine for consumer flexibility (assuming the provider has allowed it). These files
provide the operating system and necessary configuration (including Condor) for
executing the job. The workflow is invoked by the Submit machine. If the virtual
resource is specified by the workflow, the workflow manager on the Submit machine
either transfers the virtual machine files to the Execute resource or provides the Execute
resource with the location and protocol for retrieving the virtual machine files. (The
automatic copying of virtual images was not completely implemented for this paper.) For
performance, it is expected that host Execute machines have base virtual images local to
the resource that provide the operating system and Condor. Additional software and
configuration can be added in a separate file, called a differencing virtual disk, that
stores only the blocks modified from a parent hard disk (file). This provides a flexible
balance, allowing resource providers to provide base images and giving resource
consumers the ability to extend the base images.
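The idea behind differencing disks can be illustrated with a simplified model (fixed 512-byte sectors, everything in memory); the real VHD format [11] additionally maintains a sector bitmap and block allocation table:

```python
SECTOR = 512  # sector size in bytes (matches the VHD sector granularity)

def diff_sectors(parent, child):
    """Return {sector_index: bytes} for sectors where the child image
    differs from its parent -- a simplified model of how a differencing
    disk stores only modified blocks."""
    assert len(parent) == len(child)
    delta = {}
    for i in range(0, len(parent), SECTOR):
        if child[i:i + SECTOR] != parent[i:i + SECTOR]:
            delta[i // SECTOR] = child[i:i + SECTOR]
    return delta

def reconstruct(parent, delta):
    """Overlay the stored sectors on the parent to recover the child image."""
    data = bytearray(parent)
    for sector, payload in delta.items():
        data[sector * SECTOR:(sector + 1) * SECTOR] = payload
    return bytes(data)
```

Because an unmodified base image yields an empty delta, only the consumer's additions need to be stored and transferred, which is the property exploited in the performance tests below.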
The workflow, running on the Submit machine, also provides the logic for starting
the virtual resource on the host. Microsoft Virtual Server R2 provides an API for
managing local and remote virtual machines. The workflow leverages this API for
starting the virtual resources. For this paper, it is assumed that virtual resources are started
from a cold state. The result is that startup times are as long as a normal boot time for
the respective operating system.
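The start-up step reduces to a small state check. The state and action names below are illustrative rather than the Virtual Server API's actual enumerations; in this implementation only the cold path is exercised, which is why boot time dominates:

```python
def startup_action(vm_state):
    """Choose the management-API action for a guest in the given state.
    Names are illustrative; this paper's runs always take the "Off" path."""
    if vm_state == "Off":
        return "startup"   # cold boot: cost is a full operating-system boot
    if vm_state in ("Saved", "Paused"):
        return "resume"    # hot/paused start, noted as future work
    return "none"          # already running
```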
PERFORMANCE TESTS AND MEASUREMENTS
To test performance, a machine with a 2 GHz AMD Athlon 64 processor and 1 GB of RAM
running Windows XP served in the Central Manager, Execute, and Submit roles. Two
virtual Execute machines, running Debian Linux 3.1 and Windows 2000, each with 128
MB RAM were created. A virtual network was created to allow communication between
the three different operating systems, each running Condor.
The MEME [17] bioinformatics application was used as the test job. Initially, a
MEME job was submitted to the Condor pool using the standard Condor command-line
tools (e.g. condor_submit). The test input and configuration options were used, resulting
in a combined submission, execution, and result-retrieval time of less than one minute.
Using Windows Workflow Foundation and Visual Studio, a graphical workflow
was constructed that submitted the same MEME job to the cluster, specifically requesting
a Windows 2000 or Linux resource. The same test input and configuration options took 6
to 8 minutes on average. Since the virtual machines are programmatically started only
after an initial job-scheduling attempt fails, and currently start from a cold state, these
times include the setup and the time for the operating system to boot. There
is also an unresolved issue with the (5 minute) cycle time between scheduling when using
the Condor SOAP API [18].
Additionally, the Windows 2000 virtual machine was created as a base image
(932 MB) with a differencing virtual disk that included Condor and other support
software (684 MB). Since the differencing disks use a sector bitmap to indicate which
sectors are within the current disk (1s) or on the parent (0s), the specification [11]
suggests it may be possible to achieve performance improvements. It also lent itself well
to compression. The 684 MB difference disk was compressed to 116 MB (using standard
ZIP compression). This file could be transferred over a standard broadband Internet
connection in 3.7 minutes (at 511.88 KB/s), as opposed to roughly 30 minutes for the uncompressed disk.
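The transfer-time figure can be reproduced with simple arithmetic, assuming the reported rate is in kilobytes per second and decimal units (1 MB = 1000 KB):

```python
def transfer_minutes(size_mb, rate_kb_per_s):
    """Estimated time to move an image of size_mb megabytes over a link
    sustaining rate_kb_per_s kilobytes per second."""
    return size_mb * 1000 / rate_kb_per_s / 60

# 116 MB compressed difference disk at the measured rate: ~3.8 minutes
print(round(transfer_minutes(116, 511.88), 1))
```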
CONCLUSION AND FUTURE WORK
A number of additional modifications are required for this solution to become
more robust. For example, security was not considered. Also, the current times for
executing short running jobs are not acceptable. Another improvement would be to start
the virtual machines from a hot or paused state. Since the virtual machines used in this
exercise obtained their addresses via DHCP, they would either need static IPs or the
system would need additional knowledge of when the virtual machines are un-paused. The virtual hard
disk(s) may be further compressed using a specific compression algorithm that takes the
disk format into account. Performance considerations could also be given to differencing
hard disks that are chained together for application extensibility purposes.
This paper describes a mechanism for extending Condor to take advantage of
virtualization to more flexibly (and cost-effectively) create an execution environment at
run-time that balances the interests of the resource providers and consumers.
REFERENCES
1. Keahey, K., Foster, I., Freeman, T., Zhang, X. Virtual Workspaces: Achieving Quality of Service and Quality of Life in the Grid. CCGRID 2006, Singapore, May 2006.
2. Keahey, K., Foster, I., Freeman, T., Zhang, X., Galron, D. Virtual Workspaces in the Grid. Europar 2005, Lisbon, Portugal, September 2005.
3. http://workspace.globus.org/vm
4. http://www.cs.wisc.edu/condor/description.html
5. http://www.bo.infn.it/alice/alice-doc/mll-doc/condor/node4.html
6. Figueiredo, R., Dinda, P., Fortes, J. A Case for Grid Computing on Virtual Machines.
7. Keahey, K., Ripeanu, M., Doering, K. Dynamic Creation and Management of Runtime Environments in the Grid.
8. http://www.cs.wisc.edu/condor/CondorWeek2005/presentations/user_tutorial.ppt
9. Chase, J., Irwin, D., Grit, L., Moore, J., Sprenkle, S. Dynamic Virtual Clusters in a Grid Site Manager.
10. Zhang, X., Keahey, K., Foster, I., Freeman, T. Virtual Cluster Workspaces for Grid Applications.
11. Virtual Hard Disk Image Format Specification. Version 1.0, October 11, 2006. Microsoft.
12. Sumanth, J. Running Condor in a Virtual Environment with coLinux. http://www.cs.wisc.edu/condor/CondorWeek2006/presentations/sumanth_condor_colinux.ppt
13. Santosa, M., Schaefer, A. Build a heterogeneous cluster with coLinux and openMosix. http://www-128.ibm.com/developerworks/linux/library/l-colinux/index.html
14. Condor Version 6.9.2 Manual. http://www.cs.wisc.edu/condor/manual/v6.9/
15. http://www.cs.wisc.edu/condor/birdbath/
16. http://wf.netfx3.com/content/WFIntro.aspx
17. MEME. http://meme.sdsc.edu
18. https://lists.cs.wisc.edu/archive/condor-users/2006-May/msg00296.shtml