Post on 16-Dec-2015
Setting up of condor scheduler on computing cluster
Raman Sehgal, NPD-BARC
Outline
•Software required for the cluster
•Network topology
•CONDOR and the different roles of machines in a Condor pool
•Various Condor environments
•Prerequisites of Condor
•Configuration of Condor on our LAN
•Running jobs using Condor and some commonly used Condor commands
•MPI
•Conclusion
Software required for the cluster
•The cluster requires the following software:
•Operating System : Scientific Linux CERN 5.4 – 64-bit version
•Cluster Management : Done through IPMI (inbuilt)
•Cluster Usage and Statistics : Using “Ganglia”
•Cluster Middleware : CONDOR
•Parallel Programming Environment : MPI
Network Topology
•One head node containing all users' home directories
•16 worker nodes that provide the computational power
•The head node is connected to both the public and the private network
Public Network : Allows users to log in to the head node
Private Network : Connects all worker nodes to the head node over the Gigabit and InfiniBand networks; used for job submission and execution
•File System : Network File System (NFS) to provide a shared area between the head node and the worker nodes
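The shared NFS area can be set up with a plain export on the head node and a mount on each worker; a minimal sketch, assuming a placeholder private subnet 192.168.1.0/24 and a head node resolving as headnode (neither taken from the original setup):

```
# /etc/exports on the head node (placeholder subnet):
/home 192.168.1.0/24(rw,sync,no_root_squash)

# /etc/fstab entry on each worker node (placeholder hostname):
headnode:/home  /home  nfs  defaults  0 0
```

After editing /etc/exports, the export table is reloaded with exportfs -ra on the head node.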
Prototype distributed and parallel computing environment
•A prototype distributed and parallel computing environment for the cluster is set up on a LAN of 4 computers.
•Distributed computing environment : Using CONDOR
•Parallel computing environment : Using MPI
CONDOR
•Condor is an open-source high-throughput computing software package for the distributed parallelization of computationally intensive tasks.
•Used to manage the workload on a cluster of computing nodes.
•It can integrate both dedicated resources (rack-mounted clusters) and non-dedicated desktop machines by making use of cycle scavenging.
•Can run both sequential and parallel jobs.
•Provides different universes to run jobs (vanilla, standard, MPI, Java, etc.)
Condor's exceptional features:
•Checkpointing and migration
•Remote system calls
•No changes are necessary to user source code
•Sensitive to the desires of the machine owner (in the case of non-dedicated machines)
Different roles of a machine in a Condor pool
•Central Manager : The main administration machine.
•Execute : The machines where jobs execute.
•Submit : The machines used to submit jobs.
Various Condor daemons
The following Condor daemons run on the different machines in the Condor pool:
•condor_master : Takes care of the rest of the daemons running on a machine.
•condor_collector : Responsible for collecting information about the status of the pool.
•condor_negotiator : Responsible for all the match-making within the Condor system.
•condor_schedd : Represents resource requests to the Condor pool.
•condor_startd : Represents a given resource to the Condor pool.
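Which of these daemons runs on a machine is controlled by its local Condor configuration; a hedged sketch of the per-role DAEMON_LIST setting (the role split mirrors the head-node/worker-node assignment in these slides, the exact file location depends on the installation):

```
# condor_config.local on the central manager, which is also the submit machine:
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD

# condor_config.local on an execute-only worker node:
DAEMON_LIST = MASTER, STARTD
```

The condor_master then starts and supervises exactly the daemons listed for that machine.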
Various Condor environments to run different types of jobs
Condor provides several universes to run different types of jobs; some are as follows:
Standard : This universe provides Condor's full power; it offers the following features:
1. Checkpointing 2. Migration 3. Remote system calls
•The job needs to be relinked with the Condor libraries in order to run in the standard universe.
•This is easily achieved by putting condor_compile in front of the usual link command.
Normal linking of a job : gcc -o my_prog my_prog.c
For the standard universe : condor_compile gcc -o my_prog my_prog.c
Now this job can utilize the power of the standard universe.
Vanilla : This universe is intended for programs which cannot be successfully relinked with the Condor libraries.
1. Shell scripts are one example of jobs where the vanilla universe is useful.
2. Jobs that run under the vanilla universe cannot utilize checkpointing or remote system calls.
3. Since the remote system call feature is not available, we need a shared file system such as NFS or AFS.
Parallel : This universe allows parallel programs, such as MPI jobs, to be run in the Condor environment.
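A parallel-universe job is described with a submit file much like any other; a hedged sketch (the executable name and machine count are illustrative placeholders, not values from the original setup):

```
# Sample submit description file for the parallel universe
Universe      = parallel
Executable    = my_mpi_prog
Machine_count = 4
Output        = my_mpi_prog.out
Error         = my_mpi_prog.err
Log           = my_mpi_prog.log
Queue
```

Machine_count tells Condor how many machines to claim before the parallel job starts.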
Prerequisites of Condor configuration
•Setup of a private network of machines in the computing pool
•Passwordless login from the submit machines to all execute machines (rsh or ssh)
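The passwordless-login prerequisite amounts to installing a passphrase-less public key in each execute machine's authorized_keys. A self-contained sketch using a scratch directory; on the real pool the public key is appended on each worker instead, e.g. with ssh-copy-id user@worker:

```shell
# Generate a passphrase-less RSA key pair in a scratch directory.
DIR=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$DIR/id_rsa"

# On a real pool this append happens on each execute machine
# (ssh-copy-id does it for you); here it is done locally so the
# sketch is self-contained.
cat "$DIR/id_rsa.pub" >> "$DIR/authorized_keys"
chmod 600 "$DIR/authorized_keys"
echo "key pair ready in $DIR"
```

With the key installed, ssh worker hostname should log in without prompting, which is what Condor (and MPI launchers) rely on.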
Configuration of Condor on our small LAN of 4 computers
•On our LAN of four machines we have one head node; the remaining 3 are worker nodes.
•Condor is installed and configured on our pool; the role of each machine is given below:
1. Head Node : Central Manager, Submit
2. Worker Node : Execute
•The home directories of all users reside on the head node.
•These home directories reside in a shared area (using NFS) which can be accessed by all the worker nodes (required for the vanilla universe).
•Users can now submit jobs from their home directories.
Running jobs using Condor
Following are the steps to run a Condor job:
•Prepare the code
•Choose the Condor universe
•Make the submit description file (submit.ip); a sample file is shown below:
# # # # # # # # # # # # # # # # # # # # # # # #
# Sample Submit Description file
# # # # # # # # # # # # # # # # # # # # # # # #
Executable = getIp
Universe   = standard
Output     = getIp.out
Error      = getIp.err
Log        = hello.log
Queue 15
•Submit the job with the following Condor command:
condor_submit submit.ip
Commonly used Condor commands
condor_submit : used to submit a job
condor_q : displays information about jobs in the Condor job queue
condor_status : used to monitor, query and display the status of the Condor pool
condor_history : lets users view the log of Condor jobs completed to date
condor_rm : removes one or more jobs from the Condor job queue
condor_compile : used to relink a job with the Condor libraries so that it can be executed in the standard universe
MPI
•MPI is a language-independent communication protocol used for parallel programming.
•Different languages provide wrapper compilers for MPI programming.
•Here we have installed MPICH, which allows us to do parallel programming in C, C++, Fortran, etc.
•Computation vs. communication.
•SISD, SIMD, MISD, MIMD (Flynn's classification)
•MPI requires the executable to be present on all the machines in the pool.
•This is achieved via the NFS shared area.
•Testing is done with a matrix-multiplication program.
•Considerable reduction in execution time is observed.
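The MPICH workflow mentioned above is a compile-then-launch pair; a hedged sketch (matmul.c, the hostfile name and the rank count are placeholders, and the block only prints a note on machines where MPICH is not installed):

```shell
# Compile and launch the matrix-multiplication test with MPICH,
# guarded so the sketch also runs where MPICH is absent.
if command -v mpicc >/dev/null 2>&1 && [ -f matmul.c ]; then
    mpicc -o matmul matmul.c        # wrapper compiler adds the MPI flags
    mpirun -np 4 -f hosts ./matmul  # 4 ranks on the machines listed in 'hosts'
else
    echo "MPICH not available here; commands shown for illustration"
fi
```

Because the executable sits in the NFS-shared home area, every rank launched on the worker nodes finds the same ./matmul path.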
Conclusion:
CONDOR is installed and configured on a small LAN of 4 computers; it is working properly and giving the expected results.
Later this prototype setup will be replicated on a computing cluster with 16 worker nodes, providing a processing power of 1.3 TFlops plus a storage of 20 TBytes.
The setup is also ready to run parallel jobs, so if we have a parallel job application in the future, we are ready for it.