Ian C. Smith* Introduction to research computing using Condor *Advanced Research Computing University of Liverpool

Post on 16-Dec-2015

Ian C. Smith*

Introduction to research computing using Condor

*Advanced Research Computing

University of Liverpool

Overview

what is Condor and what can it be used for?

typical Condor pool operation

University of Liverpool Condor Pool

support for MATLAB and R applications

some research computing examples

quick introduction to UNIX with a walk-through example

What is Condor?

a specialized system for delivering High Throughput Computing

a harvester of unused computing resources

developed by the Computer Science Dept at the University of Wisconsin in the late 1980s

free and (now) open source software

widely used in academia and increasingly in industry

available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS

Types of Condor application

typically - large numbers of independent calculations (“pleasantly parallel”)

data parallel applications – split large datasets into smaller parts and process them in parallel

biological sequence analysis (e.g. BLAST)

processing of field trial data

optimisation problems

microprocessor design and testing

applications based on Monte Carlo methods

radiotherapy treatment analysis

epidemiological studies
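
As an illustration of what "pleasantly parallel" means for a Monte Carlo application, here is a minimal Python sketch (not taken from the slides). Each job gets its own seed, shares nothing with the others, and returns a small result that is combined at the end. The function names `run_job` and `combine`, the sample counts and the choice of estimating pi are all assumptions made for the example.

```python
import random

def run_job(job_index, samples=10_000):
    """One independent job: count random points inside the unit quarter-circle."""
    rng = random.Random(job_index)  # per-job seed, so jobs share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, samples

def combine(results):
    """Merge the per-job counts into a single estimate of pi."""
    total_hits = sum(h for h, _ in results)
    total_samples = sum(n for _, n in results)
    return 4.0 * total_hits / total_samples

# Each call could run on a different execute host; here they run in a loop.
results = [run_job(i) for i in range(20)]
pi_estimate = combine(results)
print(f"estimate of pi from {len(results)} jobs: {pi_estimate:.3f}")
```

Because no job needs anything from any other job, the loop can be replaced by any number of machines working at the same time.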

A “typical” Condor pool

[Diagram, built up over four slides: a Desktop PC, a Condor Server and a group of Execute hosts]

1. login and upload input data (Desktop PC to Condor Server)

2. jobs (Condor Server to Execute hosts)

3. results (Execute hosts to Condor Server)

4. download results (Condor Server to Desktop PC)

University of Liverpool Condor Pool

contains around 700 classroom PCs running the CSD Managed Windows 7 Service (mostly 64 bit from next year)

most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots per PC (total of 1400 job slots)

single job submission point for Condor jobs provided by powerful UNIX server

jobs continue to run while classroom PCs are unused but ...

if load (or memory use) becomes significant, job will be killed and usually any results will be lost (job will start again from scratch)

tools provided for running large numbers of MATLAB and R jobs

Condor caveats

only suitable for non-interactive applications

no communication between jobs possible

all files needed by application must be present on local disk

shorter jobs more likely to run to completion (10-20 min seems to work best)

long running jobs can be run if save/restore mechanism (checkpointing) is built into them

tricky to begin with but usually worth the initial effort
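
The save/restore idea behind checkpointing can be sketched in a few lines of Python. This is a generic illustration, not the mechanism used on the Liverpool pool: the file name `checkpoint.json`, the function `run_with_checkpoint` and the `stop_after` argument (which simulates a job being killed) are all assumptions made for the example.

```python
import json
import os
import tempfile

def run_with_checkpoint(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress so a restart resumes mid-way.

    stop_after simulates an eviction after that many steps in this run.
    Returns the final sum, or None if the run was interrupted.
    """
    state = {"next_step": 0, "partial_sum": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # restore saved progress

    steps_done = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_done >= stop_after:
            return None  # simulated eviction
        state["partial_sum"] += state["next_step"]
        state["next_step"] += 1
        steps_done += 1
        # Write to a temporary file and rename it, so a kill in the middle
        # of a write cannot leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["partial_sum"]

workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "checkpoint.json")
first = run_with_checkpoint(100, ckpt, stop_after=40)  # interrupted run
second = run_with_checkpoint(100, ckpt)                # resumes at step 40
print(first, second)
```

A real job would checkpoint far less often than once per step; the interval is a trade-off between time spent writing the file and work lost when the job is killed.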

Running MATLAB jobs under Condor

need to create standalone application from M-file(s) using MATLAB compiler

standalone application can run without a MATLAB license

run-time libraries still need to be accessible to MATLAB jobs

nearly all toolbox functions available to standalone applications

simple (but powerful) file input/output makes checkpointing easier

tools available to simplify job submission - see Liverpool Condor website for more information

Running R jobs under Condor

limited support at present

R is installed on-the-fly as part of the job

currently only R version 2.6.2 available with standard packages

tools available to simplify job submission

checkpointing may be possible for long running jobs

Personalised Medicine example

project is a Genome-Wide Association Study

aims to identify genetic predictors of response to anti-epileptic drugs

try to identify regions of the human genome that differ between individuals (referred to as SNPs)

800 patients genotyped at 500 000 SNPs along the entire genome

test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects)

very large data-parallel problem using R – ideal for Condor

divide datasets into small partitions so that individual jobs run for 15-30 minutes

batch of 26 chromosomes (2 600 jobs) required ~ 5 hours wallclock time on Condor but ~ 5 weeks on a single PC
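
To give a feel for the numbers quoted above, the following Python sketch splits 500 000 SNPs evenly across 2 600 jobs and works out the rough speed-up. The even split and the helper `partition` are assumptions made for illustration; the slides do not say how the study actually divided its data (it may, for example, have been partitioned per chromosome).

```python
def partition(n_items, n_jobs):
    """Split range(n_items) into n_jobs contiguous (start, stop) slices
    whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_jobs)
    slices = []
    start = 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)
        slices.append((start, start + size))
        start += size
    return slices

slices = partition(500_000, 2_600)
sizes = [stop - start for start, stop in slices]
print(len(slices), min(sizes), max(sizes))

# Rough speed-up from the quoted wall-clock times: ~5 weeks vs ~5 hours.
speedup = (5 * 7 * 24) / 5
print(f"approximate speed-up: {speedup:.0f}x")
```

Under these assumptions each job handles 192 or 193 SNPs, and the quoted times correspond to a speed-up of roughly 168.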

Radiotherapy example

large 3rd party application code which simulates photon beam radiotherapy treatment using Monte Carlo methods

tried running simulation on 56 cores of high performance computing cluster but no progress after 5 weeks

divided problem into 250 then 5 000 and eventually 50 000 Condor jobs

required ~ 2 600 days of cpu time (equivalent to ~ 3.5 years on dual core PC)

Condor simulation completed in less than one week

average run time was ~ 70 min

only ~ 10 % of compute time wasted due to evictions
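
The figures on this slide can be cross-checked with some simple arithmetic, sketched here in Python. The check assumes that the ~ 70 min average applies to each of the 50 000 jobs and that the ~ 10 % waste is a share of total compute time; neither assumption is stated on the slide, so the result is only an order-of-magnitude consistency check.

```python
MIN_PER_DAY = 24 * 60

jobs = 50_000
avg_run_min = 70
wasted_fraction = 0.10  # share of compute time lost to evictions (assumed)

useful_days = jobs * avg_run_min / MIN_PER_DAY
# If 10 % of all compute time was wasted, useful time is 90 % of the total.
total_days = useful_days / (1 - wasted_fraction)
dual_core_years = 2_600 / 2 / 365

print(f"useful cpu time : {useful_days:.0f} days")
print(f"incl. evictions : {total_days:.0f} days")
print(f"2 600 days on a dual core PC: {dual_core_years:.1f} years")
```

Both estimates land within about 10 % of the quoted ~ 2 600 days, and halving 2 600 days for two cores gives a little over 3.5 years.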

Condor service prerequisites

will need a Sun UNIX service account (contact CSD [email protected]) and a Condor account (http://www.liv.ac.uk/csd/registration/eScienceform.pdf)

to log in to the Condor server:

on MWS use PuTTY: Install University Applications | Internet | PuTTY 0.60

Mac/Linux: open terminal window and use ssh

off campus: use Apps Anywhere (PuTTY is in Utilities group)

to upload/download files to/from the Condor server:

on MWS use CoreFTP Lite: Install University Applications | Internet | CoreFTP LE 2.1

Mac/Linux: open terminal window, use sftp/scp

off campus: need to use virtual private network (VPN), then FTP

PuTTY login

[screenshot slides; images not reproduced in this transcript]

CoreFTP Lite

[screenshot slides; images not reproduced in this transcript]

CoreFTP Lite – download files

[screenshot slides; images not reproduced in this transcript]

Condor server directory tree

/                          (or ‘root’)
    /usr
    /bin
    /sbin
    /tmp
    /home                  login ‘home’ directories
        /home/fred
        /home/smithic
        /home/jim
    /condor_data           ‘home’ directories for Condor
        /condor_data/smithic
        /condor_data/jim

MATLAB Condor example

calculate the sum of p matrix-matrix products:

each product calculation is independent and can be performed in parallel

MATLAB M-file (product.m):

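function product
load input.mat;               % input file supplies matrices A and B
C=A*B;                        % one matrix-matrix product per job
save( 'output.mat', 'C' );    % write the result for this job
quit;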
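
The overall calculation (p independent products followed by a sum) can be sketched in plain Python. This is only an illustration of the structure: the 2x2 matrices, p = 3 and the helpers `matmul` and `matadd` are made up for the example, and the final summation step is not shown in product.m, which computes a single product per job.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matadd(x, y):
    """Add two matrices of the same shape."""
    return [[x[i][j] + y[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]

# Hypothetical inputs for p = 3 jobs; job i receives the pair (A_i, B_i).
inputs = [
    ([[1, 2], [3, 4]], [[1, 0], [0, 1]]),
    ([[0, 1], [1, 0]], [[2, 0], [0, 2]]),
    ([[1, 1], [1, 1]], [[1, 2], [3, 4]]),
]

# Step 1: the independent products (one per Condor job in the real workflow).
products = [matmul(a, b) for a, b in inputs]

# Step 2: sum the per-job results once all jobs have finished.
total = products[0]
for c in products[1:]:
    total = matadd(total, c)
print(total)
```

For these inputs the script prints `[[5, 10], [9, 10]]`. Only step 1 needs the pool; step 2 is cheap and can be done after the output files have been collected.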

Job submission example

[smithic@ulgp5 multiple]$ cd /condor_data/smithic    #change directory
[smithic@ulgp5 smithic]$ tar xf /opt1/condor/examples/handson.tar    #get examples
[smithic@ulgp5 smithic]$ cd matlab    #now in /condor_data/smithic/matlab
[smithic@ulgp5 matlab]$ ls    #list files
input0.mat  input2.mat  input4.mat  product
input1.mat  input3.mat  product.m
[smithic@ulgp5 matlab]$ matlab_build product.m    #create standalone executable
Submitting job(s).
1 job(s) submitted to cluster 503.
[smithic@ulgp5 matlab]$ condor_q    #get Condor queue status
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 503.0   smithic   6/7  15:19   0+00:00:10 R  0   0.0  runscript.bat wrap

1 jobs; 0 idle, 1 running, 0 held
[smithic@ulgp5 matlab]$ condor_q    #job has finished when gone from queue
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

Job submission example

[smithic@ulgp5 matlab]$ ls
input0.mat  input2.mat  input4.mat  product.bat  product.exe.manifest  product.sub
input1.mat  input3.mat  product     product.exe  product.m
[smithic@ulgp5 matlab]$ cat product    #display file contents
executable=product.exe
indexed_input_files=input.mat
indexed_output_files=output.mat
total_jobs=5
[smithic@ulgp5 matlab]$ matlab_submit product    #submit multiple MATLAB jobs
Submitting job(s).....
5 job(s) submitted to cluster 511.
[smithic@ulgp5 matlab]$ condor_q    #get status of jobs
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 511.0   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc
 511.1   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc
 511.2   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc
 511.3   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc
 511.4   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc

5 jobs; 0 idle, 5 running, 0 held
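
The relationship between the job description file and the numbered files in the directory listing can be sketched in Python. This is an inference from the file names shown (input0.mat to input4.mat for total_jobs=5), not a description of how matlab_submit is implemented; `parse_job_file` and `indexed_name` are hypothetical helpers written for the example.

```python
import os

def parse_job_file(text):
    """Parse key=value lines like those shown by `cat product`."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings

def indexed_name(filename, index):
    """Insert the job index before the extension: input.mat, 3 -> input3.mat."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}{index}{ext}"

job_file = """\
executable=product.exe
indexed_input_files=input.mat
indexed_output_files=output.mat
total_jobs=5
"""

settings = parse_job_file(job_file)
plan = [
    (indexed_name(settings["indexed_input_files"], i),
     indexed_name(settings["indexed_output_files"], i))
    for i in range(int(settings["total_jobs"]))
]
for pair in plan:
    print(pair)
```

Under this reading, job i takes input<i>.mat, sees it as input.mat when it runs, and its output.mat is returned as output<i>.mat.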

Job submission example

[smithic@ulgp5 matlab]$ condor_q    #some jobs completed, one still running
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 511.0   smithic   6/7  16:01   0+00:00:25 R  0   0.0  product.bat produc

1 jobs; 0 idle, 1 running, 0 held
[smithic@ulgp5 matlab]$ condor_q    #all jobs complete
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
[smithic@ulgp5 matlab]$ ls    #check output files
input0.mat  input3.mat   output1.mat  output4.mat  product.exe           product.sub
input1.mat  input4.mat   output2.mat  product      product.exe.manifest
input2.mat  output0.mat  output3.mat  product.bat  product.m
[smithic@ulgp5 matlab]$ zip output.zip output*.mat    #bundle output files

Summary

Condor can speed up processing by running large numbers of jobs in parallel

shorter jobs work best, but jobs of arbitrary length can be handled

user-written codes easiest to run (MATLAB, R, C/C++, FORTRAN etc)

commercial 3rd party software may work

needs to run on standard MWS PC without user interaction

all Condor jobs submitted via central UNIX server

Further Information

Condor

http://www.liv.ac.uk/e-science/condor

[email protected]

other research computing services

http://www.liv.ac.uk/csd/research/

[email protected]