
Page 1: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Experimental Comparative Study of Job Management Systems

George Washington University / George Mason University

http://ece.gmu.edu/lucite

Page 2: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Outline:

1. Review of experiments

2. Results

3. Encountered problems

4. Functional comparison

5. Extension to reconfigurable hardware

Page 3: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Review of Experiments

Page 4: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Our Testbed (hosts at gmu.edu / science.gmu.edu)

m1-m4: 4 x Linux RH6.2, 2 x PIII 500 MHz, 128 MB RAM
m5-m7: 3 x Linux RH6.2, 2 x PIII 450 MHz, 128 MB RAM
pallj (m0): Linux RH7.0, PIII 450 MHz, 512 MB RAM
palpc2: Linux, PII 400 MHz, 128 MB RAM
alicja: Solaris 8, UltraSparc IIi 360 MHz, 512 MB RAM
anna: Solaris 8, UltraSparc IIi 440 MHz, 128 MB RAM
magdalena: Solaris 8, UltraSparc IIi 440 MHz, 512 MB RAM
redfox: Solaris 8, UltraSparc IIi 330 MHz, 128 MB RAM

Page 5: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

SHORT JOBS (1 s ≤ execution time ≤ 2 minutes)

No.  Group  Name   Class  Script name    CPU time [s]  Memory Usage [MB]  Memory Requirements [MB]
1    NPB    FT     S      ft.S.sh        3.5           3.2                4
2    NPB    FT     W      ft.W.sh        9.4           6.4                8
3    NPB    MG     W      mg.W.sh        10.8          1.9                3
4    NPB    EP     S      ep.S.sh        26.5          0.25               1
5    NPB    EP     W      ep.W.sh        53.0          0.25               1
6    NPB    IS     W      is.W.sh        1.0           1.7                3
7    NPB    BT     S      bt.S.sh        3.0           2.5                3
8*   NPB    BT     W      bt.W.sh        115           17                 21
9    NSA    IS     7 mln  radix.7M.sh    6             12.8               16
10   UPC    Sobel  256    sobel.256.sh   4             0.4                1
11   UPC    Sobel  512    sobel.512.sh   17            0.8                1
12*  UPC    Sobel  1024   sobel.1024.sh  68            2.4                3
13   UPC    MM     512    matrix.1.sh    10.5          5.9                8
14   UPC    MM     1024   matrix.2.sh    21            9.9                12
15   UPC    MM     2048   matrix.3.sh    40            18.4               23

Average CPU time: 22.0 s

* benchmarks used to determine the relative CPU factors of execution hosts

Page 6: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Machine names  Host Type  Host Model       CPU Factor
m1-m4          Linux      PIII_2_500_128   1.65
m5-m7          Linux      PIII_2_450_128   1.55
pallj          Linux      PIII_1_450_512   1.60
palpc2         Linux      P2_1_400_128     1.70
alicja         Solaris64  USIIi_1_360_512  1.0
anna           Solaris64  USIIi_1_440_128  1.2
magdalena      Solaris64  USIIi_1_440_512  1.2
redfox         Solaris64  USIIi_1_330_128  1.2

CPU factors for the medium benchmark list, based on the execution times of bt.W and Sobel 1024

Page 7: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

MEDIUM JOBS (2 minutes ≤ execution time ≤ 10 minutes)

No.  Group   Name      Class  Script name           CPU time [min:s]  Memory Usage [MB]  Memory Requirements [MB]
1    NPB     EP        A      ep.A.sh               7:45              1.3                3
2    NPB     LU        W      lu.W.sh               8:09              6.8                9
3*   NPB     SP        W      sp.W.sh               6:07              15.1               19
4    Crypto  Mars      M      crypto.mars.M.sh      9:21              0.4                1
5    Crypto  RC6       M      crypto.rc6.M.sh       6:21              0.4                1
6    Crypto  Rijndael  M      crypto.rijndael.M.sh  4:11              0.4                1
7    Crypto  Serpent   M      crypto.serpent.M.sh   8:54              0.4                1
8*   Crypto  Twofish   M      crypto.twofish.M.sh   8:05              0.4                1

Average CPU time: 7:22

* benchmarks used to determine the relative CPU factors of execution hosts

Page 8: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

LONG JOBS (10 minutes ≤ execution time ≤ 30 minutes)

No.  Group   Name      Class  Script name           CPU time [min:s]  Memory Usage [MB]  Memory Requirements [MB]
1*   NPB     EP        B      ep.B.sh               30:15             5                  6
2    Crypto  Mars      L      crypto.mars.L.sh      14:55             0.4                1
3    Crypto  RC6       L      crypto.rc6.L.sh       10:07             0.4                1
4    Crypto  Rijndael  L      crypto.rijndael.L.sh  10:58             0.4                1
5    Crypto  Serpent   L      crypto.serpent.L.sh   14:09             0.4                1
6*   Crypto  Twofish   L      crypto.twofish.L.sh   20:45             0.4                1

Average CPU time: 16:51

* benchmarks used to determine the relative CPU factors of execution hosts

Page 9: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

INPUT/OUTPUT JOBS (1 second ≤ execution time ≤ 10 minutes)

No.  Group  Name  Class  Script name     CPU time [min:s]  Memory Usage [MB]  Memory Req. [MB]  Input files                      Output files
1    NPB    FT    S      ft.S.io.sh      0:04              3.2                4                 fft_64.in_pc, fft_64.in_sun      fft_64.out_pc, fft_64.out_sun
2    NPB    FT    W      ft.W.io.sh      0:10              6.4                8                 fft_128.in_pc, fft_128.in_sun    fft_128.out_pc, fft_128.out_sun
3    UPC    MM    512    matrix.1.io.sh  0:11              5.9                8                 mat_512.in_pc, mat_512.in_sun    mat_512.out_pc, mat_512.out_sun
4    UPC    MM    1024   matrix.2.io.sh  0:21              9.9                12                mat_1024.in_pc, mat_1024.in_sun  mat_1024.out_pc, mat_1024.out_sun
5    UPC    MM    2048   matrix.3.io.sh  0:40              18.4               23                mat_2048.in_pc, mat_2048.in_sun  mat_2048.out_pc, mat_2048.out_sun
6    NPB    LU    W      lu.W.io.sh      8:09              6.8                9                 -                                LU_W.out

Average CPU time: 1:36

Page 10: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Typical experiment

[Diagram: job submissions 1 ... N starting at time = 0; jobs i1 ... iN finishing execution on the same time scale]

Total time of an experiment: approximately 2 hours
N = 150 for medium and small jobs, 75 for long jobs
Pseudorandom delays between consecutive job submissions
Poisson distribution of the job submission rate (see the submission-driver sketch below)
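A minimal sketch of a submission driver consistent with the description above: N jobs are handed to the JMS with pseudorandom, exponentially distributed delays, so that submissions form a Poisson process with the requested average rate. The wrapper script name, the seed, and the 4 jobs/min rate are illustrative assumptions, not taken from the study.

/* Sketch: submit n_jobs jobs with exponentially distributed inter-submission
 * delays, i.e., Poisson arrivals at the requested average rate.
 * Compile with: cc poisson_submit.c -lm */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const int    n_jobs = 150;        /* 150 for short/medium job lists, 75 for long */
    const double rate   = 4.0 / 60.0; /* average submission rate in jobs per second  */

    srand48(1);
    for (int i = 1; i <= n_jobs; i++) {
        char cmd[128];
        snprintf(cmd, sizeof cmd, "./submit_job.sh %d", i);  /* hypothetical wrapper around the JMS submission command */
        system(cmd);

        double delay = -log(1.0 - drand48()) / rate;  /* exponential inter-arrival time */
        sleep((unsigned int)(delay + 0.5));           /* wait before the next submission */
    }
    return 0;
}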

Page 11: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

List of experiments

Experiment  Benchmark Set           Average CPU time / Job  Average Time Intervals Between Job Submissions  Total Number of Jobs  Special Assumptions
1           Set 2, Medium job list  7 min 22 s              30 s, 15 s, 5 s                                 150                   one job / CPU
2           Set 2, Medium job list  7 min 22 s              15 s                                            150                   two jobs / CPU
3           Set 3, Long job list    16 min 51 s             2 min, 30 s                                     75                    one job / CPU
4           Set 1, Short job list   22 s                    15 s, 10 s, 5 s                                 150                   one job / CPU
5           Set 4, I/O job list     1 min 36 s              15 s                                            150                   one job / CPU

Page 12: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Definition of timing parameters

[Timeline diagram: t_s -> t_b -> t_e -> t_d]

t_s – submission time
t_b – begin of execution time
t_e – end of execution time
t_d – delivery time
T_R – response time
T_TA – turn-around time
T_EXE – execution time
T_D – delivery time
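Restating the intervals marked on the timeline as formulas (the standard reading of the diagram; the relations themselves are not spelled out on the slide):

\begin{align*}
  T_{R}   &= t_b - t_s  && \text{response time}\\
  T_{EXE} &= t_e - t_b  && \text{execution time}\\
  T_{D}   &= t_d - t_e  && \text{delivery time}\\
  T_{TA}  &= t_d - t_s = T_{R} + T_{EXE} + T_{D} && \text{turn-around time}
\end{align*}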

Page 13: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Typical scenario

[Timeline diagram: t_s -> t_b -> t_e, with delivery time T_D = 0]

t_s – submission time
t_b – begin of execution time
t_e – end of execution time
T_R – response time
T_TA – turn-around time
T_EXE – execution time
T_D = 0 (delivery time = 0)

Time stamps determined using the gettimeofday() function
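A minimal sketch of the measurement mentioned above: time stamps are taken with gettimeofday() and intervals follow by subtraction. The wrapper structure around the benchmark is an assumption.

/* Capture t_b and t_e with gettimeofday() and report T_EXE; in the experiments
 * the same kind of time stamp is also logged at submission (t_s). */
#include <stdio.h>
#include <sys/time.h>

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    double t_b = now_seconds();   /* begin of execution */
    /* ... benchmark runs here ... */
    double t_e = now_seconds();   /* end of execution   */
    printf("T_EXE = %.6f s\n", t_e - t_b);
    return 0;
}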

Page 14: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Total Throughput

[Diagram: job submissions 1 ... N starting at time = 0; jobs i1 ... iN finishing execution]

T_N – time necessary to execute N jobs

Total Throughput = N / T_N

Page 15: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Partial Throughput

[Diagram: job submissions 1 ... N starting at time = 0; i_k is the k-th job to finish execution]

T_k – time necessary to execute k jobs

Throughput(k) = k / T_k
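A small sketch of how total and partial throughput follow from the logged completion times; the sample completion times below are illustrative only.

/* Partial throughput after k completed jobs is k / T_k, where T_k is the time
 * (counted from the first submission at time = 0) at which the k-th job
 * finished; total throughput is the case k = N. Reported in jobs/hour as in
 * the result charts. */
#include <stdio.h>

int main(void)
{
    double t_finish[] = { 35.0, 61.0, 118.0, 240.0, 355.0 }; /* sorted completion times [s] */
    int n = (int)(sizeof t_finish / sizeof t_finish[0]);

    for (int k = 1; k <= n; k++)
        printf("Throughput(%d) = %.1f jobs/hour\n", k, 3600.0 * k / t_finish[k - 1]);
    printf("Total throughput = %.1f jobs/hour\n", 3600.0 * n / t_finish[n - 1]);
    return 0;
}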

Page 16: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Utilization

[Diagram: machines 1 ... M, each running jobs (job1, job2, job3, ...); for machine j, the CPU utilization trace between 0% and 100% is averaged into U_avr(j)]

Overall utilization:  U = (1/M) * Σ_{j=1..M} U_avr(j)

Page 17: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Results

Page 18: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Medium jobs – Total Throughput [jobs/hour], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 76, 70, 68, 79 (2 jobs/min); 97, 91, 82, 114 (4 jobs/min); 107, 102, 86, 110 (12 jobs/min)]

Page 19: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Medium jobs – Turn-around Time [s], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 496, 462, 607, 505 (2 jobs/min); 1134, 944, 1293, 1148 (4 jobs/min); 1765, 1466, 1949, 1627 (12 jobs/min)]

Page 20: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Medium jobs – Response Time [s], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 13, 3, 31, 28 (2 jobs/min); 636, 452, 734, 671 (4 jobs/min); 1274, 984, 1385, 1156 (12 jobs/min)]

Page 21: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Medium jobs – Utilization [%], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 54, 41, 70, 61 (2 jobs/min); 63, 57, 71, 74 (4 jobs/min); 73, 67, 78, 69 (12 jobs/min)]

Page 22: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Long jobs – Total Throughput [jobs/hour], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 25, 26, 18, 40 (0.5 job/min); 28, 30, 23, 42 (2 jobs/min)]

Page 23: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Long jobs – Turn-around Time [s], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 1148, 1079, 1903, 1926 (0.5 job/min); 2191, 2163, 3401, 2357 (2 jobs/min)]

Page 24: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Long jobs – Response Time [s], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 13, 3, 3, 721 (0.5 job/min); 860, 799, 1478, 1225 (2 jobs/min)]

Page 25: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Long jobs – Utilization [%], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 43, 46, 52, 24 (0.5 job/min); 56, 58, 64, 69 (2 jobs/min)]

Page 26: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Short jobs – Total Throughput [jobs/hour], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 240, 227, 234, 160 (4 jobs/min); 356, 322, 337, 205 (6 jobs/min); 652, 414, 607, 280 (12 jobs/min); 1076, 576, 336, 1255 (30 jobs/min); 642, 370, 1027, 1210 (60 jobs/min)]

Page 27: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Short jobs – Turn-around Time [s], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 42, 34, 29, 50 (4 jobs/min); 41, 33, 29, 51 (6 jobs/min); 42, 58, 29, 51 (12 jobs/min); 68, 58, 31, 52 (30 jobs/min); 120, 62, 32, 50 (60 jobs/min)]

Page 28: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Short jobs – Response Time [s], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 9, 2, 1, 19 (4 jobs/min); 9, 3, 1, 19 (6 jobs/min); 9, 8, 1, 17 (12 jobs/min); 32, 8, 2, 18 (30 jobs/min); 83, 9, 2, 18 (60 jobs/min)]

Page 29: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Short jobs – Utilization [%], by average job submission rate, for LSF, PBS, Codine and Condor. Data labels as printed: 9, 18, 6, 6 (4 jobs/min); 15, 21, 9, 8 (6 jobs/min); 20, 35, 16, 10 (12 jobs/min); 26, 38, 12, 37 (30 jobs/min); 38, 12, 32, 37 (60 jobs/min)]

Page 30: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Medium jobs – Total Throughput [jobs/hour], by maximum number of jobs per CPU at 4 jobs/min, for LSF, PBS, Codine and Condor. Data labels as printed: 97, 91, 82, 114 (1 job/CPU); 90, 80, 67, 105 (2 jobs/CPU)]

Page 31: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Medium jobs – Turn-around Time [s], by maximum number of jobs per CPU at 4 jobs/min, for LSF, PBS, Codine and Condor. Data labels as printed: 1134, 944, 1293, 1147 (1 job/CPU); 1297, 1273, 1482, 969 (2 jobs/CPU)]

Page 32: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Medium jobs – Response Time [s], by maximum number of jobs per CPU at 4 jobs/min, for LSF, PBS, Codine and Condor. Data labels as printed: 636, 452, 734, 671 (1 job/CPU); 387, 285, 386, 386 (2 jobs/CPU)]

Page 33: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

[Bar chart: Medium jobs – Utilization [%], by maximum number of jobs per CPU at 4 jobs/min, for LSF, PBS, Codine and Condor. Data labels as printed: 63, 57, 71, 74 (1 job/CPU); 63, 58, 63, 54 (2 jobs/CPU)]

Page 34: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Encountered problems

Page 35: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

1. Jobs with high requirements on the stack size

Indication: Certain jobs do not finish execution when run under LSF. The same jobs run correctly outside of any JMS and under the other job management systems.

Source: Variable STACKLIMIT in $LSB_CONFDIR/<cluster_name>/configdir/lsb.queues

Remaining Problem: Documentation of default limits.
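A hedged illustration of the fix: raising the per-queue stack limit in lsb.queues. The queue name and the limit value are placeholders, not values from the study.

# $LSB_CONFDIR/<cluster_name>/configdir/lsb.queues
Begin Queue
QUEUE_NAME = normal
STACKLIMIT = 65536      # per-process stack limit; the value shown is only an example
End Queue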

Page 36: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

2. Frequently submitted small jobs

Indication: Unexpectedly high response time and turn-around time for a medium job submission rate

Possible solution: Define CHUNK_JOB_SIZE (e.g., CHUNK_JOB_SIZE=5) in lsb.queues, and set LSB_CHUNK_NORUSAGE=y in lsf.conf, as sketched below.
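The same settings written out as a configuration sketch; the queue section is illustrative, and only CHUNK_JOB_SIZE and LSB_CHUNK_NORUSAGE come from the slide.

# lsb.queues
Begin Queue
QUEUE_NAME     = short
CHUNK_JOB_SIZE = 5       # dispatch small jobs to an execution host in chunks of 5
End Queue

# lsf.conf
LSB_CHUNK_NORUSAGE=y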

Page 37: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

3. Ordering of machines fulfilling resource requirements

Question: How many machines are dropped from the list based on the first ordering?

Default ordering: r1m:pg (hosts ordered by the 1-minute run queue length, then by the paging rate)
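For reference, the same ordering can also be requested explicitly in the resource requirement string of a submission; a hedged example using the two load indices named above (the benchmark script is arbitrary):

bsub -R "order[r1m:pg]" ./ep.A.sh    # order candidate hosts by 1-minute run queue length, then paging rate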

Page 38: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

4. Random behavior from iteration to iteration

Question: Why is r1m different each time?

Indication: Assignment of jobs to particular machines is different in each iteration of the experiment

Page 39: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

5. Boundary effects in the calculation of the throughput

Question: How to define the steady state throughput?

Indication: Steady state partial throughput different than the total throughput

Page 40: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

6. Throughput vs. turn-around time

Question: How to explain the lack of this correlation?

Indication: No correlation between the ranking of JMSes in terms of the throughput and in terms of the turn-around time

Page 41: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Functional comparison

Page 42: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Operating system, flexibility, user interface

LSF Codine PBS CONDOR RES

Distribution: com, pub, pub/com, pub, gov (in the order of the systems above)
Source code
OS Support: Solaris, Linux, Tru64, NT
User Interface: GUI & CLI for four of the systems, CLI only for one

Page 43: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Scheduling and Resource Management

LSF Codine PBS CONDOR RES

Batch jobs

Interactive jobs

Parallel jobs

Accounting

Page 44: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Efficiency and Utilization

LSF Codine PBS CONDOR RES

Stage-in and stage-out

Timesharing

Process migration

Dynamic load balancing

Scalability

Page 45: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Fault Tolerance and Security

LSF Codine PBS CONDOR RES

Checkpointing

Daemon fault recovery

Authentication

Authorization

Page 46: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Documentation and Technical Support

LSF Codine PBS CONDOR RES

Documentation

Technical support

Page 47: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

JMS features supporting extension to reconfigurable hardware

• capability to define new dynamic resources

• strong support for stage-in and stage-out (see the staging example below):
  - configuration bitstreams
  - executable code
  - input/output data

• support for Windows NT and Linux
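As an illustration of the staging point above, LSF-style file staging can be requested per job with bsub -f; a sketch with hypothetical file names, where ">" copies a file to the execution host before the job runs and "<" copies it back afterwards:

bsub -f "design.bit > design.bit" \
     -f "input.dat > input.dat" \
     -f "result.dat < result.dat" \
     ./run_fpga_job.sh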

Page 48: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Ranking of Centralized Job Management Systems (1)

Capability to define new dynamic resources:

Excellent: LSF, PBS, CODINE
More difficult: CONDOR, RES

Stage-in and stage-out:

Excellent: LSF, PBS
Limited: CONDOR
No: CODINE, RES

Page 49: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Ranking of Centralized Job Management Systems (2)

Overall suitability to extend to reconfigurable hardware:

1. LSF
2. CODINE
3. PBS
4. CONDOR
5. RES

without changing the JMS source code

requires changes to the JMS source code

Page 50: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Extension to reconfigurable hardware

Page 51: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Extension of LSF to reconfigurable hardware (1): Operation of LSF

[Diagram: a job is submitted with "bsub app" through the Batch API on the submission host (which runs a LIM); load information flows from the LIMs on all hosts to the MLIM on the master host; the MBD on the master host queues the job and dispatches it to an execution host, where the SBD spawns a child SBD and the RES runs the user job; numbered arrows 1-13 trace this sequence]

LIM – Load Information Manager
MLIM – Master LIM
MBD – Master Batch Daemon
SBD – Slave Batch Daemon
RES – Remote Execution Server

Page 52: Experimental Comparative Study of Job Management Systems George Washington University George Mason University

Extension of LSF to reconfigurable hardware (2)

[Diagram: the same LSF structure as on the previous slide (submission host with LIM and Batch API, master host with MLIM, MBD and queue, execution host with SBD, child SBD, LIM, RES and the user job, steps 1-13), extended with an ELIM on the execution host that reports the status of the FPGA board as load information, and a step 14 in which the user job accesses the FPGA board through the ACS API]

ELIM – External Load Information Manager
ACS API – Adaptive Computing Systems API
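A minimal sketch of what the added ELIM could look like. LSF expects an ELIM to keep writing lines of the form "<number_of_indices> <name1> <value1> ..." to standard output; the index name fpga_free, the board query, and the 15 s reporting interval are assumptions for illustration.

/* External LIM (ELIM) sketch: periodically report one dynamic load index,
 * "fpga_free", giving the number of currently unused FPGA boards.
 * query_free_boards() stands in for a call into the ACS API / board driver. */
#include <stdio.h>
#include <unistd.h>

static int query_free_boards(void)
{
    return 1;   /* placeholder: a real ELIM would query the board status here */
}

int main(void)
{
    for (;;) {
        printf("1 fpga_free %d\n", query_free_boards());
        fflush(stdout);   /* LSF reads the report line by line */
        sleep(15);        /* reporting interval (assumed) */
    }
    return 0;
}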