
Common Practices for Managing Small HPC Clusters
Supercomputing 12
roger.bielefeld@cwru.edu
david@uwm.edu

Small HPC BoF @ Supercomputing

• 2012 Survey Instrument: tinyurl.com/smallHPC

• Fourth annual BoF: https://sites.google.com/site/smallhpc/

• Discussing 2011 and 2012 survey results…

GPUs

How many GPU cores does this cluster have?

Answer             2011    2012
None               61%     41%
1 - 999            33%     45%
1,000 - 4,999       6%      5%
5,000 - 9,999       0%      0%
10,000 - 24,999     0%      5%
25,000 or more      0%      5%

20% decrease in those replying “None”
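
For context on the scale of these buckets (a hedged aside, not survey data): even a handful of GPU-equipped nodes quickly pushes a cluster into the 1,000 - 4,999 range, since each GPU contributes hundreds of cores. A minimal Python sketch with assumed hardware counts:

    # Hypothetical illustration only -- node, GPU, and core counts are assumptions.
    nodes_with_gpus = 4    # assumed GPU-equipped nodes
    gpus_per_node = 2      # assumed GPUs per node
    cores_per_gpu = 512    # assumed cores per GPU; varies widely by model

    total_gpu_cores = nodes_with_gpus * gpus_per_node * cores_per_gpu
    print(total_gpu_cores)  # 4096 -> falls in the 1,000 - 4,999 bucket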

Hardware Configuration

How many CPU cores are in this HPC cluster?

2011:
Answer         %
< 200          21%
200 - 1000     16%
1001 - 2000    26%
2001 - 4000    26%
> 4000         11%

2012:
Answer         %
< 200          14%
200 - 999      14%
1,000 - 4,999  36%
5,000 - 9,999  23%
10,000 +       14%

Shift toward greater numbers of cores

How much memory is there per physical server, i.e., per compute node?

2011:
Answer        %
0 - 8 GB      16%
9 - 16 GB     32%
17 - 24 GB    11%
25 - 32 GB    21%
33 - 64 GB    58%
> 64 GB       37%
Unsure         5%

2012:
Answer        %
0 - 8 GB       9%
9 - 16 GB     23%
17 - 24 GB    41%
25 - 32 GB    14%
33 - 64 GB    32%
65 - 128 GB   23%
129 - 256 GB  18%
> 256 GB      23%
Unsure         0%

How much memory is there per core?

Answer        2012
0 - 0.5 GB     0%
0.6 - 1 GB     5%
1.1 - 2 GB    32%
2.1 - 4 GB    50%
4.1 - 6 GB     9%
6.1 - 8 GB    23%
8.1 - 16 GB   23%
> 16 GB       18%
Unsure         0%

Long tail toward the high side
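
The per-core numbers follow directly from the per-node answers: memory per core is memory per node divided by cores per node. A quick sketch with assumed node configurations (illustrative only, not survey responses):

    # Assumed node configurations for illustration; not taken from the survey.
    example_nodes = [
        ("12 cores, 24 GB", 24, 12),
        ("16 cores, 64 GB", 64, 16),
        ("16 cores, 256 GB", 256, 16),
    ]
    for label, mem_gb, cores in example_nodes:
        print(f"{label}: {mem_gb / cores:.1f} GB/core")
    # 2.0, 4.0 and 16.0 GB/core -- consistent with the 2.1 - 4 GB mode
    # and the long tail toward the high side.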

What low latency network for MPI communication among CPU cores?

Answer                         2011   2012
InfiniBand                     74%    77%
10 Gigabit Ethernet (10 GbE)   11%    14%
1 Gigabit Ethernet (1 GbE)     11%     9%
Unsure                          5%     0%

Little Change

Would you choose this form of low latency networking again?

         2011   2012
Yes      94%    86%
No        6%    14%

Reasons given by the 14% who said No:

• InfiniBand next time
• 1 Gigabit Ethernet is actually a 2 Gb network, and the vendor is no longer viable for low-latency bandwidth
• 10 Gigabit Ethernet vendor is no longer applicable in this area
• Will move to 10 Gb in the near future

Scheduler Questions

What is the scheduler?

Answer                         2011   2012
PBS                             6%     9%
TORQUE                         33%    59%
SGE (Oracle/Sun Grid Engine)   28%    23%
LSF                            33%     5%
Lava                            0%     0%
Condor                          6%     0%
Maui/Moab                      39%    55%
SLURM                           0%     9%
Unsure                         ---     0%
Other                           0%     5%

Would you choose this scheduler again?

         2011   2012
Yes      94%    79%
No        6%    21%

Those who said No, commented:

• Documentation has been shaky after Oracle's acquisition of Sun. The spinoff, various versions, and invalid links to documentation made it quite challenging.
• Queue-based preemption caused problems (Oracle/SGE)
• The volume of jobs being submitted is causing issues with load on the open source scheduler; upgrading (Maui/Moab)
• SGE is no longer supported; looking to use UGE

Do you allow serial jobs on this cluster?

         2011   2012
Yes      95%    95%
No        5%     5%

No Change

What is the maximum amount of memory allowed per serial job?

2011:
Answer                %
No maximum enforced   16%
< 2 GB                32%
3 - 7 GB              11%
8 - 15 GB             21%
16 - 24 GB            58%
> 24 GB               37%

2012:
Answer                %
No maximum enforced    9%
2 GB or less          23%
3 - 8 GB              41%
9 - 16 GB             14%
17 - 24 GB            32%
More than 24 GB       23%

Decreases at the high end
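
How a maximum gets enforced varies by site and scheduler; the survey did not ask. As a scheduler-agnostic sketch only, a submit-time check might look like the following (the 24 GB cap, function names, and request format are assumptions for illustration, not any particular scheduler's interface):

    # Scheduler-agnostic sketch of a per-serial-job memory cap. All names,
    # the 24 GB limit, and the request format are illustrative assumptions.
    SERIAL_JOB_MAX_GB = 24

    def parse_mem_gb(request: str) -> float:
        """Convert a request like '8gb' or '512mb' into GB."""
        request = request.strip().lower()
        if request.endswith("gb"):
            return float(request[:-2])
        if request.endswith("mb"):
            return float(request[:-2]) / 1024
        raise ValueError(f"unrecognized memory request: {request!r}")

    def accept_serial_job(mem_request: str) -> bool:
        """True if the requested memory fits under the site's cap."""
        return parse_mem_gb(mem_request) <= SERIAL_JOB_MAX_GB

    print(accept_serial_job("16gb"))  # True
    print(accept_serial_job("48gb"))  # False -- rejected under this hypothetical policy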

What is the maximum amount of memory allowed per multi-core (MP or MPI) job?

2011:
Answer                %
No maximum enforced   74%
< 8 GB                 0%
8 - 15 GB              5%
16 - 31 GB             0%
32 - 48 GB            11%
> 48 GB               11%

2012:
Answer                %
No maximum enforced   59%
8 GB or less           0%
9 - 16 GB              0%
17 - 32 GB             9%
33 - 48 GB             0%
More than 48 GB       32%

More maximums enforced, but maximums are high.

Storage

Where do users' home directories reside?

Answer                                           2011   2012
Local disk                                        0%     0%
NFS                                              47%    55%
Parallel file system (e.g., Lustre or Panasas)   47%    41%
Unsure                                            0%     0%
Other                                             5%     5%

Little Change

Would you configure users' home directories this way again?

         2011   2012
Yes      88%   100%
No       12%     0%

People are satisfied with what they are doing

What type of high performance storage/scratch space?

Answer                                           2011   2012
Local disk                                       26%    27%
NFS                                              16%    27%
Parallel file system (e.g., Lustre or Panasas)  100%   100%
Unsure                                            0%     0%
Other (GPFS)                                      0%     9%

Little Change

Would you configure high performance/scratch space this way again?

         2011   2012
Yes     100%    95%
No        0%     5%

Essentially no change

Do you have an online, medium performance data storage service?

          2011   2012
Yes       28%    55%
No        72%    45%
Unsure     0%     0%

Which of the following storage environments on this cluster do you back up?

Answer                               2011   2012
Home directories                     53%    68%
High performance / scratch space      5%     9%
Medium performance, online storage   11%    45%
None                                 47%    27%
Unsure                                0%     0%

Current Directions

If you were buying new compute nodes today, how many cores per node?

2011:
Answer    %
4          0%
8          5%
12        21%
16        32%
> 16      16%
Unsure    26%

2012:
Answer    %
4          0%
8          9%
12         5%
16        50%
24         0%
32        14%
Unsure    14%
Other      9%

If you were buying new compute nodes today, how much memory per node?

2011:
Answer            %
0 - 8 GB          0%
9 - 16 GB         0%
17 - 24 GB        0%
25 - 48 GB       41%
> 48 GB          35%
Unsure           24%

2012:
Answer            %
0 - 8 GB          9%
9 - 16 GB         9%
17 - 24 GB        0%
25 - 48 GB       23%
49 - 64 GB        9%
More than 64 GB  41%
Unsure            9%
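
Taken together with the previous slide, the modal 2012 purchasing answers (16 cores per node, more than 64 GB per node) imply at least roughly 4 GB per core, in line with the memory-per-core results reported earlier. A quick check, using assumptions drawn from those modal answers:

    # Implied memory per core for the most common 2012 purchasing answers.
    cores_per_node = 16        # modal 2012 answer for cores per node
    memory_per_node_gb = 64    # lower bound of the modal "more than 64 GB" answer
    print(memory_per_node_gb / cores_per_node)  # 4.0 GB/core, at minimum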

Staffing

How many different individuals, excl. students, involved in operation, support, development?

Answer                     2012
1 individual                5%
2-3 individuals            41%
4-5 individuals            32%
6-8 individuals            18%
9-10 individuals            5%
11-15 individuals           0%
More than 15 individuals    0%

Approximately how many FTE, incl. students, operate the cluster to maintain the status quo (excluding user support)?

Answer            2012
< 1 FTE           27%
1.1 – 2 FTE       41%
2.1 – 4 FTE       23%
4.1 – 6 FTE        9%
6.1 – 8 FTE        0%
More than 8 FTE    0%

Approximately how many FTE, incl. students, support users of the cluster?

Answer            2012
< 1 FTE           32%
1.1 – 2 FTE       27%
2.1 – 4 FTE       27%
4.1 – 6 FTE        9%
6.1 – 8 FTE        5%
More than 8 FTE    0%

Generally fewer FTE than for operations support

Approximately how many FTE, incl. students, are involved in hardware/software development efforts related to the cluster?

Answer            2012
< 1 FTE           48%
1.1 – 2 FTE       33%
2.1 – 4 FTE       14%
4.1 – 6 FTE        5%
6.1 – 8 FTE        0%
More than 8 FTE    0%

Even fewer FTE than for user support

Inward facing staff versus outward facing staff?

Answer                           2012
There is a clear separation.      9%
There is some separation.        41%
There is almost no separation.   50%

Most staff do both operational and end user support

Small HPC BoFs Contact Information

Survey tinyurl.com/smallHPC

Website https://sites.google.com/site/smallhpc/

Email List See link at above website

Roger Bielefeld roger.bielefeld@cwru.edu

David Stack david@uwm.edu