NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Evolution of the NERSC SP System
NERSC User Services
June 2000
Original Plans
Phase 1
Phase 2
Programming Models and Code Porting
Using the System
Original Plans: The NERSC-3 Procurement
• Complete, reliable, high-end scientific system
• High availability and MTBF
• Fully configured - processing, storage, software, networking, support
• Commercially available components
• The greatest amount of computational power for the money
• Can be integrated with existing computing environment
• Can be evolved with product line
• Extensive benchmarking and acceptance testing were done
Original Plans: The NERSC-3 Procurement
• What we wanted:
– >1 teraflop of peak performance
– 10 terabytes of storage
– 1 terabyte of memory
• What we got in phase 1:
– 410 gigaflops of peak performance
– 10 terabytes of storage
– 512 gigabytes of memory
• What we will get in phase 2:
– 3 teraflops of peak performance
– 15 terabytes of storage
– 1 terabyte of memory
Hardware, Phase 1
• 304 Power 3+ nodes: Nighthawk 1
– Node usage:
• 256 compute/batch nodes = 512 CPUs
• 8 login nodes = 16 CPUs
• 16 GPFS nodes = 32 CPUs
• 8 network nodes = 16 CPUs
• 16 service nodes = 32 CPUs
– 2 processors/node
– 200 MHz clock
– 4 flops/clock (2 multiply-add ops) = 800 Mflops/CPU, 1.6 Gflops/node
– 64 KB L-1 d-cache per CPU @ 5 nsec & 3.2 GB/sec
– 4 MB L-2 cache per CPU @ 45 nsec & 6.4 GB/sec
– 1 GB RAM per node @ 175 nsec & 1.6 GB/sec
– 150 MB/sec switch bandwidth
– 9 GB local disk (two-way RAID)
Hardware, Phase 2
• 152 Power 3+ nodes: Nighthawk 2
– Node usage:
• 128 compute/batch nodes = 2048 CPUs
• 2 login nodes = 32 CPUs
• 16 GPFS nodes = 256 CPUs
• 2 network nodes = 32 CPUs
• 4 service nodes = 64 CPUs
– 16 processors/node
– 375 MHz clock
– 4 flops/clock (2 multiply-add ops) = 1.5 Gflops/CPU, 24 Gflops/node
– 64 KB L-1 d-cache per CPU @ 5 nsec & 3.2 GB/sec
– 8 MB L-2 cache per CPU @ 45 nsec & 6.4 GB/sec
– 8 GB RAM per node @ 175 nsec & 14.0 GB/sec
– ~2000 (?) MB/sec switch bandwidth
– 9 GB local disk (two-way RAID)
Programming Models, Phase 1
• Phase 1 will rely on MPI, with threading available via:
– OpenMP directives
– Pthreads
– IBM SMP directives
• MPI now does intra-node communications efficiently
• Mixed-model programming is not currently very advantageous
• PVM and LAPI messaging systems are also available
• SHMEM is “planned”…
• The SP has cache and virtual memory, which means:
– There are more ways to reduce code performance
– There are more ways to lose portability
Programming Models, Phase 2
• Phase 2 will offer more payback for mixed-model programming:
– Single-node parallelism is a good target for PVP users
– Vector and shared-memory codes can be “expanded” into MPI
– MPI codes can be ported from the T3E
– Threading can be added within MPI
• In either case, re-engineering will be required to exploit new and different levels of granularity
• This can be done along with increasing problem sizes
Porting Considerations, part 1
• Things to watch out for in porting codes to the SP
– Cache
• Not enough on the T3E to make worrying about it worth the trouble
• Enough on the SP to boost performance, if it’s used well
• Tuning for cache is different from tuning for vectorization
• False sharing of cache lines can reduce performance
– Virtual memory
• Gives you access to 1.75 GB of (virtual) RAM address space
• To use all of virtual (or even real) memory, must explicitly request “segments”
• Causes performance degradation due to paging
– Data types
• Default sizes are different on PVP, T3E, and SP systems
• “integer”, “int”, “real”, and “float” must be used carefully
• Best to say what you mean: “real*8”, “integer*4” (see the sketch after this list)
• Do the same in MPI calls: “MPI_REAL8”, “MPI_INTEGER4”
• Be careful with intrinsic function use, as well
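As a minimal sketch of this advice (variable names hypothetical; a broadcast is used only because it is safe at any task count), the declarations and the MPI datatype names state the sizes explicitly, so they agree on every platform:

      program sizes
      implicit none
      include 'mpif.h'
      integer*4 ierr, n               ! explicit: integer*4, not "integer"
      real*8    x(100)                ! explicit: real*8, not "real"

      call MPI_INIT(ierr)
      n = 100
      x = 0.0d0
      ! The MPI datatype names match the declared Fortran sizes:
      call MPI_BCAST(x, n, MPI_REAL8,    0, MPI_COMM_WORLD, ierr)
      call MPI_BCAST(n, 1, MPI_INTEGER4, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end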
Porting Considerations, part 2
• More things to watch out for in porting codes to the SP
– Arithmetic
• Architecture tuning can help exploit special processor instructions
• Both T3E and SP can optimize beyond IEEE arithmetic
• T3E and PVP can also do fast reduced precision arithmetic
• Compiler options on T3E and SP can force IEEE compliance
• Compiler options can also throttle other optimizations for safety
• Special libraries offer faster intrinsics
– MPI
• SP compilers and runtime will catch loose usage that was accepted on the T3E
• Communication bandwidth on SP Phase 1 is lower than on the T3E
• Message latency on the SP Phase 1 is higher than on the T3E
• We expect approximate parity with T3E in these areas, on the Phase 2 system
• Limited number of communication ports per node - approximately one per CPU
• “Default” versus “eager” buffer management in MPI_SEND
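To see why the buffer-management point matters, here is a minimal sketch assuming exactly two tasks (buffer size and tag arbitrary): a pair of simultaneous blocking MPI_SENDs may appear to work while messages fit in the eager buffers, then hang once the library switches to rendezvous for larger messages. MPI_SENDRECV pairs each send with a receive and assumes no buffering at all:

      program exchange
      implicit none
      include 'mpif.h'
      integer ierr, me, np, other
      integer status(MPI_STATUS_SIZE)
      real*8  sbuf(100000), rbuf(100000)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
      other = 1 - me                   ! assumes exactly two tasks
      sbuf = dble(me)
      ! Two simultaneous blocking MPI_SENDs would depend on eager
      ! buffering; MPI_SENDRECV is safe regardless of message size.
      call MPI_SENDRECV(sbuf, 100000, MPI_REAL8, other, 0,
     &                  rbuf, 100000, MPI_REAL8, other, 0,
     &                  MPI_COMM_WORLD, status, ierr)
      call MPI_FINALIZE(ierr)
      end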
Porting Considerations, part 3
• Compiling & linking
– The compiler “version” to invoke depends on language and parallelization scheme
• Language version:
– Fortran 77: f77, xlf
– Fortran 90: xlf90
– Fortran 95: xlf95
– C: cc, xlc, c89
– C++: xlC
• MPI-included: mpxlf, mpxlf90, mpcc, mpCC
• Thread-safe: xlf_r, xlf90_r, xlf95_r, mpxlf_r, mpxlf90_r
– Preprocessing can be ordered by compiler flag or source file suffix
• Use consistently, for all related compilations; the following may NOT produce a parallel executable, because the link step omits the MPI wrapper:
mpxlf90 -c *.F
xlf90 -o foo *.o
• Use the -bmaxdata:bytes option to get more than a single 256 MB data segment (up to 7 segments, or ~1.75 GB, can be specified; only 3, or 0.75 GB, are real memory)
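By contrast, a consistent build might look like the following sketch (file and program names hypothetical); the MPI wrapper is used for both the compile and the link steps, and -bmaxdata requests the full 7 x 256 MB (0x70000000 bytes) of data address space:

mpxlf90 -c -O3 solver.F grid.F
mpxlf90 -o myapp solver.o grid.o -bmaxdata:0x70000000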
Porting: MPI
• MPI codes should port relatively well
• Use one MPI task per node or processor
– One per node during porting
– One per processor during production
– Let MPI worry about where it’s communicating to
– Environment variables, execution parameters, and/or batch options can specify (see the example after this list):
• # tasks per node
• Total # tasks
• Total # processors
• Total # nodes
• Communications subsystem in use
– User Space is best in batch jobs
– IP may be best for interactive developmental runs
• There is a debug queue/class in batch
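As an illustration (program name and counts hypothetical, assuming the standard POE options -procs, -nodes, -tasks_per_node, and -euilib), the same choices can be made on the poe command line:

poe ./myapp -nodes 4 -tasks_per_node 2 -procs 8 -euilib us
poe ./myapp -procs 4 -euilib ip

The first form asks for User Space communication, appropriate for batch-style work; the second falls back to IP for an interactive development run.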
Porting: Shared Memory
• Don’t throw away old shared memory directives
– OpenMP will work as is
– Cray Tasking directives will be useful for documentation
– We recommend porting Cray directives to OpenMP (see the sketch after this list)
– Even small-scale parallelism can be useful
– Larger scale parallelism will be available next year
• If your problems and/or algorithms will scale to larger granularities and greater parallelism, prepare for message passing
– We recommend MPI
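A minimal sketch of that conversion (loop body illustrative; the Cray form is autotasking syntax):

! Before, with a Cray autotasking directive:
CMIC$ DO ALL PRIVATE(I) SHARED(A, N)
      do I = 1, N
         A(I) = 2.0 * A(I)
      enddo

! After, with the equivalent OpenMP directive:
!$OMP PARALLEL DO PRIVATE(I) SHARED(A, N)
      do I = 1, N
         A(I) = 2.0 * A(I)
      enddo
!$OMP END PARALLEL DO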
From Loop-slicing to MPI, before...

      allocate(A(1:imax, 1:jmax))
! Slice the loop nest across the node's CPUs with OpenMP
!$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
      do I = 1, imax
         do J = 1, jmax
            A(I,J) = deep_thought(A, I, J, ...)
         enddo
      enddo
• Sanity checking
– Run the program on one CPU to get baseline answers
– Run on several CPUs to see parallel speedups and answers
• Optimization
– Consider changing memory access patterns to improve cache usage
– How big can your problem get before you run out of real memory?
From Loop-slicing to MPI, after...

      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
! my_indices is a user routine that computes this task's index ranges;
! the neighbor ranks (left_id, right_id, top_id, bottom_id), buffer
! sizes, and status array (integer, size MPI_STATUS_SIZE) used below
! are illustrative and would be set up similarly
      call my_indices(my_id, nprocs, my_imin, my_imax, my_jmin, my_jmax)
      allocate(A(my_imin : my_imax, my_jmin : my_jmax))
!$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, my_imin, my_imax, my_jmin, my_jmax)
      do I = my_imin, my_imax
         do J = my_jmin, my_jmax
            A(I,J) = deep_thought(A, I, J, ...)
         enddo
      enddo
! Communicate the shared values with neighbors...
! Odd tasks send first and even tasks receive first, so the pairwise
! exchange cannot deadlock even without eager buffering
      if (mod(my_id, 2) .eq. 1) then
         call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_id, tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_id, tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_top(...), topsize, MPI_REAL, top_id, tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
      else
         call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_id, tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_id, tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_top(...), topsize, MPI_REAL, top_id, tag, MPI_COMM_WORLD, ierr)
      endif
From Loop-slicing to MPI, after...
• You now have one MPI task and many OpenMP threads per node
– The MPI task does all the communicating between nodes
– The OpenMP threads do the parallelizable work
– Do NOT use MPI within an OpenMP parallel region
• Sanity checking
– Run on one node and one CPU to check baseline answers
– Run on one node and several CPUs to see parallel speedup and answers
– Run on several nodes, one CPU per node, and check answers
– Run on several nodes, several CPUs per node, and check answers
• Scaling checking
– Run a larger version of a similar problem on the same set of ensemble sizes
– Run the same sized problem on a larger ensemble
• (Re-)Consider your I/O strategy…
From MPI to Loop-slicing
• Add OpenMP directives to existing code
• Perform sanity and scaling checks, as before
• Results in the same overall code structure as on the previous slides
– One MPI task and several OpenMP threads per node
• For irregular codes, Pthreads may serve better, at the cost of increased complexity
• Nobody really expects it to be this easy...
Using the Machine, part 1
• Somewhat similar to the Crays
– Interactive and batch jobs are possible

Class        Max Nodes   Max Processors   Max Time     Priority
debug         16          32              30 minutes   20000
premium      256         512              4 hours      10000
regular      256         512              4 hours       5000
low          256         512              4 hours          1
interactive    8          16              20 minutes   15000
Using the Machine, part 2
• Interactive runs
– Sequential executions run immediately on your login node
– Every login will likely put you on a different node, so be careful when looking for your executions; “ps” returns information only about the node you’re logged into
– Small scale parallel jobs may be rejected if LoadLeveler can’t find the resources
– There are two pools of nodes that can be used for interactive jobs:
• Login nodes
• A small subset of the compute nodes
– Parallel execution can often be achieved by:
• Trying again, after initial rejection
• Changing communication mechanisms from User Space to IP
• Using the other pool
Using the Machine, part 3
• Batch jobs
– Currently, very similar in capability to the T3E
• Similar run times, processor counts
• More memory available on the SP
– Limits and capabilities may change, as we learn the machine
– LoadLeveler is similar to, but simpler than, NQE/NQS on the T3E
• Jobs are submitted, monitored, and cancelled by special commands (llsubmit, llq, llcancel)
• Each batch job requires a script that is essentially a shell script (see the sketch after this list)
• The first few lines contain batch options that look like comments to the shell
• The rest of the script can contain any shell constructs
• Scripts can be debugged by executing them interactively
• Users are limited to 3 running jobs, 10 queued jobs, and 30 submitted jobs, at any given time
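A minimal sketch of such a script, assuming typical LoadLeveler keywords (class, counts, and names hypothetical):

#!/usr/bin/csh
#@ job_type         = parallel
#@ class            = regular
#@ node             = 4
#@ tasks_per_node   = 2
#@ wall_clock_limit = 4:00:00
#@ output           = myjob.$(jobid).out
#@ error            = myjob.$(jobid).err
#@ queue

# Below the options, which look like comments to the shell,
# any ordinary shell constructs may appear:
cd $SCRATCH
./myapp

Such a script would be submitted with llsubmit, monitored with llq, and cancelled with llcancel.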
Using the Machine, part 4
• File systems
– Use the environment variables to let the system manage your file usage
– Sequential work can be done in $HOME (not backed up) or $TMPDIR (transient)
• Medium performance, node-local
– Parallel work can be done in $SCRATCH (transient) or /scratch/username (purgeable)
• High performance, located in GPFS
– HPSS is available from batch jobs via HSI, and interactively via FTP, PFTP, and HSI (see the example after this list)
– There are quotas on space and inode usage
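For example (file and directory names hypothetical), a batch job might archive its results to HPSS with HSI:

hsi "put $SCRATCH/results.dat : run42/results.dat"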
Using the Machine, part 5
• The future?
– The allowed scale of parallelism (CPU counts) may change
• Max now = 512 CPUs, same as on the T3E
– The allowed duration of runs may change
• Max now = 4 hours; max on the T3E = 12 hours
– The size of possible problems will definitely change
• More CPUs in Phase 1 than on the T3E
• More memory per CPU, in both phases, than on the T3E
– The amount of work possible per unit time will definitely change
• CPUs in both phases are faster than those on the T3E
• The Phase 2 interconnect will be faster than Phase 1's
– Better machine management
• Checkpointing will be available
• We will learn what can be adjusted in the batch system
– There will be more and better tools for monitoring and tuning
• HPM, KAP, Tau, PAPI...
– Some current problems will go away (e.g., memory-mapped files)