NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Evolution of the NERSC SP System
NERSC User Services
June 2000
Original Plans
Phase 1
Phase 2
Programming Models and Code Porting
Using the System
Original Plans: The NERSC-3 Procurement
• Complete, reliable, high-end scientific system
• High availability and MTBF
• Fully configured - processing, storage, software, networking, support
• Commercially available components
• The greatest amount of computational power for the money
• Can be integrated with existing computing environment
• Can be evolved with product line
• Extensive benchmarking and acceptance testing were done
Original Plans: The NERSC-3 Procurement
• What we wanted:
– >1 teraflop of peak performance
– 10 terabytes of storage
– 1 terabyte of memory
• What we got in phase 1:
– 410 gigaflops of peak performance
– 10 terabytes of storage
– 512 gigabytes of memory
• What we will get in phase 2:
– 3 teraflops of peak performance
– 15 terabytes of storage
– 1 terabyte of memory
Hardware, Phase 1
• 304 Power 3+ nodes: Nighthawk 1
– Node usage:
• 256 compute/batch nodes = 512 CPUs
• 8 login nodes = 16 CPUs
• 16 GPFS nodes = 32 CPUs
• 8 network nodes = 16 CPUs
• 16 service nodes = 32 CPUs
– 2 processors/node
– 200 MHz clock
– 4 flops/clock (2 multiply-add ops) = 800 Mflops/CPU, 1.6 Gflops/node
– 64 KB L-1 d-cache per CPU @ 5 nsec & 3.2 GB/sec
– 4 MB L-2 cache per CPU @ 45 nsec & 6.4 GB/sec
– 1 GB RAM per node @ 175 nsec & 1.6 GB/sec
– 150 MB/sec switch bandwidth
– 9 GB local disk (two-way RAID)
Hardware, Phase 2
• 152 Power 3+ nodes: Nighthawk 2
– Node usage:
• 128 compute/batch nodes = 2048 CPUs
• 2 login nodes = 32 CPUs
• 16 GPFS nodes = 256 CPUs
• 2 network nodes = 32 CPUs
• 4 service nodes = 64 CPUs
– 16 processors/node
– 375 MHz clock
– 4 flops/clock (2 multiply-add ops) = 1.5 Gflops/CPU, 24 Gflops/node
– 64 KB L-1 d-cache per CPU @ 5 nsec & 3.2 GB/sec
– 8 MB L-2 cache per CPU @ 45 nsec & 6.4 GB/sec
– 8 GB RAM per node @ 175 nsec & 14.0 GB/sec
– ~2000 (?) MB/sec switch bandwidth
– 9 GB local disk (two-way RAID)
Programming Models, Phase 1
• Phase 1 will rely on MPI, with threading available via:
– OpenMP directives
– Pthreads
– IBM SMP directives
• MPI now does intra-node communications efficiently
• Mixed-model programming is not currently very advantageous
• PVM and LAPI messaging systems are also available
• SHMEM is “planned”…
• The SP has cache and virtual memory, which means:
– There are more ways to reduce code performance
– There are more ways to lose portability
Programming Models, Phase 2
• Phase 2 will offer more payback for mixed-model programming:
– Single-node parallelism is a good target for PVP users
– Vector and shared-memory codes can be “expanded” into MPI
– MPI codes can be ported from the T3E
– Threading can be added within MPI
• In either case, re-engineering will be required to exploit new and different levels of granularity
• This can be done along with increasing problem sizes
Porting Considerations, part 1
• Things to watch out for in porting codes to the SP
– Cache
• Not enough on the T3E to make worrying about it worth the trouble
• Enough on the SP to boost performance, if it’s used well
• Tuning for cache is different from tuning for vectorization
• False sharing of cache lines can reduce performance
– Virtual memory
• Gives you access to 1.75 GB of (virtual) RAM address space
• To use all of virtual (or even real) memory, must explicitly request “segments”
• Causes performance degradation due to paging
– Data types
• Default sizes are different on PVP, T3E, and SP systems
• “integer”, “int”, “real”, and “float” must be used carefully
• Best to say what you mean: “real*8”, “integer*4” (see the sketch after this list)
• Do the same in MPI calls: “MPI_REAL8”, “MPI_INTEGER4”
• Be careful with intrinsic function use, as well
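As a minimal sketch of this advice (variable names hypothetical; a broadcast is used only because it is safe at any task count), the declarations and the MPI datatype names state the sizes explicitly, so they agree on every platform:

      program sizes
      implicit none
      include 'mpif.h'
      integer*4 ierr, n               ! explicit: integer*4, not "integer"
      real*8    x(100)                ! explicit: real*8, not "real"

      call MPI_INIT(ierr)
      n = 100
      x = 0.0d0
      ! The MPI datatype names match the declared Fortran sizes:
      call MPI_BCAST(x, n, MPI_REAL8,    0, MPI_COMM_WORLD, ierr)
      call MPI_BCAST(n, 1, MPI_INTEGER4, 0, MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end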
Porting Considerations, part 2
• More things to watch out for in porting codes to the SP
– Arithmetic
• Architecture tuning can help exploit special processor instructions
• Both T3E and SP can optimize beyond IEEE arithmetic
• T3E and PVP can also do fast reduced precision arithmetic
• Compiler options on T3E and SP can force IEEE compliance
• Compiler options can also throttle other optimizations for safety
• Special libraries offer faster intrinsics
– MPI
• SP compilers and runtime will catch loose usage that was accepted on the T3E
• Communication bandwidth on SP Phase 1 is lower than on the T3E
• Message latency on the SP Phase 1 is higher than on the T3E
• We expect approximate parity with T3E in these areas, on the Phase 2 system
• Limited number of communication ports per node - approximately one per CPU
• “Default” versus “eager” buffer management in MPI_SEND
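To see why the buffer-management point matters, here is a minimal sketch assuming exactly two tasks (buffer size and tag arbitrary): a pair of simultaneous blocking MPI_SENDs may appear to work while messages fit in the eager buffers, then hang once the library switches to rendezvous for larger messages. MPI_SENDRECV pairs each send with a receive and assumes no buffering at all:

      program exchange
      implicit none
      include 'mpif.h'
      integer ierr, me, np, other
      integer status(MPI_STATUS_SIZE)
      real*8  sbuf(100000), rbuf(100000)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
      other = 1 - me                   ! assumes exactly two tasks
      sbuf = dble(me)
      ! Two simultaneous blocking MPI_SENDs would depend on eager
      ! buffering; MPI_SENDRECV is safe regardless of message size.
      call MPI_SENDRECV(sbuf, 100000, MPI_REAL8, other, 0,
     &                  rbuf, 100000, MPI_REAL8, other, 0,
     &                  MPI_COMM_WORLD, status, ierr)
      call MPI_FINALIZE(ierr)
      end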
Porting Considerations, part 3
• Compiling & linking
– The compiler “version” to invoke depends on language and parallelization scheme
• Language version:
– Fortran 77: f77, xlf
– Fortran 90: xlf90
– Fortran 95: xlf95
– C: cc, xlc, c89
– C++: xlC
• MPI-included: mpxlf, mpxlf90, mpcc, mpCC
• Thread-safe: xlf_r, xlf90_r, xlf95_r, mpxlf_r, mpxlf90_r
– Preprocessing can be ordered by compiler flag or source file suffix
• Use consistently, for all related compilations; the following may NOT produce a parallel executable, because the link step omits the MPI wrapper:
mpxlf90 -c *.F
xlf90 -o foo *.o
• Use the -bmaxdata:bytes option to get more than a single 256 MB data segment (up to 7 segments, or ~1.75 GB, can be specified; only 3, or 0.75 GB, are real memory)
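By contrast, a consistent build might look like the following sketch (file and program names hypothetical); the MPI wrapper is used for both the compile and the link steps, and -bmaxdata requests the full 7 x 256 MB (0x70000000 bytes) of data address space:

mpxlf90 -c -O3 solver.F grid.F
mpxlf90 -o myapp solver.o grid.o -bmaxdata:0x70000000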
Porting: MPI
• MPI codes should port relatively well
• Use one MPI task per node or processor
– One per node during porting
– One per processor during production
– Let MPI worry about where it’s communicating to
– Environment variables, execution parameters, and/or batch options can specify (see the example after this list):
• # tasks per node
• Total # tasks
• Total # processors
• Total # nodes
• Communications subsystem in use
– User Space is best in batch jobs
– IP may be best for interactive developmental runs
• There is a debug queue/class in batch
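As an illustration (program name and counts hypothetical, assuming the standard POE options -procs, -nodes, -tasks_per_node, and -euilib), the same choices can be made on the poe command line:

poe ./myapp -nodes 4 -tasks_per_node 2 -procs 8 -euilib us
poe ./myapp -procs 4 -euilib ip

The first form asks for User Space communication, appropriate for batch-style work; the second falls back to IP for an interactive development run.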
Porting: Shared Memory
• Don’t throw away old shared memory directives
– OpenMP will work as is
– Cray Tasking directives will be useful for documentation
– We recommend porting Cray directives to OpenMP (see the sketch after this list)
– Even small-scale parallelism can be useful
– Larger scale parallelism will be available next year
• If your problems and/or algorithms will scale to larger granularities and greater parallelism, prepare for message passing
– We recommend MPI
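A minimal sketch of that conversion (loop body illustrative; the Cray form is autotasking syntax):

! Before, with a Cray autotasking directive:
CMIC$ DO ALL PRIVATE(I) SHARED(A, N)
      do I = 1, N
         A(I) = 2.0 * A(I)
      enddo

! After, with the equivalent OpenMP directive:
!$OMP PARALLEL DO PRIVATE(I) SHARED(A, N)
      do I = 1, N
         A(I) = 2.0 * A(I)
      enddo
!$OMP END PARALLEL DO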
From Loop-slicing to MPI, before...

      allocate(A(1:imax, 1:jmax))
! Slice the loop nest across the node's CPUs with OpenMP
!$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
      do I = 1, imax
         do J = 1, jmax
            A(I,J) = deep_thought(A, I, J, ...)
         enddo
      enddo
• Sanity checking
– Run the program on one CPU to get baseline answers
– Run on several CPUs to see parallel speedups and answers
• Optimization
– Consider changing memory access patterns to improve cache usage
– How big can your problem get before you run out of real memory?
From Loop-slicing to MPI, after...

      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
! my_indices is a user routine that computes this task's index ranges;
! the neighbor ranks (left_id, right_id, top_id, bottom_id), buffer
! sizes, and status array (integer, size MPI_STATUS_SIZE) used below
! are illustrative and would be set up similarly
      call my_indices(my_id, nprocs, my_imin, my_imax, my_jmin, my_jmax)
      allocate(A(my_imin : my_imax, my_jmin : my_jmax))
!$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, my_imin, my_imax, my_jmin, my_jmax)
      do I = my_imin, my_imax
         do J = my_jmin, my_jmax
            A(I,J) = deep_thought(A, I, J, ...)
         enddo
      enddo
! Communicate the shared values with neighbors...
! Odd tasks send first and even tasks receive first, so the pairwise
! exchange cannot deadlock even without eager buffering
      if (mod(my_id, 2) .eq. 1) then
         call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_id, tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_id, tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_top(...), topsize, MPI_REAL, top_id, tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
      else
         call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_id, tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_id, tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_top(...), topsize, MPI_REAL, top_id, tag, MPI_COMM_WORLD, ierr)
      endif
From Loop-slicing to MPI, after...
• You now have one MPI task and many OpenMP threads per node
– The MPI task does all the communicating between nodes
– The OpenMP threads do the parallelizable work
– Do NOT use MPI within an OpenMP parallel region
• Sanity checking
– Run on one node and one CPU to check baseline answers
– Run on one node and several CPUs to see parallel speedup and answers
– Run on several nodes, one CPU per node, and check answers
– Run on several nodes, several CPUs per node, and check answers
• Scaling checking
– Run a larger version of a similar problem on the same set of ensemble sizes
– Run the same sized problem on a larger ensemble
• (Re-)Consider your I/O strategy…
From MPI to Loop-slicing
• Add OpenMP directives to existing code
• Perform sanity and scaling checks, as before
• Results in the same overall code structure as on the previous slides
– One MPI task and several OpenMP threads per node
• For irregular codes, Pthreads may serve better, at the cost of increased complexity
• Nobody really expects it to be this easy...
Using the Machine, part 1
• Somewhat similar to the Crays
– Interactive and batch jobs are possible

Class        Max Nodes   Max Processors   Max Time     Priority
debug         16          32              30 minutes   20000
premium      256         512              4 hours      10000
regular      256         512              4 hours       5000
low          256         512              4 hours          1
interactive    8          16              20 minutes   15000
Using the Machine, part 2
• Interactive runs
– Sequential executions run immediately on your login node
– Every login will likely put you on a different node, so be careful when looking for your executions; “ps” returns information only about the node you’re logged into
– Small scale parallel jobs may be rejected if LoadLeveler can’t find the resources
– There are two pools of nodes that can be used for interactive jobs:
• Login nodes
• A small subset of the compute nodes
– Parallel execution can often be achieved by:
• Trying again, after initial rejection
• Changing communication mechanisms from User Space to IP
• Using the other pool
Using the Machine, part 3
• Batch jobs
– Currently, very similar in capability to the T3E
• Similar run times, processor counts
• More memory available on the SP
– Limits and capabilities may change, as we learn the machine
– LoadLeveler is similar to, but simpler than, NQE/NQS on the T3E
• Jobs are submitted, monitored, and cancelled by special commands (llsubmit, llq, llcancel)
• Each batch job requires a script that is essentially a shell script (see the sketch after this list)
• The first few lines contain batch options that look like comments to the shell
• The rest of the script can contain any shell constructs
• Scripts can be debugged by executing them interactively
• Users are limited to 3 running jobs, 10 queued jobs, and 30 submitted jobs, at any given time
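A minimal sketch of such a script, assuming typical LoadLeveler keywords (class, counts, and names hypothetical):

#!/usr/bin/csh
#@ job_type         = parallel
#@ class            = regular
#@ node             = 4
#@ tasks_per_node   = 2
#@ wall_clock_limit = 4:00:00
#@ output           = myjob.$(jobid).out
#@ error            = myjob.$(jobid).err
#@ queue

# Below the options, which look like comments to the shell,
# any ordinary shell constructs may appear:
cd $SCRATCH
./myapp

Such a script would be submitted with llsubmit, monitored with llq, and cancelled with llcancel.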
Using the Machine, part 4
• File systems
– Use the environment variables to let the system manage your file usage
– Sequential work can be done in $HOME (not backed up) or $TMPDIR (transient)
• Medium performance, node-local
– Parallel work can be done in $SCRATCH (transient) or /scratch/username (purgeable)
• High performance, located in GPFS
– HPSS is available from batch jobs via HSI, and interactively via FTP, PFTP, and HSI (see the example after this list)
– There are quotas on space and inode usage
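For example (file and directory names hypothetical), a batch job might archive its results to HPSS with HSI:

hsi "put $SCRATCH/results.dat : run42/results.dat"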
Using the Machine, part 5
• The future?
– The allowed scale of parallelism (CPU counts) may change
• Max now = 512 CPUs, same as on the T3E
– The allowed duration of runs may change
• Max now = 4 hours; max on the T3E = 12 hours
– The size of possible problems will definitely change
• More CPUs in Phase 1 than on the T3E
• More memory per CPU, in both phases, than on the T3E
– The amount of work possible per unit time will definitely change
• CPUs in both phases are faster than those on the T3E
• The Phase 2 interconnect will be faster than Phase 1's
– Better machine management
• Checkpointing will be available
• We will learn what can be adjusted in the batch system
– There will be more and better tools for monitoring and tuning
• HPM, KAP, Tau, PAPI...
– Some current problems will go away (e.g., memory-mapped files)