Scaling Up MPI and MPI-I/O on seaborg.nersc.gov

David Skinner, NERSC Division, Berkeley Lab




Scaling: Motivation

• NERSC’s focus is on capability computation
– Capability == jobs that use ¼ or more of the machine’s resources

• Parallelism can deliver scientific results unattainable on workstations.

• “Big Science” problems are more interesting!


Scaling: Challenges

• CPUs are outpacing memory bandwidth and switches, leaving FLOPs increasingly isolated.

• Vendors often have in-house machines less than half the size of NERSC’s, so system software may be operating in uncharted regimes:
– MPI implementation

– Filesystem metadata systems

– Batch queue system

• NERSC consultants can help

Users need information on how to mitigate the impact of these issues in large-concurrency applications.


Seaborg.nersc.gov

MP_EUIDEVICE (switch fabric) | MPI Bandwidth (MB/sec)   | MPI Latency (usec)
css0                         | 500 / 350                | 8 / 16
css1                         |                          |
csss                         | 500 / 350 (single task)  | 8 / 16


Switch Adapter Bandwidth: csss


Switch Adapter Comparison

(Plot: throughput of csss vs. css0 as a function of message size.)

Tune message size to optimize throughput.


Switch Adapter Considerations

• For data-decomposed applications with some locality, partition the problem along SMP boundaries (minimize the surface-to-volume ratio)

• Use MP_SHAREDMEMORY to minimize switch traffic

• csss is most often the best route to the switch


Job Start Up times


Synchronization

• On the SP each SMP image is scheduled independently, and while user code is waiting the OS will schedule other tasks

• A fully synchronizing MPI call requires everyone’s attention

• By analogy, imagine trying to go to lunch with 1024 people

• Probability that everyone is ready at any given time scales poorly


Scaling of MPI_Barrier()


Load Balance

• If one task lags the others in time to completion, synchronization suffers; e.g. a 3% slowdown in one task can mean a 50% slowdown for the code overall, since every other task waits for the laggard at each synchronization point

• Seek out and eliminate sources of variation (a timing sketch follows below)

• Distribute the problem uniformly among nodes/CPUs

(Timeline plot: tasks 0 through 3 over 0 to 100 time units, broken into FLOP, I/O, and SYNC phases.)
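One way to seek out such variation is to time the work phase on every task and compare across tasks. The fragment below is a minimal sketch of that idea; work_phase() and the choice of reductions are illustrative, not taken from the slides.

  #include <mpi.h>
  #include <stdio.h>

  /* time one work phase per task, then report the spread across tasks */
  void report_imbalance(void (*work_phase)(void))
  {
      double t0 = MPI_Wtime();
      work_phase();                      /* the application's compute/I-O step */
      double t = MPI_Wtime() - t0;

      double tmin, tmax;
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Reduce(&t, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
      MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

      if (rank == 0)   /* a large max/min gap points to a load balance problem */
          printf("work phase: min %.3f s, max %.3f s\n", tmin, tmax);
  }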


Synchronization: MPI_Bcast 2048 tasks


Synchronization: MPI_Alltoall 2048 tasks


Synchronization (continued)

• MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above

• Use MPI_Bcast if possible; it is not fully synchronizing

• Remove unneeded MPI_Barrier calls

• Use immediate sends (MPI_Isend) and asynchronous I/O when possible (see the sketch below)
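As a minimal illustration of the last bullet, the fragment below posts an immediate receive and send and overlaps them with local work; the buffers, counts, and neighbor ranks are placeholders rather than anything from the slides.

  #include <mpi.h>

  /* exchange n doubles with two neighbors without blocking on the send */
  void exchange(double *sendbuf, double *recvbuf, int n, int left, int right)
  {
      MPI_Request req[2];
      MPI_Status  stat[2];

      MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
      MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

      /* ... do local work that does not depend on recvbuf ... */

      MPI_Waitall(2, req, stat);   /* complete both operations before using recvbuf */
  }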


Improving MPI Scaling on Seaborg


The SP switch

• Use MP_SHAREDMEMORY=yes (default)

• Use MP_EUIDEVICE=csss (default)

• Tune message sizes

• Reduce synchronizing MPI calls


64 bit MPI

• 32 bit MPI has inconvenient memory limits
– 256MB per task default and 2GB maximum
– 1.7GB can be used in practice, but this depends on MPI usage
– The scaling of this internal usage is complicated, but larger-concurrency jobs have more of their memory “stolen” by MPI’s internal buffers and pipes

• 64 bit MPI removes these barriers
– 64 bit MPI is fully supported
– Just remember to use the “_r” compilers and “-q64”

• Seaborg has 16, 32, and 64 GB per node available


How to measure MPI memory usage?

(Plot: memory usage measured for a 2048-task job.)
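The slides do not say how the measurement was made; one common in-code approach is getrusage(), sketched below (the units of ru_maxrss vary by platform, so treat the numbers accordingly).

  #include <stdio.h>
  #include <sys/resource.h>
  #include <mpi.h>

  /* print each task's maximum resident set size so far */
  void print_memory_usage(void)
  {
      struct rusage ru;
      int rank;

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (getrusage(RUSAGE_SELF, &ru) == 0)
          printf("task %d: maxrss = %ld kB\n", rank, (long)ru.ru_maxrss);
  }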


MP_PIPE_SIZE: per-task pipe memory scales as 2 * PIPE_SIZE * (ntasks - 1)
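As an illustration only (64 KB is an assumed pipe size, not a documented default): with a 64 KB pipe and 2048 tasks the formula gives 2 x 64 KB x 2047, roughly 256 MB of pipe buffering per task, which is how large-concurrency 32-bit jobs lose so much address space to MPI internals.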


OpenMP

• Using a mixed model, even when no underlying fine-grained parallelism is present, can take strain off of the MPI implementation; e.g. on seaborg a 2048-way job can run with only 128 MPI tasks and 16 OpenMP threads

• Having hybrid code whose concurrencies can be tuned between MPI tasks and OpenMP threads has portability advantages (a minimal hybrid sketch follows)
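A minimal hybrid skeleton, assuming MPI_THREAD_FUNNELED support and OMP_NUM_THREADS set in the batch script (e.g. 128 tasks x 16 threads for a 2048-way run); it is a sketch, not the slides' code.

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, provided;

      /* FUNNELED: only the master thread of each task makes MPI calls */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      #pragma omp parallel
      {
          /* coarse-grained, per-node work goes inside the parallel region */
          #pragma omp master
          if (rank == 0)
              printf("%d MPI tasks x %d threads = %d-way\n",
                     size, omp_get_num_threads(), size * omp_get_num_threads());
      }

      MPI_Finalize();
      return 0;
  }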


Beware Hidden Multithreading

• ESSL and IBM Fortran have autotasking-like “features” which work by creating unspecified numbers of threads.

• The Fortran RANDOM_NUMBER intrinsic has some well-known scaling problems:

http://www.nersc.gov/projects/scaling/random_number.html

• XLF: “-qsmp=auto” uses threads to auto-parallelize code

• ESSL: libesslsmp.a has an autotasking feature

• Synchronization problems are hard to predict when using these features, and performance suffers when too many threads are created.


MP_LABELIO, phost

• Labeled I/O will let you know which task generated the message “segmentation fault”, gave the wrong answer, etc.

export MP_LABELIO=yes

• Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks
– MPI and LAPI versions are available
– Host lists are useful in general


Core files

• Core dumps don’t scale (no parallel work)

• MP_COREDIR=none (no corefile I/O)

• MP_COREFILE_FORMAT=light_core (less I/O)

• Use an LL script to save just one full-fledged core file and throw away the others:

  …
  if [ "$MP_CHILD" != "0" ]; then
      export MP_COREDIR=/dev/null
  fi
  …


Debugging

• In general, debugging at 512 tasks and above is error-prone and cumbersome.

• Debug at a smaller scale when possible.

• Use the shared-memory device of MPICH on a workstation with lots of memory as a mock-up of a high-concurrency environment.

• For crashed jobs, examine the LL logs for the memory usage history (ask a NERSC consultant for help with this).


Parallel I/O

• Parallel I/O can be a significant source of variation in task completion times prior to synchronization

• Limit the number of readers or writers when appropriate (one throttling sketch follows below). Pay attention to file creation rates.

• Output reduced quantities when possible
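One hedged way to limit concurrent writers is to let tasks take turns in fixed-size groups; the group width and the write_files() placeholder below are illustrative choices, not taken from the slides.

  #include <mpi.h>

  /* let only `width` tasks create and write files at a time, in rounds */
  void staggered_io(int width, void (*write_files)(void))
  {
      int rank, size, turn, nturns;

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      nturns = (size + width - 1) / width;
      for (turn = 0; turn < nturns; turn++) {
          if (rank / width == turn)
              write_files();               /* this group's turn to do I/O    */
          MPI_Barrier(MPI_COMM_WORLD);     /* bounds the file creation rate  */
      }
  }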


Summary

• Resources are present to face the challenges posed by scaling up MPI applications on seaborg.

• Hopefully, scientists will expand their problem scopes to tackle increasingly challenging computational problems.

• NERSC consultants can provide help in achieving scaling goals.


Scaling of Parallel I/O on GPFS


Motivation

• NERSC uses GPFS for $HOME and $SCRATCH

• Local disk filesystems on seaborg (/tmp) are tiny

• Growing data sizes and concurrencies often outpace I/O methodologies


GPFS@Seaborg.nersc.gov

Each compute node relies on the GPFS nodes as gateways to storage.

16 nodes are dedicated to serving GPFS filesystems.


Common Problems when Implementing Parallel IO

• CPU utilization suffers as time is lost to I/O

• Variation in write times can be severe, leading to batch job failure

(Chart: time (s) to write 100 GB, iterations 1 through 4.)


Finding solutions

• Checkpoint (saving state) IO pattern

• Survey strategies to determine the rate and variation in rate


Parallel I/O Strategies


Multiple File I/O

/* each task writes its own file, optionally inside its own directory */
if (private_dir) rank_dir(1, rank);   /* enter a per-task directory   */
fp = fopen(fname_r, "w");             /* fname_r: per-task file name  */
fwrite(data, nbyte, 1, fp);
fclose(fp);
if (private_dir) rank_dir(0, rank);   /* leave the per-task directory */
MPI_Barrier(MPI_COMM_WORLD);
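rank_dir() is not shown on the slide; presumably it creates and enters (or leaves) a per-task directory. A hypothetical version, purely as an assumption about what such a helper could look like:

  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* hypothetical helper: enter==1 makes and enters a per-task directory,
     enter==0 steps back out; error handling omitted for brevity */
  void rank_dir(int enter, int rank)
  {
      char dname[64];

      if (enter) {
          sprintf(dname, "rank_%05d", rank);
          mkdir(dname, 0700);
          chdir(dname);
      } else {
          chdir("..");
      }
  }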


Single File I/O

/* every task writes its nbyte block into one shared file at offset rank*nbyte */
fd = open(fname, O_CREAT | O_RDWR, S_IRUSR);
lseek(fd, (off_t)rank * (off_t)nbyte, SEEK_SET);
write(fd, data, nbyte);
close(fd);


MPI-I/O

/* MPIIO_FILE_HINT0 stands for a hint key/value pair, e.g. "IBM_largeblock_io", "true" */
MPI_Info_set(mpiio_file_hints, MPIIO_FILE_HINT0);
MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_CREATE | MPI_MODE_RDWR,
              mpiio_file_hints, &fh);
/* each task views the shared file starting at its own byte offset */
MPI_File_set_view(fh, (MPI_Offset)rank * (MPI_Offset)nbyte, MPI_DOUBLE, MPI_DOUBLE,
                  "native", mpiio_file_hints);
MPI_File_write_all(fh, data, ndata, MPI_DOUBLE, &status);   /* collective write */
MPI_File_close(&fh);


Results


Scaling of single file I/O


Scaling of multiple file and MPI I/O


Large block I/O

• MPI I/O on the SP includes the file hint IBM_largeblock_io

• IBM_largeblock_io=true was used throughout; the default value shows large variation (see the MPI_Info sketch below)

• IBM_largeblock_io=true also turns off data shipping
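The slides do not show how the hint is attached; one standard way to build the mpiio_file_hints object used in the MPI-I/O fragment is through MPI_Info, sketched here.

  #include <mpi.h>

  /* build an info object carrying the IBM_largeblock_io hint for MPI_File_open */
  MPI_Info make_hints(void)
  {
      MPI_Info mpiio_file_hints;

      MPI_Info_create(&mpiio_file_hints);
      /* hint keys and values are plain strings */
      MPI_Info_set(mpiio_file_hints, "IBM_largeblock_io", "true");
      return mpiio_file_hints;
  }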


Large block I/O = false

• MPI on the SP includes the file hint IBM_largeblock_io

• Except in the plot above, IBM_largeblock_io=true was used throughout

• IBM_largeblock_io=true also turns off data shipping


Bottlenecks to scaling

• Single file I/O has a tendency to serialize

• Scaling up with multiple files creates filesystem (metadata) problems

• Akin to data shipping, consider the intermediate case: aggregate I/O within each SMP node (sketched below)
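A minimal sketch of that intermediate case, assuming 16 tasks per SMP node and block task placement (both assumptions, as is the write_block() placeholder): gather each node's data to one task and let only those tasks touch the filesystem.

  #include <mpi.h>
  #include <stdlib.h>

  void aggregated_write(double *data, int ndata)
  {
      const int tasks_per_node = 16;      /* seaborg nodes are 16-way SMPs      */
      int rank, node_rank, node_size;
      MPI_Comm node_comm;
      double *agg = NULL;

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* tasks with the same color end up in the same (per-node) communicator */
      MPI_Comm_split(MPI_COMM_WORLD, rank / tasks_per_node, rank, &node_comm);
      MPI_Comm_rank(node_comm, &node_rank);
      MPI_Comm_size(node_comm, &node_size);

      if (node_rank == 0)
          agg = (double *)malloc((size_t)ndata * node_size * sizeof(double));

      /* funnel the node's data to its local rank 0 */
      MPI_Gather(data, ndata, MPI_DOUBLE, agg, ndata, MPI_DOUBLE, 0, node_comm);

      if (node_rank == 0) {
          /* write_block(agg, ndata * node_size);  placeholder for one of the
             earlier strategies: multiple file, single file, or MPI-I/O        */
          free(agg);
      }
      MPI_Comm_free(&node_comm);
  }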


Parallel IO with SMP aggregation (32 tasks)


Parallel IO with SMP aggregation (512 tasks)


Summary

(Summary chart: I/O strategies (Serial, Multiple File, Multiple File mod n, MPI IO, MPI IO collective) as a function of concurrency (16 to 2048 tasks) and aggregate data size (1 MB to 100 GB).)

