Parallel I/O Course



    Parallel I/O

    SciNet

    www.scinet.utoronto.ca

    University of Toronto

    Toronto, Canada

    October 19, 2010


    Outline / Schedule

    1 Introduction

    2 File Systems and I/O

    3 Data Management

    4 Break

    5 Parallel I/O

    6 MPI-IO

    7 HDF5/NETCDF


    Disk I/O

    Common Uses

    Checkpoint/Restart Files

    Data Analysis

    Data Organization

    Time accurate and/or Optimization Runs

    Batch and Data processing

    Database


    Disk I/O

    Common Bottlenecks

    Mechanical disks are slow!

    System call overhead (open, close, read, write)

    Shared file systems (NFS, Lustre, GPFS, etc.)

    HPC systems are typically designed for high bandwidth (GB/s), not IOPs

    Uncoordinated independent accesses


    Disk Access Rates over Time

    Figure by R. Ross, Argonne National Laboratory, CScADS09


    Memory/Storage Latency

    Figure by R. Freitas and L. Chiu, IBM Almaden Labs, FAST10


    Definitions

    IOPs: Input/Output Operations Per Second (read, write, open, close, seek)

    I/O Bandwidth

    Quantity you read/write (think network bandwidth)

    Comparisons

    Device      Bandwidth (MB/s)   per-node   IOPs     per-node
    SATA HDD    100                100        100      100
    SSD         250                250        4000     4000
    SciNet      5000               1.25       30000    7.5


    SciNet Filesystem

    File System

    1,790 1TB SATA disk drives, for a total of 1.4 PB

    Two DCS9900 couplets, each delivering:
      4-5 GB/s read/write access (bandwidth)
      30,000 IOPs max (open, close, seek, ...)

    Single GPFS file system on TCS and GPC

    I/O goes over the Gb ethernet network on GPC (InfiniBand on TCS)

    File system is parallel!


    I/O Software Stack

    Parallel File System (series of figure slides)


    Parallel File System

    File Locks

    Most parallel file systems use locks to manage concurrent file access

    Files are broken up into lock units

    Clients obtain locks on units that they will access before I/O occurs

    Enables caching on clients as well (as long as a client holds a lock,
    it knows its cached data is valid)

    Locks are reclaimed from clients when others desire access


    Parallel File System

    Optimal for large shared files.

    Behaves poorly under many small reads and writes, high IOPs

    Your use of it affects everybody!
    (Different from the CPU and RAM, which are not shared.)

    How you read and write, your file format, the number of files in a
    directory, and how often you ls, affects every user!

    The file system is shared over the ethernet network on GPC:
    hammering the file system can hurt process communications.

    File systems are not infinite!
    Bandwidth, metadata, IOPs, number of files, space, ...


    Parallel File System

    2 jobs doing simultaneous I/O can take much longer than twice a single
    job's duration, due to disk contention and directory locking.

    SciNet: 500+ users doing I/O from 4000 nodes.
    That's a lot of sharing and contention!


    I/O Best Practices

    Make a plan

    Make a plan for your data needs:
      How much will you generate?
      How much do you need to save?
      And where will you keep it?

    Note that /scratch is temporary storage for 3 months or less.

    Options?

    1 Save on your departmental/local server/workstation
      (it is possible to transfer TBs per day on a gigabit link);

    2 Apply for a project space allocation at the next RAC call
      (but space is very limited);

    3 Buy tapes through us ($100/TB) and we can archive your data to tape;
      HSM possibility within next 6 months;

    4 Change storage format.


    I/O Best Practices

    Monitor and control usage

    Minimize use of filesystem commands like ls and du.
    Regularly check your disk usage using /scinet/gpc/bin/diskUsage.

    Warning signs which should prompt careful consideration:

    More than 100,000 files in your space
    Average file size less than 100 MB

    Monitor disk actions with top and strace

    RAM is always faster than disk; think about using ramdisk.

    Use gzip to compress files and tar to bundle many files into one

    Try gzipping your data files. 30% compression is not atypical!

    Delete files that are no longer needed

    Do housekeeping (gzip, tar, delete) regularly.


    I/O Best Practices

    Do's

    Write binary format files.
    Faster I/O and less space than ASCII files.

    Use parallel I/O if writing from many nodes

    Maximize size of files. Large block I/O optimal!

    Minimize number of files. Makes filesystem more responsive!

    Don'ts

    Don't write lots of ASCII files. Lazy, slow, and wastes space!

    Don't write many hundreds of files in a single directory. (File locks!)

    Don't write many small files (< 10 MB).
    The system is optimized for large-block I/O.


    Data Management

    Formats

    ASCII

    Binary

    Metadata (XML)

    Databases

    Standard libraries (HDF5, NetCDF)


    ASCII

    American Standard Code for Information Interchange

    Pros

    Human Readable

    Portable (architecture independent)

    Cons

    Inefficient Storage

    Expensive for Read/Write (conversions)


    Native Binary


    Pros

    Efficient Storage (256 floats @ 4 bytes takes 1024 bytes)

    Efficient Read/Write (native)

    Cons

    Have to know the format to read

    Portability (Endianness)


    ASCII vs. binary

    Writing 128M doubles

    Format    /scratch (GPFS)    /dev/shm (RAM)    /tmp (disk)
    ASCII     173s               174s              260s
    Binary    6s                 1s                20s

    Syntax

    Format    C            Fortran
    ASCII     fprintf()    open(6,file='test',form='formatted')
                           write(6,*)
    Binary    fwrite()     open(6,file='test',form='unformatted')
                           write(6)
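
    A minimal sketch (not from the slides) contrasting the two rows above:
    writing the same array of doubles in ASCII with fprintf() versus in raw
    binary with fwrite().

    #include <stdio.h>

    int main(void)
    {
        double x[1000];
        int i;
        for (i = 0; i < 1000; i++)
            x[i] = i*0.001;

        /* ASCII: one formatted conversion per value -- large and slow */
        FILE *fa = fopen("data.txt", "w");
        for (i = 0; i < 1000; i++)
            fprintf(fa, "%.17g\n", x[i]);
        fclose(fa);

        /* Binary: one bulk write of the in-memory representation */
        FILE *fb = fopen("data.bin", "wb");
        fwrite(x, sizeof(double), 1000, fb);
        fclose(fb);

        return 0;
    }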


    What is Metadata?

    Data about data
    File system: size, location, date, owner, etc.

    App Data: File format, version, iteration, etc.

    Example: XML

    (XML example; the tags were not preserved in this transcript. Field
    values included: UTF encoding, 1000, 6.8, January 15th, 2010, and the
    coordinates 47 23.516 -122 02.625.)

    Databases


    Beyond flat files

    Very powerful and flexible storage approach

    Data organization and analysis can be greatly simplified

    Enhanced performance over seek/sort depending on usage

    Open-source software:
      SQLite (serverless)
      PostgreSQL
      MySQL
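
    As a minimal illustration of the serverless SQLite option above (the
    table name and schema here are invented for the example; build with
    -lsqlite3):

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("runs.db", &db) != SQLITE_OK)
            return 1;
        /* One database file instead of many small flat files */
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS results (step INTEGER, energy REAL);"
            "INSERT INTO results VALUES (100, -3.21);",
            NULL, NULL, &err);
        if (err) {
            fprintf(stderr, "%s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }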

    Standard Formats


    CGNS (CFD General Notation System)

    IGES/STEP (CAD Geometry)

    HDF5 (Hierarchical Data Format)

    NetCDF (Network Common Data Format)


    HDF5/NetCDF - Serial




    Common Ways of Doing Parallel I/O


    Sequential I/O (only proc 0 writes/reads)

    Pro
      Trivially simple for small I/O
      Some I/O libraries are not parallel anyway

    Con
      Bandwidth limited by the rate one client can sustain
      May not have enough memory on the node to hold all the data
      Won't scale (built-in bottleneck)
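
    A minimal sketch (not from the slides) of this pattern: gather all data
    to rank 0, which is the only process that touches the file.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, nlocal = 1000, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *local = malloc(nlocal*sizeof(double));
        for (i = 0; i < nlocal; i++)
            local[i] = rank + i*1e-6;           /* some per-rank data */

        double *all = NULL;
        if (rank == 0)
            all = malloc((size_t)nprocs*nlocal*sizeof(double));

        /* Funnel everything to rank 0 ... */
        MPI_Gather(local, nlocal, MPI_DOUBLE,
                   all, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* ... which does all the I/O: note the memory and bandwidth
           bottleneck on that one process */
        if (rank == 0) {
            FILE *f = fopen("output.bin", "wb");
            fwrite(all, sizeof(double), (size_t)nprocs*nlocal, f);
            fclose(f);
            free(all);
        }
        free(local);
        MPI_Finalize();
        return 0;
    }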


    N files for N Processes

    Pro

    No interprocess communication or coordination necessary
    Possibly better scaling than single sequential I/O

    Con

    As process counts increase: lots of (small) files; won't scale
    Data often must be post-processed into one file
    Uncoordinated I/O may swamp the file system (file locks!)
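
    A minimal sketch (not from the slides) of the one-file-per-process
    pattern: each rank writes its own file, named after its rank.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local[1000] = {0};
        char fname[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* e.g. output.0003.bin -- no coordination needed to write it ... */
        snprintf(fname, sizeof(fname), "output.%04d.bin", rank);
        FILE *f = fopen(fname, "wb");
        fwrite(local, sizeof(double), 1000, f);
        fclose(f);
        /* ... but at large process counts this leaves N files to manage */

        MPI_Finalize();
        return 0;
    }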


    All Processes Access One File

    Pro
      Only one file
      Data can be stored canonically, avoiding post-processing
      Will scale if done correctly

    Con

    Uncoordinated I/O WILL swamp the file system (file locks!)
    Requires more design and thought


    What is Parallel I/O?

    Multiple processes of a parallel program accessing data (reading or
    writing) from a common file.


    Why Parallel I/O?

    Non-parallel I/O is simple but:

    Poor performance (single process writes to one file)

    Awkward and not interoperable with other tools (each process writes a
    separate file)

    Parallel I/O

    Higher performance through collective and contiguous I/O
    Single file (visualization, data management, storage, etc.)

    Works with file system not against it

    Contiguous and Noncontiguous I/O


    Contiguous I/O moves data from a single memory block into a single file block

    Noncontiguous I/O has three forms:

    Noncontiguous in memory, in file, or in both

    Structured data leads naturally to noncontiguous I/O
    (e.g. block decomposition)

    Describing noncontiguous accesses with a single operation passes more
    knowledge to the I/O system

    Independent and Collective I/O


    Independent I/O operations specify only what a single process will do

    These calls obscure relationships between I/O on other processes

    Many applications have phases of computation and I/O

    During I/O phases, all processes read/write data
    We can say they are collectively accessing storage

    Collective I/O is coordinated access to storage by a group of processes

    Its functions are called by all processes participating in I/O
    Allows the file system to know more about the access as a whole:
    more optimization in lower software layers, better performance


    Available Approaches

    MPI-IO: MPI-2 Language Standard

    HDF (Hierarchical Data Format)

    NetCDF (Network Common Data Format)

    Adaptable IO System (ADIOS)
      Actively developed (OLCF, Sandia NL, Georgia Tech) and used on the
      largest HPC systems (Jaguar, Blue Gene/P)
      External to the code: an XML file describes the various elements
      Uses MPI-IO, can work with HDF/NetCDF


    MPI-IO


    MPI: Message Passing Interface

    Language-independent communications protocol used to program parallel
    computers.

    MPI-IO: Parallel file access protocol

    MPI-IO: The parallel I/O part of the MPI-2 standard (1996).

    Many other parallel I/O solutions are built upon it.

    Versatile and better performance than standard Unix I/O.
    Usually collective I/O is the most efficient.


    Advantages of MPI-IO

    noncontiguous access of files and memory

    collective I/O

    individual and shared file pointers

    explicit offsets

    portable data representation

    can give hints to implementation/file system

    no text/formatted output!


    MPI concepts

    Process: an instance of your program, often 1 per core.

    Communicator: a group of processes and their topology.
    Standard communicators:
      MPI_COMM_WORLD: all processes launched by mpirun.
      MPI_COMM_SELF: just this process.

    Size: the number of processes in the communicator.

    Rank: a unique number assigned to each process in the communicator group.

    When using MPI, each process always calls MPI_INIT at the beginning and
    MPI_FINALIZE at the end of your program.


    Basic MPI code example

    in C:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        ...
        MPI_Finalize();
        return 0;
    }

    in Fortran:

    program main
    include 'mpif.h'
    integer rank, nprocs
    integer ierr
    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    ...
    call MPI_FINALIZE(ierr)
    end


    MPI-IO exploits analogies with MPI

    Writing <-> sending a message
    Reading <-> receiving a message

    File access grouped via communicator: collective operations
    User-defined MPI datatypes for e.g. noncontiguous data layout

    I/O latency hiding much like communication latency hiding
    (I/O may even share the network with communication)

    All functionality through function calls.

    MPI-IO: Basic I/O Operations - C


    int MPI_File_open(MPI_Comm comm, char *filename, int amode,
                      MPI_Info info, MPI_File *fh)

    int MPI_File_seek(MPI_File fh, MPI_Offset offset, int whence)

    int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,
                          MPI_Datatype filetype, char *datarep, MPI_Info info)

    int MPI_File_read(MPI_File fh, void *buf, int count,
                      MPI_Datatype datatype, MPI_Status *status)

    int MPI_File_write(MPI_File fh, void *buf, int count,
                       MPI_Datatype datatype, MPI_Status *status)

    int MPI_File_close(MPI_File *fh)

    MPI-IO: Basic I/O Operations - Fortran


    MPI_FILE_OPEN(comm, filename, amode, info, fh, ierr)
      character*(*) filename
      integer comm, amode, info, fh, ierr

    MPI_FILE_SEEK(fh, offset, whence, ierr)
      integer(kind=MPI_OFFSET_KIND) offset
      integer fh, whence, ierr

    MPI_FILE_SET_VIEW(fh, disp, etype, filetype, datarep, info, ierr)
      integer(kind=MPI_OFFSET_KIND) disp
      integer fh, etype, filetype, info, ierr
      character*(*) datarep

    MPI_FILE_READ(fh, buf, count, datatype, status, ierr)
      <type> buf(*)
      integer fh, count, datatype, status(MPI_STATUS_SIZE), ierr

    MPI_FILE_WRITE(fh, buf, count, datatype, status, ierr)
      <type> buf(*)
      integer fh, count, datatype, status(MPI_STATUS_SIZE), ierr

    MPI_FILE_CLOSE(fh, ierr)
      integer fh, ierr

    MPI-IO: Opening and closing a file


    Files are maintained via file handles. Open files with MPI_File_open.
    The following codes open a file for reading, and close it right away:

    in C:

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "test.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);

    in Fortran:

    integer fh, ierr
    call MPI_FILE_OPEN(MPI_COMM_WORLD, "test.dat", &
                       MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
    call MPI_FILE_CLOSE(fh, ierr)


    Opening a file requires...

    communicator,
    file name,
    file handle (for all future reference to the file),
    file mode, made up of combinations of:
      MPI_MODE_RDONLY            read only
      MPI_MODE_RDWR              reading and writing
      MPI_MODE_WRONLY            write only
      MPI_MODE_CREATE            create file if it does not exist
      MPI_MODE_EXCL              error if creating a file that exists
      MPI_MODE_DELETE_ON_CLOSE   delete file on close
      MPI_MODE_UNIQUE_OPEN       file not to be opened elsewhere
      MPI_MODE_SEQUENTIAL        file to be accessed sequentially
      MPI_MODE_APPEND            position all file pointers at the end
    info structure, or MPI_INFO_NULL.

    In Fortran, the error code is the function's last argument;
    in C, the function returns the error code.


    etypes, filetypes, file views

    To make binary access a bit more natural for many applications, MPI-IO
    defines file access through the following concepts:

    1 etype: allows access to the file in units other than bytes.
      (Some parameters still have to be given in bytes.)

    2 filetype: each process defines what part of a shared file it uses.
      Filetypes specify a pattern which gets repeated in the file.
      Useful for noncontiguous access.
      For contiguous access, often etype = filetype.

    3 displacement: where to start in the file, in bytes.

    Together, these specify the file view, set by MPI_File_set_view.
    The default view has etype = filetype = MPI_BYTE and displacement 0.

    MPI-IO: Contiguous Data


    (figure: processes P(0), P(1), P(2), P(3), each writing to its own
    contiguous view(0), view(1), view(2), view(3) of one file)

    int buf[...];
    MPI_Offset bufsize = ...;

    MPI_File_open(MPI_COMM_WORLD, "file", MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset disp = rank*bufsize*sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native",
                      MPI_INFO_NULL);
    MPI_File_write(fh, buf, bufsize, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI-IO: Overview of all read functions


                              Single task               Collective

    Individual file pointer
      blocking                MPI_File_read             MPI_File_read_all
      nonblocking             MPI_File_iread            MPI_File_read_all_begin
                                + (MPI_Wait)              MPI_File_read_all_end

    Explicit offset
      blocking                MPI_File_read_at          MPI_File_read_at_all
      nonblocking             MPI_File_iread_at         MPI_File_read_at_all_begin
                                + (MPI_Wait)              MPI_File_read_at_all_end

    Shared file pointer
      blocking                MPI_File_read_shared      MPI_File_read_ordered
      nonblocking             MPI_File_iread_shared     MPI_File_read_ordered_begin
                                + (MPI_Wait)              MPI_File_read_ordered_end
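
    A minimal sketch (assuming fh is an already opened MPI_File with the
    default view, and rank is this process's rank) of the nonblocking,
    explicit-offset row of the table: MPI_File_iread_at starts the read and
    MPI_Wait completes it.

    int buf[1000];
    MPI_Request req;
    MPI_Status status;
    /* offset is counted in etypes of the current view (bytes by default) */
    MPI_Offset offset = (MPI_Offset)rank*1000*sizeof(int);

    MPI_File_iread_at(fh, offset, buf, 1000, MPI_INT, &req);
    /* ... do other work while the read is in flight ... */
    MPI_Wait(&req, &status);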

    MPI-IO: Overview of all write functions


                              Single task               Collective

    Individual file pointer
      blocking                MPI_File_write            MPI_File_write_all
      nonblocking             MPI_File_iwrite           MPI_File_write_all_begin
                                + (MPI_Wait)              MPI_File_write_all_end

    Explicit offset
      blocking                MPI_File_write_at         MPI_File_write_at_all
      nonblocking             MPI_File_iwrite_at        MPI_File_write_at_all_begin
                                + (MPI_Wait)              MPI_File_write_at_all_end

    Shared file pointer
      blocking                MPI_File_write_shared     MPI_File_write_ordered
      nonblocking             MPI_File_iwrite_shared    MPI_File_write_ordered_begin
                                + (MPI_Wait)              MPI_File_write_ordered_end


    Collective vs. single task

    After a file has been opened and a file view is defined, processes can
    independently read and write to their part of the file.

    If the I/O occurs at regular spots in the program, which the different
    processes reach at the same time, it is better to use collective I/O:
    these are the _all versions of the MPI-IO routines.

    Two file pointers

    An MPI-IO file has two different kinds of file pointers:

    1 individual file pointer: one per process.
    2 shared file pointer: one per file (the _shared / _ordered routines).

    Shared doesn't mean collective, but it does imply synchronization!
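
    A minimal sketch (assuming fh is open on MPI_COMM_WORLD with the default
    view and every rank reaches this point) of a collective, explicit-offset
    write; because all ranks call it together, the MPI library and file
    system can merge the requests.

    double chunk[512];
    MPI_Offset offset = (MPI_Offset)rank*512*sizeof(double);

    MPI_File_write_at_all(fh, offset, chunk, 512, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);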

    MPI-IO: Strategic considerations


    Pros for single task I/O

    One can virtually always use only individual file pointers.

    If timings are variable, there is no need to wait for other processes.

    Cons

    If there are interdependences between how processes write, collective
    I/O operations may be faster.

    Collective I/O can collect data before doing the write or read.

    The true speed depends on the file system, the size of the data to
    write, and the implementation.

    MPI-IO: Noncontiguous Data


    (figure: processes P(0) ... P(3), each accessing interleaved,
    noncontiguous pieces of one file)

    Filetypes to the rescue!

    Define a 2-etype basic MPI Datatype.

    Increase its size to 8 etypes.

    Shift according to rank to pick out the right 2 etypes.

    Use the result as the filetype in the file view.

    Then gaps are automatically skipped.

    MPI-IO: Overview of data/filetype constructors


    Function                         Creates a ...

    MPI_Type_contiguous              contiguous datatype
    MPI_Type_vector                  vector (strided) datatype
    MPI_Type_indexed                 indexed datatype
    MPI_Type_create_indexed_block    indexed datatype w/ uniform block length
    MPI_Type_create_struct           structured datatype
    MPI_Type_create_resized          type with new extent and bounds
    MPI_Type_create_darray           distributed array datatype
    MPI_Type_create_subarray         n-dim subarray of an n-dim array
    ...

    Before using the created type, you have to call MPI_Type_commit.
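
    A minimal sketch (not from the slides) of one entry in the table:
    building a strided filetype with MPI_Type_vector (here 4 blocks of
    2 ints with a stride of 8 ints) and committing it before use.

    MPI_Datatype vtype;

    MPI_Type_vector(4, 2, 8, MPI_INT, &vtype);  /* count, blocklength, stride */
    MPI_Type_commit(&vtype);
    /* ... use vtype e.g. as the filetype in MPI_File_set_view ... */
    MPI_Type_free(&vtype);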

    MPI-IO: Accessing a noncontiguous file type


    in C:

    MPI_Datatype contig, ftype;
    MPI_Datatype etype = MPI_INT;
    MPI_Aint extent = sizeof(int)*8;    /* in bytes! */
    MPI_Offset d = 2*sizeof(int)*rank;  /* in bytes! */

    MPI_Type_contiguous(2, etype, &contig);
    MPI_Type_create_resized(contig, 0, extent, &ftype);
    MPI_Type_commit(&ftype);
    MPI_File_set_view(fh, d, etype, ftype, "native",
                      MPI_INFO_NULL);

    MPI-IO: Accessing a noncontiguous file type


    in Fortran:

    integer :: etype, contig, ftype, ierr
    integer(kind=MPI_ADDRESS_KIND) :: lb, extent
    integer(kind=MPI_OFFSET_KIND) :: d

    etype = MPI_INT
    lb = 0
    extent = 4*8
    d = 2*4*rank
    call MPI_TYPE_CONTIGUOUS(2, etype, contig, ierr)
    call MPI_TYPE_CREATE_RESIZED(contig, lb, extent, ftype, ierr)
    call MPI_TYPE_COMMIT(ftype, ierr)
    call MPI_FILE_SET_VIEW(fh, d, etype, ftype, "native",
                           MPI_INFO_NULL, ierr)


    File data representation

    native: data is stored in the file as it is in memory; no conversion is
      performed. No loss in performance, but not portable.

    internal: implementation-dependent conversion. Portable across machines
      with the same MPI implementation, but not across different
      implementations.

    external32: specific data representation, basically 32-bit big-endian
      IEEE format. See the MPI standard for more info. Completely portable,
      but not the best performance.

    These have to be given to MPI_File_set_view as strings.
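
    For example (assuming fh is already open), requesting the portable
    representation instead of the default is just a different string;
    support and speed vary by MPI implementation:

    MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "external32",
                      MPI_INFO_NULL);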


    More noncontiguous data: subarrays

    What if there's a 2d matrix that is distributed across processes?

    (figure: a 2d array divided into rectangular blocks owned by
    proc0, proc1, proc2, proc3)

    Common cases of noncontiguous access have specialized functions:
    MPI_Type_create_subarray & MPI_Type_create_darray.

    MPI-IO: More noncontiguous data: subarrays


    int gsizes[2] = {16, 6};
    int lsizes[2] = {8, 3};
    int psizes[2] = {2, 2};
    int coords[2] = {rank % psizes[0], rank / psizes[0]};
    int starts[2] = {coords[0]*lsizes[0], coords[1]*lsizes[1]};

    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native",
                      MPI_INFO_NULL);
    MPI_File_write_all(fh, local_array, local_array_size, MPI_INT,
                       MPI_STATUS_IGNORE);

    Tip

    MPI_Cart_create can be useful to compute the coordinates for a process,
    as in the sketch below.
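
    A minimal sketch (not from the slides) of that tip: let MPI pick the
    process grid and the coordinates instead of doing rank arithmetic by
    hand (assumes nprocs and rank are already set).

    MPI_Comm cart;
    int psizes[2] = {0, 0}, periods[2] = {0, 0}, coords[2];

    MPI_Dims_create(nprocs, 2, psizes);               /* choose a 2-d grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, psizes, periods, 0, &cart);
    MPI_Cart_coords(cart, rank, 2, coords);           /* this rank's (row, col) */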

    MPI-IO: Example: N-body MD checkpointing


    Challenge

    Simulate n-body molecular dynamics, Lennard-Jones potential.

    Parallel MPI run: atoms distributed.

    At intervals, checkpoint the state of the system to 1 file.

    Restarts should be allowed to use more or fewer MPI processes.

    Restart should be efficient.

    State of the system: total of tot atoms with properties:

    struct Atom {
        double q[3];
        double p[3];
        long long tag;
        long long id;
    };

    type atom
        double precision :: q(3)
        double precision :: p(3)
        integer(kind=8) :: tag
        integer(kind=8) :: id
    end type atom

    MPI-IO: Example: N-body MD checkpointing


    Issues

    Atom data is more than an array of doubles: indices etc.

    Writing problem: processes have different numbers of atoms;
    how to define views, subarrays, ... ?

    Reading problem: it is not known which process gets which atoms.

    Even worse if the number of processes is changed on restart.

    Approach

    Abstract the atom datatype.

    Compute where in the file each process should write, and how much.
    Store that info in a header.

    Restart with the same nprocs is then straightforward.

    Different nprocs: an MPI exercise outside the scope of a 1-day class.

    MPI-IO: Example: N-body MD checkpointing


    Defining the Atom etype

    struct Atom atom;
    int i, len[4] = {3, 3, 1, 1};
    MPI_Aint addr[5];
    MPI_Aint disp[4];

    MPI_Get_address(&atom, &(addr[0]));
    MPI_Get_address(&(atom.q[0]), &(addr[1]));
    MPI_Get_address(&(atom.p[0]), &(addr[2]));
    MPI_Get_address(&atom.tag, &(addr[3]));
    MPI_Get_address(&atom.id, &(addr[4]));
    for (i = 0; i < 4; i++)
        disp[i] = addr[i+1] - addr[0];

    MPI_Datatype t[4] = {MPI_DOUBLE, MPI_DOUBLE, MPI_LONG_LONG, MPI_LONG_LONG};

    MPI_Type_create_struct(4, len, disp, t, &MPI_ATOM);
    MPI_Type_commit(&MPI_ATOM);

    MPI-IO: Example: N-body MD checkpointing


    File structure:
      header (integers): tot | p | n[0] | n[1] | ... | n[p-1]
      body (atoms):      n[0] atoms | n[1] atoms | ... | n[p-1] atoms

    Functions

    void cp_write(struct Atom *a, int n, MPI_Datatype t,
                  char *f, MPI_Comm c)

    void cp_read(struct Atom *a, int *n, int tot,
                 MPI_Datatype t, char *f, MPI_Comm c)

    void redistribute()   (consider this one done)

    MPI-IO: Checkpoint writing

    void cp_write(struct Atom *a, int n, MPI_Datatype t, char *f, MPI_Comm c) {
        int p, r;
        MPI_Comm_size(c, &p);
        MPI_Comm_rank(c, &r);

        int header[p+2];
        MPI_Allgather(&n, 1, MPI_INT, &(header[2]), 1, MPI_INT, c);
        int i, nbelow = 0;
        for (i = 0; i < r; i++)
            nbelow += header[i+2];

        MPI_File h;
        MPI_File_open(c, f, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &h);
        if (r == p-1) {
            header[0] = nbelow + n;
            header[1] = p;
            MPI_File_write(h, header, p+2, MPI_INT, MPI_STATUS_IGNORE);
        }
        MPI_File_set_view(h, (p+2)*sizeof(int), t, t, "native", MPI_INFO_NULL);
        MPI_File_write_at(h, nbelow, a, n, t, MPI_STATUS_IGNORE);
        MPI_File_close(&h);
    }
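
    A hedged sketch (not given in the slides) of the matching cp_read,
    assuming the checkpoint is re-read by the same number of processes that
    wrote it; redistribution for a different nprocs is left out.

    void cp_read(struct Atom *a, int *n, int tot, MPI_Datatype t, char *f, MPI_Comm c) {
        int p, r;
        MPI_Comm_size(c, &p);
        MPI_Comm_rank(c, &r);

        MPI_File h;
        MPI_File_open(c, f, MPI_MODE_RDONLY, MPI_INFO_NULL, &h);

        /* header = { tot, p, n[0], ..., n[p-1] }, stored as raw ints;
           header[0] could be checked against the tot argument */
        int header[p+2];
        MPI_File_read_all(h, header, p+2, MPI_INT, MPI_STATUS_IGNORE);
        *n = header[r+2];                 /* this rank re-reads what it wrote */
        int i, nbelow = 0;
        for (i = 0; i < r; i++)
            nbelow += header[i+2];

        MPI_File_set_view(h, (p+2)*sizeof(int), t, t, "native", MPI_INFO_NULL);
        MPI_File_read_at(h, nbelow, a, *n, t, MPI_STATUS_IGNORE);
        MPI_File_close(&h);
    }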

    MPI-IO: Example: N-body MD checkpointing


    Reading and redistributing atoms

    Copy the example lj code to your own directory:
    $ cp -r ~rzon/Courses/lj .
    $ cd lj

    Get your environment set up with
    $ source ./envsetup

    Type make to build:
    $ make

    Run:
    $ mpirun -np 8 lj run.ini

    Creates 70778.cp as a checkpoint (try ls -l).

    Rerun to see that it successfully reads the checkpoint.

    Run again with different # of processors. What happens?

    MPI-IO: Example: N-body MD checkpointing


    Reading and redistributing atoms

    The lj program has an option to produce an xsf file with the trajectory,
    which can be displayed in vmd.

    Edit run.ini, change the line 0 to 5

    Careful: xsf is an ascii format, and the file quickly gets huge!

    Do:
    $ mpirun -np 8 lj run.ini

    Note the difference in speed!

    Open in vmd:
    $ vmd pos.xsf

    MPI-IO: Example: N-body MD checkpointing


    Checking binary files

    There is an interactive binary file viewer, cbin, in ~rzon/Courses/lj.
    It is useful for quickly checking your binary format without having to
    write a test program.

    Start the program with:
    $ cbin 70778.cp
    This gets you a prompt, with the file loaded.

    Commands at the prompt are a letter plus an optional number.
    E.g. i2 reads 2 integers, d6 reads 6 doubles.

    It has support for reversing the endianness, switching between loaded
    files, and moving the file pointer.

    Type ? for a list of commands.

    Check the format of the file.


    Good References on MPI-IO

    W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the
    Message-Passing Interface (MIT Press, 1999).

    J. M. May, Parallel I/O for High Performance Computing
    (Morgan Kaufmann, 2000).

    W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg,
    W. Saphir, and M. Snir, MPI-2: The Complete Reference, Volume 2:
    The MPI-2 Extensions (MIT Press, 1998).

    HDF5/NETCDF
