Parallel I/O Course



    Parallel I/O

    SciNet

    www.scinet.utoronto.ca

    University of Toronto

    Toronto, Canada

    October 19, 2010


    Outline / Schedule

    1 Introduction

    2 File Systems and I/O

    3 Data Management

    4 Break

    5 Parallel I/O

    6 MPI-IO

    7 HDF5/NETCDF


    Disk I/O

    Common Uses

    Checkpoint/Restart Files

    Data Analysis

    Data Organization

    Time accurate and/or Optimization Runs

    Batch and Data processing

    Database


    Disk I/O

    Common Bottlenecks

    Mechanical disks are slow!

    System call overhead (open, close, read, write)

    Shared file systems (NFS, Lustre, GPFS, etc.)

    HPC systems are typically designed for high bandwidth (GB/s), not IOPs

    Uncoordinated independent accesses


    Disk Access Rates over Time

    Figure by R. Ross, Argonne National Laboratory, CScADS09


    Memory/Storage Latency

    Figure by R. Freitas and L. Chiu, IBM Almaden Labs, FAST10


    Definitions

    IOPs: Input/Output Operations Per Second (read, write, open, close, seek)

    I/O Bandwidth

    Quantity you read/write (think network bandwidth)

    Comparisons

    Device      Bandwidth (MB/s)   per-node   IOPs     per-node
    SATA HDD    100                100        100      100
    SSD         250                250        4000     4000
    SciNet      5000               1.25       30000    7.5


    SciNet Filesystem

    File System

    1,790 1TB SATA disk drives, for a total of 1.4 PB

    Two DCS9900 couplets, each delivering:
      4-5 GB/s read/write access (bandwidth)
      30,000 IOPs max (open, close, seek, ...)

    Single GPFS file system on TCS and GPC

    I/O goes over the Gb ethernet network on GPC (InfiniBand on TCS)

    File system is parallel!


    I/O Software Stack

    Parallel File System (series of figure slides)


    Parallel File System

    File Locks

    Most parallel file systems use locks to manage concurrent file access

    Files are broken up into lock units

    Clients obtain locks on units that they will access before I/O occurs

    Enables caching on clients as well (as long as a client holds a lock,
    it knows its cached data is valid)

    Locks are reclaimed from clients when others desire access


    Parallel File System

    Optimal for large shared files.

    Behaves poorly under many small reads and writes, high IOPs

    Your use of it affects everybody!
    (Different from the CPU and RAM, which are not shared.)

    How you read and write, your file format, the number of files in a
    directory, and how often you ls, affects every user!

    The file system is shared over the ethernet network on GPC:
    hammering the file system can hurt process communications.

    File systems are not infinite!
    Bandwidth, metadata, IOPs, number of files, space, ...


    Parallel File System

    2 jobs doing simultaneous I/O can take much longer than twice a single
    job's duration, due to disk contention and directory locking.

    SciNet: 500+ users doing I/O from 4000 nodes.
    That's a lot of sharing and contention!


    I/O Best Practices

    Make a plan

    Make a plan for your data needs:
      How much will you generate?
      How much do you need to save?
      And where will you keep it?

    Note that /scratch is temporary storage for 3 months or less.

    Options?

    1 Save on your departmental/local server/workstation
      (it is possible to transfer TBs per day on a gigabit link);

    2 Apply for a project space allocation at the next RAC call
      (but space is very limited);

    3 Buy tapes through us ($100/TB) and we can archive your data to tape;
      HSM possibility within next 6 months;

    4 Change storage format.


    I/O Best Practices

    Monitor and control usage

    Minimize use of filesystem commands like ls and du.
    Regularly check your disk usage using /scinet/gpc/bin/diskUsage.

    Warning signs which should prompt careful consideration:

    More than 100,000 files in your space
    Average file size less than 100 MB

    Monitor disk actions with top and strace

    RAM is always faster than disk; think about using ramdisk.

    Use gzip to compress files and tar to bundle many files into one

    Try gzipping your data files. 30% compression is not atypical!

    Delete files that are no longer needed

    Do housekeeping (gzip, tar, delete) regularly.


    I/O Best Practices

    Do's

    Write binary format files.
    Faster I/O and less space than ASCII files.

    Use parallel I/O if writing from many nodes

    Maximize size of files. Large block I/O optimal!

    Minimize number of files. Makes filesystem more responsive!

    Don'ts

    Don't write lots of ASCII files. Lazy, slow, and wastes space!

    Don't write many hundreds of files in a single directory. (File locks!)

    Don't write many small files (< 10 MB).
    The system is optimized for large-block I/O.


    Data Management

    Formats

    ASCII

    Binary

    Metadata (XML)

    Databases

    Standard libraries (HDF5, NetCDF)


    ASCII

    American Standard Code for Information Interchange

    Pros

    Human Readable

    Portable (architecture independent)

    Cons

    Inefficient Storage

    Expensive for Read/Write (conversions)


    Native Binary


    Pros

    Efficient Storage (256 floats @ 4 bytes takes 1024 bytes)

    Efficient Read/Write (native)

    Cons

    Have to know the format to read

    Portability (Endianness)


    ASCII vs. binary

    Writing 128M doubles

    Format    /scratch (GPFS)    /dev/shm (RAM)    /tmp (disk)
    ASCII     173s               174s              260s
    Binary    6s                 1s                20s

    Syntax

    Format    C            Fortran
    ASCII     fprintf()    open(6,file='test',form='formatted')
                           write(6,*)
    Binary    fwrite()     open(6,file='test',form='unformatted')
                           write(6)
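
    A minimal sketch (not from the slides) contrasting the two rows above:
    writing the same array of doubles in ASCII with fprintf() versus in raw
    binary with fwrite().

    #include <stdio.h>

    int main(void)
    {
        double x[1000];
        int i;
        for (i = 0; i < 1000; i++)
            x[i] = i*0.001;

        /* ASCII: one formatted conversion per value -- large and slow */
        FILE *fa = fopen("data.txt", "w");
        for (i = 0; i < 1000; i++)
            fprintf(fa, "%.17g\n", x[i]);
        fclose(fa);

        /* Binary: one bulk write of the in-memory representation */
        FILE *fb = fopen("data.bin", "wb");
        fwrite(x, sizeof(double), 1000, fb);
        fclose(fb);

        return 0;
    }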


    What is Metadata?

    Data about data
    File system: size, location, date, owner, etc.

    App Data: File format, version, iteration, etc.

    Example: XML

    (XML example; the tags were not preserved in this transcript. Field
    values included: UTF encoding, 1000, 6.8, January 15th, 2010, and the
    coordinates 47 23.516 -122 02.625.)

    Databases


    Beyond flat files

    Very powerful and flexible storage approach

    Data organization and analysis can be greatly simplified

    Enhanced performance over seek/sort depending on usage

    Open-source software:
      SQLite (serverless)
      PostgreSQL
      MySQL
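
    As a minimal illustration of the serverless SQLite option above (the
    table name and schema here are invented for the example; build with
    -lsqlite3):

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("runs.db", &db) != SQLITE_OK)
            return 1;
        /* One database file instead of many small flat files */
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS results (step INTEGER, energy REAL);"
            "INSERT INTO results VALUES (100, -3.21);",
            NULL, NULL, &err);
        if (err) {
            fprintf(stderr, "%s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }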

    Standard Formats


    CGNS (CFD General Notation System)

    IGES/STEP (CAD Geometry)

    HDF5 (Hierarchical Data Format)

    NetCDF (Network Common Data Format)


    HDF5/NetCDF - Serial




    Common Ways of Doing Parallel I/O


    Sequential I/O (only proc 0 writes/reads)

    Pro
      Trivially simple for small I/O
      Some I/O libraries are not parallel anyway

    Con
      Bandwidth limited by the rate one client can sustain
      May not have enough memory on the node to hold all the data
      Won't scale (built-in bottleneck)
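
    A minimal sketch (not from the slides) of this pattern: gather all data
    to rank 0, which is the only process that touches the file.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, nlocal = 1000, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *local = malloc(nlocal*sizeof(double));
        for (i = 0; i < nlocal; i++)
            local[i] = rank + i*1e-6;           /* some per-rank data */

        double *all = NULL;
        if (rank == 0)
            all = malloc((size_t)nprocs*nlocal*sizeof(double));

        /* Funnel everything to rank 0 ... */
        MPI_Gather(local, nlocal, MPI_DOUBLE,
                   all, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* ... which does all the I/O: note the memory and bandwidth
           bottleneck on that one process */
        if (rank == 0) {
            FILE *f = fopen("output.bin", "wb");
            fwrite(all, sizeof(double), (size_t)nprocs*nlocal, f);
            fclose(f);
            free(all);
        }
        free(local);
        MPI_Finalize();
        return 0;
    }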


    N files for N Processes

    Pro

    No interprocess communication or coordination necessary
    Possibly better scaling than single sequential I/O

    Con

    As process counts increase: lots of (small) files; won't scale
    Data often must be post-processed into one file
    Uncoordinated I/O may swamp the file system (file locks!)
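
    A minimal sketch (not from the slides) of the one-file-per-process
    pattern: each rank writes its own file, named after its rank.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local[1000] = {0};
        char fname[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* e.g. output.0003.bin -- no coordination needed to write it ... */
        snprintf(fname, sizeof(fname), "output.%04d.bin", rank);
        FILE *f = fopen(fname, "wb");
        fwrite(local, sizeof(double), 1000, f);
        fclose(f);
        /* ... but at large process counts this leaves N files to manage */

        MPI_Finalize();
        return 0;
    }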


    All Processes Access One File

    Pro
      Only one file
      Data can be stored canonically, avoiding post-processing
      Will scale if done correctly

    Con

    Uncoordinated I/O WILL swamp the file system (file locks!)
    Requires more design and thought


    What is Parallel I/O?

    Multiple processes of a parallel program accessing data (reading or
    writing) from a common file.


    Why Parallel I/O?

    Non-parallel I/O is simple but:

    Poor performance (single process writes to one file)

    Awkward and not interoperable with other tools (each process writes a
    separate file)

    Parallel I/O

    Higher performance through collective and contiguous I/O
    Single file (visualization, data management, storage, etc.)

    Works with file system not against it

    Contiguous and Noncontiguous I/O


    Contiguous I/O moves data from a single memory block into a single file block

    Noncontiguous I/O has three forms:

    Noncontiguous in memory, in file, or in both

    Structured data leads naturally to noncontiguous I/O
    (e.g. block decomposition)

    Describing noncontiguous accesses with a single operation passes more
    knowledge to the I/O system

    Independent and Collective I/O


    Independent I/O operations specify only what a single process will do

    These calls obscure relationships between I/O on other processes

    Many applications have phases of computation and I/O

    During I/O phases, all processes read/write data
    We can say they are collectively accessing storage

    Collective I/O is coordinated access to storage by a group of processes

    Its functions are called by all processes participating in I/O
    Allows the file system to know more about the access as a whole:
    more optimization in lower software layers, better performance


    Available Approaches

    MPI-IO: MPI-2 Language Standard

    HDF (Hierarchical Data Format)

    NetCDF (Network Common Data Format)

    Adaptable IO System (ADIOS)
      Actively developed (OLCF, Sandia NL, Georgia Tech) and used on the
      largest HPC systems (Jaguar, Blue Gene/P)
      External to the code: an XML file describes the various elements
      Uses MPI-IO, can work with HDF/NetCDF


    MPI-IO


    MPI: Message Passing Interface

    Language-independent communications protocol used to program parallel
    computers.

    MPI-IO: Parallel file access protocol

    MPI-IO: The parallel I/O part of the MPI-2 standard (1996).

    Many other parallel I/O solutions are built upon it.

    Versatile and better performance than standard Unix I/O.
    Usually collective I/O is the most efficient.


    Advantages of MPI-IO

    noncontiguous access of files and memory

    collective I/O

    individual and shared file pointers

    explicit offsets

    portable data representation

    can give hints to implementation/file system

    no text/formatted output!


    MPI concepts

    Process: an instance of your program, often 1 per core.

    Communicator: a group of processes and their topology.
    Standard communicators:
      MPI_COMM_WORLD: all processes launched by mpirun.
      MPI_COMM_SELF: just this process.

    Size: the number of processes in the communicator.

    Rank: a unique number assigned to each process in the communicator group.

    When using MPI, each process always calls MPI_INIT at the beginning and
    MPI_FINALIZE at the end of your program.


    Basic MPI code example

    in C:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        ...
        MPI_Finalize();
        return 0;
    }

    in Fortran:

    program main
    include 'mpif.h'
    integer rank, nprocs
    integer ierr
    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    ...
    call MPI_FINALIZE(ierr)
    end


    MPI-IO exploits analogies with MPI

    Writing <-> sending a message
    Reading <-> receiving a message

    File access grouped via communicator: collective operations
    User-defined MPI datatypes for e.g. noncontiguous data layout

    I/O latency hiding much like communication latency hiding
    (I/O may even share the network with communication)

    All functionality through function calls.

    MPI-IO: Basic I/O Operations - C


    int MPI_File_open(MPI_Comm comm, char *filename, int amode,
                      MPI_Info info, MPI_File *fh)

    int MPI_File_seek(MPI_File fh, MPI_Offset offset, int whence)

    int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,
                          MPI_Datatype filetype, char *datarep, MPI_Info info)

    int MPI_File_read(MPI_File fh, void *buf, int count,
                      MPI_Datatype datatype, MPI_Status *status)

    int MPI_File_write(MPI_File fh, void *buf, int count,
                       MPI_Datatype datatype, MPI_Status *status)

    int MPI_File_close(MPI_File *fh)

    MPI-IO: Basic I/O Operations - Fortran


    MPI_FILE_OPEN(comm, filename, amode, info, fh, ierr)
      character*(*) filename
      integer comm, amode, info, fh, ierr

    MPI_FILE_SEEK(fh, offset, whence, ierr)
      integer(kind=MPI_OFFSET_KIND) offset
      integer fh, whence, ierr

    MPI_FILE_SET_VIEW(fh, disp, etype, filetype, datarep, info, ierr)
      integer(kind=MPI_OFFSET_KIND) disp
      integer fh, etype, filetype, info, ierr
      character*(*) datarep

    MPI_FILE_READ(fh, buf, count, datatype, status, ierr)
      <type> buf(*)
      integer fh, count, datatype, status(MPI_STATUS_SIZE), ierr

    MPI_FILE_WRITE(fh, buf, count, datatype, status, ierr)
      <type> buf(*)
      integer fh, count, datatype, status(MPI_STATUS_SIZE), ierr

    MPI_FILE_CLOSE(fh, ierr)
      integer fh, ierr

    MPI-IO: Opening and closing a file


    Files are maintained via file handles. Open files with MPI_File_open.
    The following codes open a file for reading, and close it right away:

    in C:

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "test.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);

    in Fortran:

    integer fh, ierr
    call MPI_FILE_OPEN(MPI_COMM_WORLD, "test.dat", &
                       MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
    call MPI_FILE_CLOSE(fh, ierr)


    Opening a file requires...

    communicator,
    file name,
    file handle (for all future reference to the file),
    file mode, made up of combinations of:
      MPI_MODE_RDONLY            read only
      MPI_MODE_RDWR              reading and writing
      MPI_MODE_WRONLY            write only
      MPI_MODE_CREATE            create file if it does not exist
      MPI_MODE_EXCL              error if creating a file that exists
      MPI_MODE_DELETE_ON_CLOSE   delete file on close
      MPI_MODE_UNIQUE_OPEN       file not to be opened elsewhere
      MPI_MODE_SEQUENTIAL        file to be accessed sequentially
      MPI_MODE_APPEND            position all file pointers at the end
    info structure, or MPI_INFO_NULL.

    In Fortran, the error code is the function's last argument;
    in C, the function returns the error code.


    etypes, filetypes, file views

    To make binary access a bit more natural for many applications, MPI-IO
    defines file access through the following concepts:

    1 etype: allows access to the file in units other than bytes.
      (Some parameters still have to be given in bytes.)

    2 filetype: each process defines what part of a shared file it uses.
      Filetypes specify a pattern which gets repeated in the file.
      Useful for noncontiguous access.
      For contiguous access, often etype = filetype.

    3 displacement: where to start in the file, in bytes.

    Together, these specify the file view, set by MPI_File_set_view.
    The default view has etype = filetype = MPI_BYTE and displacement 0.

    MPI-IO: Contiguous Data


    (figure: processes P(0), P(1), P(2), P(3), each writing to its own
    contiguous view(0), view(1), view(2), view(3) of one file)

    int buf[...];
    MPI_Offset bufsize = ...;

    MPI_File_open(MPI_COMM_WORLD, "file", MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset disp = rank*bufsize*sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native",
                      MPI_INFO_NULL);
    MPI_File_write(fh, buf, bufsize, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI-IO: Overview of all read functions


                              Single task               Collective

    Individual file pointer
      blocking                MPI_File_read             MPI_File_read_all
      nonblocking             MPI_File_iread            MPI_File_read_all_begin
                                + (MPI_Wait)              MPI_File_read_all_end

    Explicit offset
      blocking                MPI_File_read_at          MPI_File_read_at_all
      nonblocking             MPI_File_iread_at         MPI_File_read_at_all_begin
                                + (MPI_Wait)              MPI_File_read_at_all_end

    Shared file pointer
      blocking                MPI_File_read_shared      MPI_File_read_ordered
      nonblocking             MPI_File_iread_shared     MPI_File_read_ordered_begin
                                + (MPI_Wait)              MPI_File_read_ordered_end
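
    A minimal sketch (assuming fh is an already opened MPI_File with the
    default view, and rank is this process's rank) of the nonblocking,
    explicit-offset row of the table: MPI_File_iread_at starts the read and
    MPI_Wait completes it.

    int buf[1000];
    MPI_Request req;
    MPI_Status status;
    /* offset is counted in etypes of the current view (bytes by default) */
    MPI_Offset offset = (MPI_Offset)rank*1000*sizeof(int);

    MPI_File_iread_at(fh, offset, buf, 1000, MPI_INT, &req);
    /* ... do other work while the read is in flight ... */
    MPI_Wait(&req, &status);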

    MPI-IO: Overview of all write functions


                              Single task               Collective

    Individual file pointer
      blocking                MPI_File_write            MPI_File_write_all
      nonblocking             MPI_File_iwrite           MPI_File_write_all_begin
                                + (MPI_Wait)              MPI_File_write_all_end

    Explicit offset
      blocking                MPI_File_write_at         MPI_File_write_at_all
      nonblocking             MPI_File_iwrite_at        MPI_File_write_at_all_begin
                                + (MPI_Wait)              MPI_File_write_at_all_end

    Shared file pointer
      blocking                MPI_File_write_shared     MPI_File_write_ordered
      nonblocking             MPI_File_iwrite_shared    MPI_File_write_ordered_begin
                                + (MPI_Wait)              MPI_File_write_ordered_end


    Collective vs. single task

    After a file has been opened and a file view is defined, processes can
    independently read and write to their part of the file.

    If the I/O occurs at regular spots in the program, which the different
    processes reach at the same time, it is better to use collective I/O:
    these are the _all versions of the MPI-IO routines.

    Two file pointers

    An MPI-IO file has two different kinds of file pointers:

    1 individual file pointer: one per process.
    2 shared file pointer: one per file (the _shared / _ordered routines).

    Shared doesn't mean collective, but it does imply synchronization!
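
    A minimal sketch (assuming fh is open on MPI_COMM_WORLD with the default
    view and every rank reaches this point) of a collective, explicit-offset
    write; because all ranks call it together, the MPI library and file
    system can merge the requests.

    double chunk[512];
    MPI_Offset offset = (MPI_Offset)rank*512*sizeof(double);

    MPI_File_write_at_all(fh, offset, chunk, 512, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);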

    MPI-IO: Strategic considerations


    Pros for single task I/O

    One can virtually always use only individual file pointers.

    If timings are variable, there is no need to wait for other processes.

    Cons

    If there are interdependences between how processes write, collective
    I/O operations may be faster.

    Collective I/O can collect data before doing the write or read.

    The true speed depends on the file system, the size of the data to
    write, and the implementation.

    MPI-IO: Noncontiguous Data


    (figure: processes P(0) ... P(3), each accessing interleaved,
    noncontiguous pieces of one file)

    Filetypes to the rescue!

    Define a 2-etype basic MPI Datatype.

    Increase its size to 8 etypes.

    Shift according to rank to pick out the right 2 etypes.

    Use the result as the filetype in the file view.

    Then gaps are automatically skipped.

    MPI-IO: Overview of data/filetype constructors


    Function                         Creates a ...

    MPI_Type_contiguous              contiguous datatype
    MPI_Type_vector                  vector (strided) datatype
    MPI_Type_indexed                 indexed datatype
    MPI_Type_create_indexed_block    indexed datatype w/ uniform block length
    MPI_Type_create_struct           structured datatype
    MPI_Type_create_resized          type with new extent and bounds
    MPI_Type_create_darray           distributed array datatype
    MPI_Type_create_subarray         n-dim subarray of an n-dim array
    ...

    Before using the created type, you have to call MPI_Type_commit.
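
    A minimal sketch (not from the slides) of one entry in the table:
    building a strided filetype with MPI_Type_vector (here 4 blocks of
    2 ints with a stride of 8 ints) and committing it before use.

    MPI_Datatype vtype;

    MPI_Type_vector(4, 2, 8, MPI_INT, &vtype);  /* count, blocklength, stride */
    MPI_Type_commit(&vtype);
    /* ... use vtype e.g. as the filetype in MPI_File_set_view ... */
    MPI_Type_free(&vtype);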

    MPI-IO: Accessing a noncontiguous file type


    in C:

    MPI_Datatype contig, ftype;
    MPI_Datatype etype = MPI_INT;
    MPI_Aint extent = sizeof(int)*8;    /* in bytes! */
    MPI_Offset d = 2*sizeof(int)*rank;  /* in bytes! */

    MPI_Type_contiguous(2, etype, &contig);
    MPI_Type_create_resized(contig, 0, extent, &ftype);
    MPI_Type_commit(&ftype);
    MPI_File_set_view(fh, d, etype, ftype, "native",
                      MPI_INFO_NULL);

    MPI-IO: Accessing a noncontiguous file type


    in Fortran:

    integer :: etype, contig, ftype, ierr
    integer(kind=MPI_ADDRESS_KIND) :: lb, extent
    integer(kind=MPI_OFFSET_KIND) :: d

    etype = MPI_INT
    lb = 0
    extent = 4*8
    d = 2*4*rank
    call MPI_TYPE_CONTIGUOUS(2, etype, contig, ierr)
    call MPI_TYPE_CREATE_RESIZED(contig, lb, extent, ftype, ierr)
    call MPI_TYPE_COMMIT(ftype, ierr)
    call MPI_FILE_SET_VIEW(fh, d, etype, ftype, "native",
                           MPI_INFO_NULL, ierr)


    File data representation

    native: data is stored in the file as it is in memory; no conversion is
      performed. No loss in performance, but not portable.

    internal: implementation-dependent conversion. Portable across machines
      with the same MPI implementation, but not across different
      implementations.

    external32: specific data representation, basically 32-bit big-endian
      IEEE format. See the MPI standard for more info. Completely portable,
      but not the best performance.

    These have to be given to MPI_File_set_view as strings.
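
    For example (assuming fh is already open), requesting the portable
    representation instead of the default is just a different string;
    support and speed vary by MPI implementation:

    MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "external32",
                      MPI_INFO_NULL);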


    More noncontiguous data: subarrays

    What if there's a 2d matrix that is distributed across processes?

    (figure: a 2d array divided into rectangular blocks owned by
    proc0, proc1, proc2, proc3)

    Common cases of noncontiguous access have specialized functions:
    MPI_Type_create_subarray & MPI_Type_create_darray.

    MPI-IO: More noncontiguous data: subarrays


    int gsizes[2] = {16, 6};
    int lsizes[2] = {8, 3};
    int psizes[2] = {2, 2};
    int coords[2] = {rank % psizes[0], rank / psizes[0]};
    int starts[2] = {coords[0]*lsizes[0], coords[1]*lsizes[1]};

    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native",
                      MPI_INFO_NULL);
    MPI_File_write_all(fh, local_array, local_array_size, MPI_INT,
                       MPI_STATUS_IGNORE);

    Tip

    MPI_Cart_create can be useful to compute the coordinates for a process,
    as in the sketch below.
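
    A minimal sketch (not from the slides) of that tip: let MPI pick the
    process grid and the coordinates instead of doing rank arithmetic by
    hand (assumes nprocs and rank are already set).

    MPI_Comm cart;
    int psizes[2] = {0, 0}, periods[2] = {0, 0}, coords[2];

    MPI_Dims_create(nprocs, 2, psizes);               /* choose a 2-d grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, psizes, periods, 0, &cart);
    MPI_Cart_coords(cart, rank, 2, coords);           /* this rank's (row, col) */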

    MPI-IO: Example: N-body MD checkpointing


    Challenge

    Simulate n-body molecular dynamics, Lennard-Jones potential.

    Parallel MPI run: atoms distributed.

    At intervals, checkpoint the state of the system to 1 file.

    Restarts should be allowed to use more or fewer MPI processes.

    Restart should be efficient.

    State of the system: total of tot atoms with properties:

    struct Atom {
        double q[3];
        double p[3];
        long long tag;
        long long id;
    };

    type atom
        double precision :: q(3)
        double precision :: p(3)
        integer(kind=8) :: tag
        integer(kind=8) :: id
    end type atom

    MPI-IO: Example: N-body MD checkpointing


    Issues

    Atom data is more than an array of doubles: indices etc.

    Writing problem: processes have different numbers of atoms;
    how to define views, subarrays, ... ?

    Reading problem: it is not known which process gets which atoms.

    Even worse if the number of processes is changed on restart.

    Approach

    Abstract the atom datatype.

    Compute where in the file each process should write, and how much.
    Store that info in a header.

    Restart with the same nprocs is then straightforward.

    Different nprocs: an MPI exercise outside the scope of a 1-day class.

    MPI-IO: Example: N-body MD checkpointing


    Defining the Atom etype

    struct Atom atom;
    int i, len[4] = {3, 3, 1, 1};
    MPI_Aint addr[5];
    MPI_Aint disp[4];

    MPI_Get_address(&atom, &(addr[0]));
    MPI_Get_address(&(atom.q[0]), &(addr[1]));
    MPI_Get_address(&(atom.p[0]), &(addr[2]));
    MPI_Get_address(&atom.tag, &(addr[3]));
    MPI_Get_address(&atom.id, &(addr[4]));
    for (i = 0; i < 4; i++)
        disp[i] = addr[i+1] - addr[0];

    MPI_Datatype t[4] = {MPI_DOUBLE, MPI_DOUBLE, MPI_LONG_LONG, MPI_LONG_LONG};

    MPI_Type_create_struct(4, len, disp, t, &MPI_ATOM);
    MPI_Type_commit(&MPI_ATOM);

    MPI-IO: Example: N-body MD checkpointing


    File structure:
      header (integers): tot | p | n[0] | n[1] | ... | n[p-1]
      body (atoms):      n[0] atoms | n[1] atoms | ... | n[p-1] atoms

    Functions

    void cp_write(struct Atom *a, int n, MPI_Datatype t,
                  char *f, MPI_Comm c)

    void cp_read(struct Atom *a, int *n, int tot,
                 MPI_Datatype t, char *f, MPI_Comm c)

    void redistribute()   (consider this one done)

    MPI-IO: Checkpoint writing

    void cp_write(struct Atom *a, int n, MPI_Datatype t, char *f, MPI_Comm c) {
        int p, r;
        MPI_Comm_size(c, &p);
        MPI_Comm_rank(c, &r);

        int header[p+2];
        MPI_Allgather(&n, 1, MPI_INT, &(header[2]), 1, MPI_INT, c);
        int i, nbelow = 0;
        for (i = 0; i < r; i++)
            nbelow += header[i+2];

        MPI_File h;
        MPI_File_open(c, f, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &h);
        if (r == p-1) {
            header[0] = nbelow + n;
            header[1] = p;
            MPI_File_write(h, header, p+2, MPI_INT, MPI_STATUS_IGNORE);
        }
        MPI_File_set_view(h, (p+2)*sizeof(int), t, t, "native", MPI_INFO_NULL);
        MPI_File_write_at(h, nbelow, a, n, t, MPI_STATUS_IGNORE);
        MPI_File_close(&h);
    }
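
    A hedged sketch (not given in the slides) of the matching cp_read,
    assuming the checkpoint is re-read by the same number of processes that
    wrote it; redistribution for a different nprocs is left out.

    void cp_read(struct Atom *a, int *n, int tot, MPI_Datatype t, char *f, MPI_Comm c) {
        int p, r;
        MPI_Comm_size(c, &p);
        MPI_Comm_rank(c, &r);

        MPI_File h;
        MPI_File_open(c, f, MPI_MODE_RDONLY, MPI_INFO_NULL, &h);

        /* header = { tot, p, n[0], ..., n[p-1] }, stored as raw ints;
           header[0] could be checked against the tot argument */
        int header[p+2];
        MPI_File_read_all(h, header, p+2, MPI_INT, MPI_STATUS_IGNORE);
        *n = header[r+2];                 /* this rank re-reads what it wrote */
        int i, nbelow = 0;
        for (i = 0; i < r; i++)
            nbelow += header[i+2];

        MPI_File_set_view(h, (p+2)*sizeof(int), t, t, "native", MPI_INFO_NULL);
        MPI_File_read_at(h, nbelow, a, *n, t, MPI_STATUS_IGNORE);
        MPI_File_close(&h);
    }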

    MPI-IO: Example: N-body MD checkpointing


    Reading and redistributing atoms

    Copy the example lj code to your own directory:
    $ cp -r ~rzon/Courses/lj .
    $ cd lj

    Get your environment set up with
    $ source ./envsetup

    Type make to build:
    $ make

    Run:
    $ mpirun -np 8 lj run.ini

    Creates 70778.cp as a checkpoint (try ls -l).

    Rerun to see that it successfully reads the checkpoint.

    Run again with different # of processors. What happens?

    MPI-IO: Example: N-body MD checkpointing


    Reading and redistributing atoms

    The lj program has an option to produce an xsf file with the trajectory,
    which can be displayed in vmd.

    Edit run.ini, change the line 0 to 5

    Careful: xsf is an ascii format, and the file quickly gets huge!

    Do:
    $ mpirun -np 8 lj run.ini

    Note the difference in speed!

    Open in vmd:
    $ vmd pos.xsf

    MPI-IO: Example: N-body MD checkpointing


    Checking binary files

    There is an interactive binary file viewer, cbin, in ~rzon/Courses/lj.
    It is useful for quickly checking your binary format without having to
    write a test program.

    Start the program with:
    $ cbin 70778.cp
    This gets you a prompt, with the file loaded.

    Commands at the prompt are a letter plus an optional number.
    E.g. i2 reads 2 integers, d6 reads 6 doubles.

    It has support for reversing the endianness, switching between loaded
    files, and moving the file pointer.

    Type ? for a list of commands.

    Check the format of the file.


    Good References on MPI-IO

    W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the
    Message-Passing Interface (MIT Press, 1999).

    J. M. May, Parallel I/O for High Performance Computing
    (Morgan Kaufmann, 2000).

    W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg,
    W. Saphir, and M. Snir, MPI-2: The Complete Reference, Volume 2:
    The MPI-2 Extensions (MIT Press, 1998).

    HDF5/NETCDF
