COSC 6374 – Parallel Computation
Collective I/O and Scientific Data Libraries
Mohamad Chaarawi and Edgar Gabriel
Spring 2009
Collective I/O Operations
• Same notion as collective communication
• All processes in the communicator participate in the
operation
• How do we benefit from Collective I/O?
Developing Collective I/O Algorithms
• Three classes of algorithms will be explained
– Dynamic segmentation algorithms
– Static segmentation algorithms
– Individual algorithms
• Examples using MPI_File_write_all
int MPI_File_write_all(MPI_File fh, void *buf, int count, MPI_Datatype datatype,
MPI_Status *status)
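A minimal sketch of how MPI_File_write_all is used; the file name out.dat, the block size COUNT, and the rank-based offsets are illustrative assumptions, not from the lecture:

#include <mpi.h>

#define COUNT 1024

int main(int argc, char **argv)
{
    int rank, buf[COUNT];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < COUNT; i++) buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* each rank positions its individual file pointer at its own block */
    MPI_File_seek(fh, (MPI_Offset)rank * COUNT * sizeof(int), MPI_SEEK_SET);
    /* collective write: all processes of the communicator participate */
    MPI_File_write_all(fh, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}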
Dynamic Segmentation Algorithm
• Group processes according to the number of writers
• All processes share location information about the data
to be written within their group (MPI_Allgather)
– File offsets
– Number of elements to be written
• Sort these lists in ascending order of the file offsets.
• Write a fixed number of bytes to disk in each cycle.
– Cycle buffer size (CBS)
– Using MPI_Gatherv, each process sends the elements it contributes in the current cycle to the writer process assigned to it (see the sketch below).
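A hedged sketch of this idea, under strong simplifying assumptions: a single writer group with rank 0 as the only writer, one contiguous block per process laid out in rank order, and a CBS that is a multiple of the block size so no contribution straddles a cycle boundary. A real implementation gathers variable-length offset/length lists, sorts them, and splits straddling blocks; all names here are illustrative.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BLK 1024          /* bytes contributed per process (assumption) */
#define CBS 2048          /* cycle buffer size: bytes written per cycle */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *mydata = malloc(BLK);
    memset(mydata, 'A' + rank % 26, BLK);

    /* Step 1: share the (offset, length) of every contribution. With one
       block per process this is a single pair each; in general these are
       variable-length lists that must be sorted by offset. */
    long my_info[2] = { (long)rank * BLK, BLK }, *all_info;
    all_info = malloc(2 * (size_t)size * sizeof(long));
    MPI_Allgather(my_info, 2, MPI_LONG, all_info, 2, MPI_LONG,
                  MPI_COMM_WORLD);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Step 2: per cycle, the writer gathers the contributions falling
       into the next CBS bytes of the file and writes them out. */
    int per_cycle = CBS / BLK;            /* processes served per cycle */
    char *cycle_buf = (rank == 0) ? malloc(CBS) : NULL;

    for (int start = 0; start < size; start += per_cycle) {
        int n = (size - start < per_cycle) ? size - start : per_cycle;
        int *counts = calloc(size, sizeof(int));
        int *displs = calloc(size, sizeof(int));
        for (int i = 0; i < n; i++) {
            counts[start + i] = BLK;
            displs[start + i] = i * BLK;
        }
        int mine = (rank >= start && rank < start + n) ? BLK : 0;
        MPI_Gatherv(mydata, mine, MPI_BYTE, cycle_buf, counts, displs,
                    MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank == 0)   /* write this cycle's data at its file offset */
            MPI_File_write_at(fh, (MPI_Offset)all_info[2 * start],
                              cycle_buf, n * BLK, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        free(counts); free(displs);
    }
    MPI_File_close(&fh);
    free(mydata); free(all_info); free(cycle_buf);
    MPI_Finalize();
    return 0;
}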
Fixed vs. Scaling CBS
• Scaling:
– Each writer writes the specified CBS in one cycle, so in each cycle the total amount of data written to disk is
CBS * number_of_writers
• Fixed:
– The cycle buffer size is divided among all writers, so in every cycle each writer writes
CBS / number_of_writers
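For example (numbers made up for illustration): with CBS = 4 KB and 2 writers, the scaling variant puts 2 * 4 KB = 8 KB on disk per cycle, while the fixed variant puts 4 KB on disk per cycle (2 KB per writer).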
Example: Write Dynamic Segmentation
[Figure: three processes, each holding six 1 KB blocks (process 0, the writer: blocks 1–6; process 1: blocks 7–12; process 2: blocks 13–18), written to a contiguous file. Per process: 6 KB; cycle size: 5 KB. After cycle 1 blocks 1–5 are on disk, after cycle 2 blocks 1–10, after cycle 3 blocks 1–15, and after cycle 4 all 18 blocks.]
Static Segmentation Algorithm
• Data is gathered from all processes at a root process
which will perform the low-level write operation.
• Data is written in fixed chunks, with the size of the
chunk being a configurable parameter.
– The root process gathers a fixed number of bytes from all
processes in each cycle.
• Because every process contributes a constant amount of data in every cycle, this algorithm makes better use of the communication resources in the cluster (see the sketch below).
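A hedged sketch of static segmentation, with all names illustrative: each process owns a contiguous NB*SS-byte file region laid out in rank order, and per cycle rank 0 gathers a constant SS bytes from every process and writes each segment into the owner's region.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define SS 2048   /* segment size gathered per process per cycle */
#define NB 4      /* number of cycles (segments per process) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *mydata = malloc(NB * SS);           /* this rank's file region */
    memset(mydata, 'A' + rank % 26, NB * SS);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    char *gatherbuf = (rank == 0) ? malloc((size_t)size * SS) : NULL;

    for (int cycle = 0; cycle < NB; cycle++) {
        /* constant per-process contribution -> a plain MPI_Gather works */
        MPI_Gather(mydata + cycle * SS, SS, MPI_BYTE,
                   gatherbuf, SS, MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank == 0)   /* root writes each segment into the owner's
                            region of the file */
            for (int p = 0; p < size; p++)
                MPI_File_write_at(fh,
                    (MPI_Offset)p * NB * SS + (MPI_Offset)cycle * SS,
                    gatherbuf + (size_t)p * SS, SS, MPI_BYTE,
                    MPI_STATUS_IGNORE);
    }
    MPI_File_close(&fh);
    free(mydata); free(gatherbuf);
    MPI_Finalize();
    return 0;
}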
Static Segmentation Algorithm
• This algorithm is a good fit for the following scenarios:
– With huge caches on the I/O nodes, the storage devices are decoupled from the compute cluster and thus show, from the application's perspective, virtually no sensitivity to irregular or strided file access patterns.
– Solid-state drives (SSDs): insensitive to irregular access patterns within the file.
Example: Write Static Segmentation
[Figure: the same three processes and file layout as above (six 1 KB blocks per process). Per process: 6 KB; cycle size: 2 KB, i.e. each process contributes two blocks per cycle. Cycle 1 puts blocks 1, 2, 7, 8, 13, 14 on disk; cycle 2 adds blocks 3, 4, 9, 10, 15, 16; cycle 3 completes all 18 blocks.]
Individual Algorithm
• Avoids communication operations entirely: each process writes its data individually to the hard drive.
• Extended by a scheduling approach that controls the number of processes concurrently performing I/O operations, and thus limits the burden on the meta-data servers of some file systems (see the sketch below).
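A hedged sketch of the scheduled variant: writes proceed in waves so that only a subset of processes performs I/O at any time. MPI_Barrier serves here as a simple stand-in for a real scheduler; NBYTES, WAVES, and the file name are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define NBYTES 4096   /* bytes written per process (assumption) */
#define WAVES  4      /* number of waves; limits concurrent writers */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *mydata = malloc(NBYTES);
    memset(mydata, 'A' + rank % 26, NBYTES);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* schedule: only every WAVES-th process writes in a given wave,
       limiting the number of concurrent requests hitting the
       meta-data server */
    for (int wave = 0; wave < WAVES; wave++) {
        if (rank % WAVES == wave)   /* individual, independent write */
            MPI_File_write_at(fh, (MPI_Offset)rank * NBYTES, mydata,
                              NBYTES, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);   /* wait for the wave to finish */
    }
    MPI_File_close(&fh);
    free(mydata);
    MPI_Finalize();
    return 0;
}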
Test Environment
[Figure: 24 compute nodes (Node 0 … Node 23) running Process 0 … Process 23, plus a front-end node and a meta-data server, connected via InfiniBand (IB) and Gigabit Ethernet (GE). File systems: PVFS2, NFS.]
Test Case
• A simple benchmark in which the processes collectively write max_size bytes of data to the file for a given number of iterations
• We measure the execution time required to write ALL data to the file
• Each test has been executed three times, taking the maximum bandwidth achieved
• The goal is to compare the performance while varying the algorithms, their parameters, and the file system
PVFS2 Results
• Each process executes MPI_File_write_all operations writing 20 MB of data per function call, for a total of 1 GB per process. Thus, the overall file size is 24 GB for the 24-process test cases and 48 GB for the 48-process test cases.
CBS = 20 MB – 24 Processes
[Chart: bandwidth results]
CBS = 20 MB – 48 Processes
[Chart: bandwidth results]
SS = 20 MB – 24 Processes
[Chart: bandwidth results]
NFS Results
• Each process writes 100 MB of data, in order to keep the execution time reasonable
• The segment size is kept constant at 2 MB
• The cycle buffer size is varied among 1 MB, 10 MB, and 20 MB
SS = 2 MB – 24 Processes
[Chart: bandwidth results]
SS = 2 MB – 48 Processes
[Chart: bandwidth results]
Tuning Methodologies
• The results show that many factors affect the performance of collective I/O operations:
– File system
– Network interconnect
– Number of processes
– Size of the file
– Algorithmic parameters (SS, CBS)
• Static tuning: prior to the execution of the application, tune for the best algorithmic/parametric settings for a certain platform
• Dynamic tuning: tune at runtime
Motivation
• MPI I/O is good
– It knows about data types (=> data conversion)
– It can optimize various access patterns in applications
• MPI I/O is bad
– It does not store any information about the data type
• A file written as MPI_INT can be read as MPI_DOUBLE by another application
• No information is stored about whether the data is a two-dimensional array or anything else
Scientific data libraries
• Handle data on a higher level
• Add more information to the data (metadata)
– Size of the data structure
– Information about the numerical format
– Read and write data structures by name
– … or add units to your data
• Two widely used libraries are available
– NetCDF
– HDF-5
HDF-5
• The Hierarchical Data Format (HDF) has been developed since 1988 at NCSA (University of Illinois)
– http://hdf.ncsa.uiuc.edu/HDF5/
• Has gone through a long history of changes; the current version, HDF-5, has been available since 1999
• HDF-5 supports
– Very large files
– Parallel I/O interface
– Fortran, C, Java bindings
HDF-5 dataset
• Multi-dimensional array of basic data elements
• A dataset consists of
– Header + data
• Header consists of
– Name
– Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes
– Dataspace: defines the size and shape of a multidimensional array. Dimensions can be fixed or unlimited.
– Storage layout: defines how multidimensional arrays are stored in the file. Can be contiguous or chunked.
Example of an HDF-5 file

HDF5 "tempseries.h5" {
GROUP "/" {
   GROUP "tempseries" {
      DATASET "height" {
         DATATYPE { "H5T_STD_I32BE" }
         DATASPACE { ARRAY ( 4 ) ( 4 ) }
         DATA {
            0, 50, 100, 150
         }
         ATTRIBUTE "units" {
            DATATYPE { "undefined string" }
            DATASPACE { ARRAY ( 0 ) ( 0 ) }
            DATA {
               unable to print
            }
         }
      }
      DATASET "temperature" {
         DATATYPE { "H5T_IEEE_F32BE" }
         DATASPACE { ARRAY ( 3, 8, 4 ) ( H5S_UNLIMITED, 8, 4 ) }
         DATA { … }
Storage layout: contiguous vs. chunked

[Figure: an 8×8 array holding the elements 1–64, shown once in contiguous storage (row by row) and once in chunked storage (four 4×4 chunks, stored one chunk at a time).]

• Advantages and disadvantages of chunking:
– Accessing rows and columns requires the same number of accesses
– Data can be extended in all dimensions
– Efficient storage of sparse arrays
– Can improve caching
HDF-5 groups
• An HDF-5 group is a collection of datasets
– Comparable to a directory in a UNIX-like file system
• HDF-5 naming convention
– All API functions start with H5
– The next character identifies the category of functions
• H5F: functions handling files
• H5G: functions handling groups
• H5D: functions handling datasets
• H5S: functions handling dataspaces
• H5A: functions handling attributes
Typical steps for writing a sequential HDF-5 file

1. Create the file
2. Create a group (opt.)
3. Define a dataspace
4. Define a datatype
5. Create a dataset
6. Add attributes
7. Write data
8. Close all objects

h5file = H5Fcreate(…)
group  = H5Gcreate(h5file, …)
tspace = H5Screate_simple(ndims, dims, maxdims);
ttype  = H5T_IEEE_F32BE;
tset   = H5Dcreate(group, "testset", ttype, tspace, …);
tattr  = H5Acreate(tset, "units", H5T_C_S1, …);
H5Awrite(tattr, H5T_C_S1, "meter");
H5Dwrite(tset, H5T_IEEE_F32BE, …, data);
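A runnable version of these steps might look as follows. This is a sketch assuming HDF5 1.8 or later (where H5Gcreate, H5Dcreate, and H5Acreate take additional property-list arguments); the file, group, dataset, and attribute names follow the slides, and the 2×3 sample data is made up.

#include <hdf5.h>

int main(void)
{
    /* hypothetical 2x3 array of floats to be written */
    float   data[2][3] = {{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}};
    hsize_t dims[2]    = {2, 3};

    /* 1. create the file (truncate if it already exists) */
    hid_t h5file = H5Fcreate("tempseries.h5", H5F_ACC_TRUNC,
                             H5P_DEFAULT, H5P_DEFAULT);

    /* 2. create a group */
    hid_t group = H5Gcreate(h5file, "tempseries", H5P_DEFAULT,
                            H5P_DEFAULT, H5P_DEFAULT);

    /* 3. define a dataspace: fixed 2x3, no unlimited dimensions */
    hid_t tspace = H5Screate_simple(2, dims, NULL);

    /* 4./5. create the dataset with a big-endian 32-bit float file type */
    hid_t tset = H5Dcreate(group, "testset", H5T_IEEE_F32BE, tspace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* 6. add a string attribute "units" with the value "meter" */
    hid_t strtype = H5Tcopy(H5T_C_S1);
    H5Tset_size(strtype, 6);                 /* "meter" plus '\0' */
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t tattr  = H5Acreate(tset, "units", strtype, aspace,
                             H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(tattr, strtype, "meter");

    /* 7. write the data; the memory type is the native float format */
    H5Dwrite(tset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* 8. close all objects */
    H5Aclose(tattr); H5Sclose(aspace); H5Tclose(strtype);
    H5Dclose(tset);  H5Sclose(tspace); H5Gclose(group);
    H5Fclose(h5file);
    return 0;
}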
Reading an HDF-5 file – structure of the file known

1. Open the file
2. Open the group
3. Open each dataset in the group
4. Look up dimensions
5. Read data
6. Read attributes
7. Read comments
8. Close all objects

h5file = H5Fopen(…)
group  = H5Gopen(h5file, "tempseries")
tset   = H5Dopen(group, "temperature");
tspace = H5Dget_space(tset);
H5Sget_simple_extent_dims(tspace, dims, …);
H5Dread(tset, H5T_IEEE_F32BE, H5S_ALL, tspace, …, buffer);
tattr  = H5Aopen_name(tset, "units");
attrtype = H5Aget_type(tattr);
H5Aread(tattr, attrtype, attr);
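For completeness, a runnable sketch of these steps, reading back the file produced by the writing example above (assuming HDF5 1.8+; the attribute buffer size is an illustrative assumption):

#include <hdf5.h>
#include <stdlib.h>

int main(void)
{
    hsize_t dims[2];
    char    attr[16];

    /* 1.-3. open file, group, and dataset; the structure (names from
       the writing example above) is assumed to be known */
    hid_t h5file = H5Fopen("tempseries.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t group  = H5Gopen(h5file, "tempseries", H5P_DEFAULT);
    hid_t tset   = H5Dopen(group, "testset", H5P_DEFAULT);

    /* 4. look up the dimensions from the dataset's dataspace */
    hid_t tspace = H5Dget_space(tset);
    H5Sget_simple_extent_dims(tspace, dims, NULL);

    /* 5. read the data, converting to the native float format */
    float *buffer = malloc(dims[0] * dims[1] * sizeof(float));
    H5Dread(tset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buffer);

    /* 6. read the "units" attribute using its stored datatype */
    hid_t tattr    = H5Aopen_name(tset, "units");
    hid_t attrtype = H5Aget_type(tattr);
    H5Aread(tattr, attrtype, attr);

    /* 8. close all objects */
    H5Tclose(attrtype); H5Aclose(tattr);
    H5Sclose(tspace);   H5Dclose(tset);
    H5Gclose(group);    H5Fclose(h5file);
    free(buffer);
    return 0;
}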
Compound Datatypes
• Abstraction for user structures
– Has a fixed size
– Each member has its own name, datatype, reference, and byte offset

h5type = H5Tcreate(H5T_class_t class, size_t size);
H5Tinsert(h5type, const char *name, size_t offset, hid_t field_id);
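A hedged sketch of building and using a compound type, assuming HDF5 1.8+; the sensor_t structure and the file and dataset names are made up. HOFFSET is HDF5's macro for computing member byte offsets.

#include <hdf5.h>

/* hypothetical user structure; all names are illustrative */
typedef struct {
    int    id;
    double temp;
} sensor_t;

int main(void)
{
    sensor_t data[4] = {{0, 21.5}, {1, 22.0}, {2, 20.9}, {3, 23.1}};
    hsize_t  dim     = 4;

    /* build a compound datatype mirroring the struct layout */
    hid_t h5type = H5Tcreate(H5T_COMPOUND, sizeof(sensor_t));
    H5Tinsert(h5type, "id",   HOFFSET(sensor_t, id),   H5T_NATIVE_INT);
    H5Tinsert(h5type, "temp", HOFFSET(sensor_t, temp), H5T_NATIVE_DOUBLE);

    /* write one 1-D dataset of compound elements */
    hid_t h5file = H5Fcreate("sensors.h5", H5F_ACC_TRUNC,
                             H5P_DEFAULT, H5P_DEFAULT);
    hid_t space  = H5Screate_simple(1, &dim, NULL);
    hid_t dset   = H5Dcreate(h5file, "sensors", h5type, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, h5type, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset); H5Sclose(space); H5Tclose(h5type); H5Fclose(h5file);
    return 0;
}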
Hyperslab
• A hyperslab is a portion of a dataset
– Operator: H5S_SELECT_SET, H5S_SELECT_OR
– Start: array determining the starting coordinates of the hyperslab
– Stride: array indicating which elements along a dimension are to be selected
– Count: array determining how many points or blocks to select in each dimension
– Block: array determining the size of the block selected in each dimension

H5Sselect_hyperslab(hid_t space_id, H5S_seloper_t operator,
                    const hssize_t *start, const hsize_t *stride,
                    const hsize_t *count, const hsize_t *block);
Example using hyperslabs
/* Define hyperslab in the dataset. */
offset[0] = 1; count[0] = NX_SUB;
offset[1] = 2; count[1] = NY_SUB;
status = H5Sselect_hyperslab (dataspace, H5S_SELECT_SET, offset,
NULL, count, NULL);
/*Read data from hyperslab in file into hyperslab in memory */
status = H5Dread (dataset, H5T_NATIVE_INT, memspace, dataspace,
H5P_DEFAULT, data_out);
[Figure: the selected hyperslab in the file dataspace, positioned by offset[0] and offset[1] and spanning count[0] × count[1] elements.]
Examples taken from the HDF-5 webpage
More complex example using hyperslabs

[Figure: two processes each select every second column of the file dataset, starting at the column given by their rank; the selected columns are packed contiguously in each process's memory (dimsmem[0] × dimsmem[1]).]

count[0]  = 1;
count[1]  = dimsmem[1];
block[0]  = dimsfile[0];
block[1]  = 1;
offset[0] = 0;
offset[1] = mpi_rank;
stride[0] = 1;
stride[1] = 2;

For dimension x, you generate count[x] blocks of block[x] elements each, starting at offset[x]; the distance between the starting points of consecutive blocks is stride[x].

Examples taken from the HDF-5 webpage
Parallel I/O with HDF-5
• Relies on MPI I/O
• The program has to use special properties (hints) indicating that parallel I/O should be used during
– File creation and file open
– Data access
• Properties are set through the H5P… functions

/* example for setting file access properties */
fileprops = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fileprops, MPI_COMM_WORLD, MPI_INFO_NULL);
h5file = H5Fcreate("tempseries.h5", …, fileprops);
Parallel data access in HDF-5
• The application has to define a set of interleaved file dataspaces on the processes that will access the file
– A similar technique to setting the file view in MPI I/O
– Usually based on defining hyperslabs
• Data transfer properties have to be set (a complete sketch follows below)

/* example for setting data transfer properties */
xferprops = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xferprops, H5FD_MPIO_COLLECTIVE);
H5Dwrite(tset, …, xferprops);
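Putting the pieces together, a minimal parallel write might look as follows. This is a hedged sketch assuming HDF5 1.8+ built with parallel support; every rank writes one row of a (size × NCOLS) dataset, and the file name, dataset name, and NCOLS are illustrative assumptions.

#include <hdf5.h>
#include <mpi.h>

#define NCOLS 4   /* columns per row (assumption) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float row[NCOLS];
    for (int i = 0; i < NCOLS; i++) row[i] = (float)rank;

    /* file access property: open the file with MPI I/O underneath */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t h5file = H5Fcreate("tempseries.h5", H5F_ACC_TRUNC,
                             H5P_DEFAULT, fapl);

    /* one (size x NCOLS) dataset shared by all processes */
    hsize_t dims[2] = {(hsize_t)size, NCOLS};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(h5file, "temperature", H5T_IEEE_F32BE,
                           filespace, H5P_DEFAULT, H5P_DEFAULT,
                           H5P_DEFAULT);

    /* each rank selects its own row of the file dataspace
       (interleaved file dataspaces via hyperslabs) */
    hsize_t offset[2] = {(hsize_t)rank, 0};
    hsize_t count[2]  = {1, NCOLS};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL,
                        count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* transfer property: perform the write collectively */
    hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, xfer, row);

    H5Pclose(xfer); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(h5file);
    MPI_Finalize();
    return 0;
}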