
COSC 6374 Parallel Computation

Collective I/O and Scientific Data Libraries

Mohamad Chaarawi and Edgar Gabriel

Spring 2009


Collective I/O Operations

• Same notion as collective communication

• All processes in the communicator participate in the operation

• How do we benefit from collective I/O?


Developing Collective I/O Algorithms

• Three classes of algorithms will be explained

– Dynamic segmentation algorithms

– Static segmentation algorithms

– Individual algorithms

• Examples use MPI_File_write_all

int MPI_File_write_all(MPI_File fh, void *buf, int count,
                       MPI_Datatype datatype, MPI_Status *status)
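
As a reference point, a minimal self-contained usage sketch (the file name and data layout are illustrative, not from the slides): each rank writes one contiguous block of integers with a single collective call.

/* Illustrative sketch: every rank writes COUNT ints to its own
   contiguous region of a shared file with one collective call. */
#include <mpi.h>

#define COUNT 1024

int main(int argc, char **argv)
{
    int rank, buf[COUNT];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < COUNT; i++) buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* rank r's block starts at byte offset r * COUNT * sizeof(int) */
    MPI_File_set_view(fh, (MPI_Offset)rank * COUNT * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

    /* collective: every rank in the communicator must call this */
    MPI_File_write_all(fh, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}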


Dynamic Segmentation Algorithm

• Group processes according to the number of writers

• All processes share location information about the data to be written within their group (MPI_Allgather)

– File offsets

– Number of elements to be written

• Sort these lists in ascending order of the file offsets

• Write a fixed number of bytes to disk in each cycle

– Cycle buffer size (CBS)

– Using MPI_Gatherv, each process sends the elements it contributes in the current cycle to the writer process assigned to it (see the sketch below)
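
A minimal sketch of the location-exchange step, assuming for illustration that each process contributes a single contiguous region (the struct and function names are not from the slides):

/* Every process in the writer group learns the file offset and
   element count of every other process via MPI_Allgather; afterwards
   each process can sort the list by offset. Sending the struct as
   raw bytes assumes a homogeneous cluster with identical layout. */
#include <mpi.h>
#include <stdlib.h>

typedef struct { MPI_Offset offset; int count; } io_desc_t;

io_desc_t *exchange_io_info(MPI_Comm group_comm, io_desc_t my_desc,
                            int *nprocs_out)
{
    int nprocs;
    MPI_Comm_size(group_comm, &nprocs);

    io_desc_t *all = malloc(nprocs * sizeof(io_desc_t));
    MPI_Allgather(&my_desc, (int)sizeof(io_desc_t), MPI_BYTE,
                  all, (int)sizeof(io_desc_t), MPI_BYTE, group_comm);

    *nprocs_out = nprocs;
    return all;   /* caller sorts by offset and frees */
}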


Fixed vs. Scaling CBS

• Scaling:

– Each writer writes the specified CBS in one cycle, so in each cycle the total amount of data written to disk is CBS * number_of_writers

• Fixed:

– The cycle buffer size is divided among all writers, so in every cycle each writer writes CBS / number_of_writers
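
For example, with CBS = 20 MB and four writers, the scaling variant puts 4 × 20 MB = 80 MB on disk per cycle, while the fixed variant writes 20 MB in total per cycle, i.e. 5 MB per writer.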


Example: Write Dynamic Segmentation

[Figure: three processes, each holding 6 KB of data in 1 KB blocks. Process 0 (the writer) holds blocks 1–6, process 1 blocks 7–12, process 2 blocks 13–18. With a cycle size of 5 KB, the file grows by at most 5 KB per cycle: blocks 1–5 after cycle 1, 1–10 after cycle 2, 1–15 after cycle 3, and all 18 blocks after cycle 4.]


Static Segmentation Algorithm

• Data is gathered from all processes at a root process, which performs the low-level write operation

• Data is written in fixed chunks, with the chunk size being a configurable parameter

– The root process gathers a fixed number of bytes from every process in each cycle (see the sketch below)

• Because every process contributes a constant amount of data in every cycle, this algorithm makes better use of the communication resources in the cluster
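
A sketch of a single cycle of this scheme (the chunk size, names, and use of a plain FILE* for the low-level write are illustrative; the slides do not prescribe this exact code):

/* One cycle of the static segmentation algorithm: the root gathers a
   fixed-size chunk from every process with MPI_Gatherv and writes the
   combined buffer to the file. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 2048   /* bytes contributed per process per cycle */

void write_cycle(MPI_Comm comm, const char *my_chunk, FILE *fp)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    char *gathered = NULL;
    int  *counts = NULL, *displs = NULL;

    if (rank == 0) {
        gathered = malloc((size_t)nprocs * CHUNK);
        counts   = malloc(nprocs * sizeof(int));
        displs   = malloc(nprocs * sizeof(int));
        for (int i = 0; i < nprocs; i++) {
            counts[i] = CHUNK;       /* constant contribution ...  */
            displs[i] = i * CHUNK;   /* ... placed back to back    */
        }
    }

    MPI_Gatherv(my_chunk, CHUNK, MPI_BYTE,
                gathered, counts, displs, MPI_BYTE, 0, comm);

    if (rank == 0) {
        fwrite(gathered, 1, (size_t)nprocs * CHUNK, fp);
        free(gathered); free(counts); free(displs);
    }
}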


Static Segmentation Algorithm

• This algorithm is a good fit for the following scenarios:

– Huge caches on the I/O nodes: the storage devices are decoupled from the compute cluster and thus show, from the application perspective, virtually no sensitivity to irregular or strided file access patterns

– Solid-state drives (SSDs): insensitive to irregular access within the file


Example: Write Static Segmentation

[Figure: the same three processes, each holding 6 KB of data in 1 KB blocks. With a cycle size of 2 KB per process, every process contributes two blocks per cycle: after cycle 1 the file holds blocks 1–2, 7–8, 13–14; after cycle 2 blocks 1–4, 7–10, 13–16; after cycle 3 all 18 blocks.]


Individual Algorithm

• Avoids communication operations entirely; each process writes its data individually to the hard drive

• Extended by a scheduling approach that controls the number of processes concurrently performing I/O operations, thus limiting the burden on the meta-data servers of some file systems


Test Environment

[Figure: 24 compute nodes (Node 0 – Node 23), each running a process (Process 0 – Process 23), plus a front-end node and a meta-data server, connected via InfiniBand (IB) and Gigabit Ethernet (GE) to the file systems (PVFS2, NFS).]


Test Case

• A simple benchmark in which the processes collectively write max_size bytes of data to the file for a given number of iterations

• We measure the execution time required to write ALL data to the file

• Each test has been executed three times, taking the maximum bandwidth achieved

• Goal: examine the difference in performance while varying the algorithms/parameters and the file system


PVFS2 Results

• Each process executes MPI_File_write_all operations writing 20 MB of data per function call, writing 1 GB of data to the file in total. Thus, the overall file size is 24 GB for the 24-process test cases and 48 GB for the 48-process test cases.


CBS=20MB – 24 Processes


CBS=20MB – 48 Processes


SS=20MB – 24 Processes


NFS Results

• Each process writes 100 MB of data, in order to keep the execution time reasonable

• The segment size is kept constant at 2 MB

• The cycle buffer size is varied between 1 MB, 10 MB, and 20 MB


SS=2MB – 24 Processes


SS=2MB – 48 Processes


Tuning Methodologies

• The results show that many factors affect the performance of collective I/O operations:

– File system

– Network interconnect

– Number of processes

– Size of file

– Algorithmic parameters (SS, CBS)

• Static tuning

– Prior to the execution of the application, tune for the best algorithmic/parametric properties for a certain platform

• Dynamic tuning

– Tune at runtime


Motivation

• MPI I/O is good

– It knows about data types (=> data conversion)

– It can optimize various access patterns in applications

• MPI I/O is bad

– It does not store any information about the data type

• A file written as MPI_INT can be read as MPI_DOUBLE by another application

• No information is stored about whether the data is a two-dimensional array or anything else


Scientific data libraries

• Handle data on a higher level

• Add more information to the data (metadata)

– Size of the data structure

– Information about the numerical format

– Read and write data structures by name

– … or add units to your data

• Two widely used libraries are available

– NetCDF

– HDF-5


HDF-5

• Hierarchical Data Format (HDF), developed since 1988 at NCSA (University of Illinois)

– http://hdf.ncsa.uiuc.edu/HDF5/

• Has gone through a long history of changes; the most recent version, HDF-5, has been available since 1999

• HDF-5 supports

– Very large files

– A parallel I/O interface

– Fortran, C, and Java bindings


HDF-5 dataset

• A multi-dimensional array of basic data elements

• A dataset consists of

– Header + data

• The header consists of

– Name

– Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes

– Dataspace: defines size and shape of the multidimensional array. Dimensions can be fixed or unlimited.

– Storage layout: defines how the multidimensional array is stored in the file. Can be contiguous or chunked.


Example of an HDF-5 file

HDF5 "tempseries.h5" {
GROUP "/" {
   GROUP "tempseries" {
      DATASET "height" {
         DATATYPE { "H5T_STD_I32BE" }
         DATASPACE { ARRAY ( 4 ) ( 4 ) }
         DATA {
            0, 50, 100, 150
         }
         ATTRIBUTE "units" {
            DATATYPE { "undefined string" }
            DATASPACE { ARRAY ( 0 ) ( 0 ) }
            DATA {
               unable to print
            }
         }
      }
      DATASET "temperature" {
         DATATYPE { "H5T_IEEE_F32BE" }
         DATASPACE { ARRAY ( 3, 8, 4 ) ( H5S_UNLIMITED, 8, 4 ) }
         DATA { … }
      }
   }
}
}


Storage layout: contiguous vs. chunked

[Figure: an 8×8 array stored contiguously (elements 1–64 laid out one after another in the file) vs. chunked (the same array stored as four 4×4 chunks, each chunk contiguous in the file).]

• Advantages and disadvantages of chunking

– Accessing rows and columns requires the same number of accesses

– Data can be extended in all dimensions

– Efficient storage of sparse arrays

– Can improve caching


HDF-5 groups

• An HDF-5 group is a collection of datasets

– Comparable to a directory in a UNIX-like file system

• HDF-5 naming convention

– All API functions start with H5

– The next character identifies the category of the function

• H5F: functions handling files

• H5G: functions handling groups

• H5D: functions handling datasets

• H5S: functions handling dataspaces

• H5A: functions handling attributes


Typical steps for writing a sequential HDF-5 file

1. Create the file:        h5file = H5Fcreate(…)
2. Create a group (opt.):  group = H5Gcreate(h5file, …)
3. Define a dataspace:     tspace = H5Screate_simple(ndims, dims, maxdims);
4. Define a datatype:      ttype = H5T_IEEE_F32BE;
5. Create a dataset:       tset = H5Dcreate(group, "testset", ttype, tspace, …);
6. Add attributes:         tattr = H5Acreate(tset, "units", H5T_C_S1, …);
                           H5Awrite(tattr, H5T_C_S1, "meter");
7. Write data:             H5Dwrite(tset, H5T_IEEE_F32BE, …, data);
8. Close all objects
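
Putting the steps together, a minimal self-contained write program might look as follows (a sketch assuming the HDF5 1.8+ C API; the dataset contents are illustrative):

/* Minimal sketch: write a 1-D float dataset into a group of a new
   HDF5 file, then close everything in reverse order of creation. */
#include <hdf5.h>

int main(void)
{
    float   data[4] = { 0.0f, 50.0f, 100.0f, 150.0f };
    hsize_t dims[1] = { 4 };

    /* 1. Create the file (truncate if it exists) */
    hid_t file = H5Fcreate("tempseries.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    /* 2. Create a group */
    hid_t group = H5Gcreate(file, "tempseries",
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* 3. Define the dataspace: a fixed-size 1-D array of 4 elements */
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* 4./5. Create the dataset with an IEEE 32-bit big-endian file type */
    hid_t dset = H5Dcreate(group, "height", H5T_IEEE_F32BE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* 7. Write the data; the memory type is the native float type */
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* 8. Close all objects */
    H5Dclose(dset);
    H5Sclose(space);
    H5Gclose(group);
    H5Fclose(file);
    return 0;
}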


Reading an HDF-5 file – structure of the file known

1. Open the file:                   h5file = H5Fopen(…)
2. Open the group:                  group = H5Gopen(h5file, "tempseries");
3. Open each dataset in the group:  tset = H5Dopen(group, "temperature");
4. Look up dimensions:              tspace = H5Dget_space(tset);
                                    H5Sget_simple_extent_dims(tspace, dims, …);
5. Read data:                       H5Dread(tset, H5T_IEEE_F32BE, ttype, tspace, …, buffer);
6. Read attributes:                 tattr = H5Aopen_name(tset, "units");
                                    attrtype = H5Aget_type(tattr);
                                    H5Aread(tattr, attrtype, attr);
7. Read comments
8. Close all objects
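
A corresponding minimal read program, assuming the file produced by the write sketch above (again a sketch against the HDF5 1.8+ C API):

/* Minimal sketch: open the "height" dataset, look up its size,
   and read it back into native floats. */
#include <hdf5.h>
#include <stdio.h>

int main(void)
{
    float   data[4];
    hsize_t dims[1];

    hid_t file  = H5Fopen("tempseries.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t group = H5Gopen(file, "tempseries", H5P_DEFAULT);
    hid_t dset  = H5Dopen(group, "height", H5P_DEFAULT);

    /* Look up the dimensions before reading */
    hid_t space = H5Dget_space(dset);
    H5Sget_simple_extent_dims(space, dims, NULL);

    /* Read into native floats; HDF5 converts from the file type */
    H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    for (hsize_t i = 0; i < dims[0]; i++)
        printf("%g\n", data[i]);

    H5Sclose(space);
    H5Dclose(dset);
    H5Gclose(group);
    H5Fclose(file);
    return 0;
}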


Compound Datatypes

• An abstraction for user-defined structures

– Has a fixed size

– Each member has its own name, datatype, reference, and byte offset

h5type = H5Tcreate(H5T_class_t class, size_t size);
H5Tinsert(h5type, const char *name, size_t offset, hid_t field_id);
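
For illustration, mapping a C struct to a compound datatype (the struct and member names are assumptions, not from the slides):

/* Sketch: describe a C struct to HDF5 as a compound datatype so it
   can be written to a dataset member by member. */
#include <hdf5.h>
#include <stddef.h>   /* offsetof */

typedef struct {
    int   id;           /* illustrative fields */
    float temperature;
} sample_t;

hid_t make_sample_type(void)
{
    hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(sample_t));
    /* each insert names a member and gives its byte offset and type */
    H5Tinsert(t, "id",          offsetof(sample_t, id),          H5T_NATIVE_INT);
    H5Tinsert(t, "temperature", offsetof(sample_t, temperature), H5T_NATIVE_FLOAT);
    return t;   /* caller releases with H5Tclose() */
}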


Hyperslab

• A hyperslab is a portion of a dataset

– Operator: H5S_SELECT_SET, H5S_SELECT_OR

– Start: array determining the starting coordinates of the hyperslab

– Stride: array indicating which elements along a dimension are to be selected

– Count: array determining how many points to use in each dimension

– Block: array determining the size of the element block selected in each dimension

H5Sselect_hyperslab(hid_t space_id, H5S_seloper_t operator,
                    const hssize_t *start, const hsize_t *stride,
                    const hsize_t *count, const hsize_t *block);


Example using hyperslabs

/* Define hyperslab in the dataset. */
offset[0] = 1; count[0] = NX_SUB;
offset[1] = 2; count[1] = NY_SUB;
status = H5Sselect_hyperslab(dataspace, H5S_SELECT_SET, offset,
                             NULL, count, NULL);

/* Read data from hyperslab in file into hyperslab in memory */
status = H5Dread(dataset, H5T_NATIVE_INT, memspace, dataspace,
                 H5P_DEFAULT, data_out);

[Figure: the selected region of the dataset, starting at (offset[0], offset[1]) with extent count[0] × count[1].]

Examples taken from the HDF-5 webpage.


More complex example using hyperslabs

count[0]  = 1;
count[1]  = dimsmem[1];
block[0]  = dimsfile[0];
block[1]  = 1;
offset[0] = 0;
offset[1] = mpi_rank;
stride[0] = 1;
stride[1] = 2;

For dimension x: you generate count[x] entries of block[x] elements starting from offset[x]. The distance between consecutive entries is stride[x].

[Figure: the memory buffers of processes 0 and 1 (dimsmem[0] × dimsmem[1]) mapped to interleaved columns of the file; offset[1] is 0 on rank 0 and 1 on rank 1.]

Examples taken from the HDF-5 webpage.


Parallel I/O with HDF-5

• Relies on MPI I/O

• The program has to use special properties (hints) indicating that parallel I/O should be used during

– File creation and file open

– Data access

• Properties are set through the H5P… functions

/* example for using file access properties */
fileprops = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fileprops, MPI_COMM_WORLD, MPI_INFO_NULL);
h5file = H5Fcreate("tempseries.h5", …, fileprops);


Parallel data access in HDF-5

• The application has to define a set of interleaved file data spaces on the processes that will access the file

– A technique similar to setting the file view in MPI I/O

– Usually based on defining hyperslabs

• Data transfer properties have to be set (a combined sketch follows below)

/* example for setting data transfer properties */
xferprops = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xferprops, H5FD_MPIO_COLLECTIVE);
H5Dwrite(tset, …, xferprops);
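
Combining both property lists, a hedged end-to-end sketch in which each MPI rank collectively writes one row of a shared 2-D dataset (file, dataset names, and sizes are illustrative; assumes an MPI-enabled HDF5 1.8+ build):

/* Each rank selects its own row of a shared dataset as a hyperslab
   and writes it with a collective transfer. */
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* create the file for parallel access */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* shared dataset: one row of 4 floats per process */
    hsize_t dims[2] = { (hsize_t)size, 4 };
    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "rows", H5T_IEEE_F32BE, fspace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* each rank selects its own row of the file space */
    hsize_t start[2] = { (hsize_t)rank, 0 }, count[2] = { 1, 4 };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);

    /* collective data transfer */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    float row[4] = { (float)rank, (float)rank, (float)rank, (float)rank };
    H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, fspace, dxpl, row);

    H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}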