COSC 6374 – Parallel Computation
Collective I/O and Scientific Data Libraries
Mohamad Chaarawi and Edgar Gabriel
Spring 2009
Collective I/O Operations
• Same notion as collective communication
• All processes in the communicator participate in the
operation
• How do we benefit from Collective I/O?
Developing Collective I/O Algorithms
• Three classes of algorithms will be explained
– Dynamic segmentation algorithms
– Static segmentation algorithms
– Individual algorithms
• Examples using MPI_File_write_all
int MPI_File_write_all(MPI_File fh, void *buf, int count, MPI_Datatype datatype,
MPI_Status *status)
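A minimal sketch of how MPI_File_write_all is used; the file name out.dat, the block size COUNT, and the rank-based offsets are illustrative assumptions, not from the lecture:

#include <mpi.h>

#define COUNT 1024

int main(int argc, char **argv)
{
    int rank, buf[COUNT];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < COUNT; i++) buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* each rank positions its individual file pointer at its own block */
    MPI_File_seek(fh, (MPI_Offset)rank * COUNT * sizeof(int), MPI_SEEK_SET);
    /* collective write: all processes of the communicator participate */
    MPI_File_write_all(fh, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}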
Dynamic Segmentation Algorithm
• Group processes according to the number of writers
• All processes share location information about the data
to be written within their group (MPI_Allgather)
– File offsets
– Number of elements to be written
• Sort these lists in ascending order of the file offsets.
• Write a fixed number of bytes to disk in each cycle.
– Cycle buffer size (CBS)
– Using MPI_Gatherv, each process sends the elements it contributes in the current cycle to the writer process assigned to it (see the sketch below).
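A hedged sketch of this idea, under strong simplifying assumptions: a single writer group with rank 0 as the only writer, one contiguous block per process laid out in rank order, and a CBS that is a multiple of the block size so no contribution straddles a cycle boundary. A real implementation gathers variable-length offset/length lists, sorts them, and splits straddling blocks; all names here are illustrative.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BLK 1024          /* bytes contributed per process (assumption) */
#define CBS 2048          /* cycle buffer size: bytes written per cycle */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *mydata = malloc(BLK);
    memset(mydata, 'A' + rank % 26, BLK);

    /* Step 1: share the (offset, length) of every contribution. With one
       block per process this is a single pair each; in general these are
       variable-length lists that must be sorted by offset. */
    long my_info[2] = { (long)rank * BLK, BLK }, *all_info;
    all_info = malloc(2 * (size_t)size * sizeof(long));
    MPI_Allgather(my_info, 2, MPI_LONG, all_info, 2, MPI_LONG,
                  MPI_COMM_WORLD);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Step 2: per cycle, the writer gathers the contributions falling
       into the next CBS bytes of the file and writes them out. */
    int per_cycle = CBS / BLK;            /* processes served per cycle */
    char *cycle_buf = (rank == 0) ? malloc(CBS) : NULL;

    for (int start = 0; start < size; start += per_cycle) {
        int n = (size - start < per_cycle) ? size - start : per_cycle;
        int *counts = calloc(size, sizeof(int));
        int *displs = calloc(size, sizeof(int));
        for (int i = 0; i < n; i++) {
            counts[start + i] = BLK;
            displs[start + i] = i * BLK;
        }
        int mine = (rank >= start && rank < start + n) ? BLK : 0;
        MPI_Gatherv(mydata, mine, MPI_BYTE, cycle_buf, counts, displs,
                    MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank == 0)   /* write this cycle's data at its file offset */
            MPI_File_write_at(fh, (MPI_Offset)all_info[2 * start],
                              cycle_buf, n * BLK, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        free(counts); free(displs);
    }
    MPI_File_close(&fh);
    free(mydata); free(all_info); free(cycle_buf);
    MPI_Finalize();
    return 0;
}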
Fixed vs. Scaling CBS
• Scaling:
– Each writer writes the specified CBS in one cycle, so in each cycle the total amount of data written to disk is
CBS * number_of_writers
• Fixed:
– The cycle buffer size is divided among all writers, so in every cycle each writer writes
CBS / number_of_writers
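For example (numbers made up for illustration): with CBS = 4 KB and 2 writers, the scaling variant puts 2 * 4 KB = 8 KB on disk per cycle, while the fixed variant puts 4 KB on disk per cycle (2 KB per writer).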
Example: Write Dynamic Segmentation
[Figure: three processes, each holding six 1 KB blocks (process 0, the writer: blocks 1–6; process 1: blocks 7–12; process 2: blocks 13–18), written to a contiguous file. Per process: 6 KB; cycle size: 5 KB. After cycle 1 blocks 1–5 are on disk, after cycle 2 blocks 1–10, after cycle 3 blocks 1–15, and after cycle 4 all 18 blocks.]
Static Segmentation Algorithm
• Data is gathered from all processes at a root process
which will perform the low-level write operation.
• Data is written in fixed chunks, with the size of the
chunk being a configurable parameter.
– The root process gathers a fixed number of bytes from all
processes in each cycle.
• Because every process contributes a constant amount of data in every cycle, this algorithm makes better use of the communication resources in the cluster (see the sketch below).
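A hedged sketch of static segmentation, with all names illustrative: each process owns a contiguous NB*SS-byte file region laid out in rank order, and per cycle rank 0 gathers a constant SS bytes from every process and writes each segment into the owner's region.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define SS 2048   /* segment size gathered per process per cycle */
#define NB 4      /* number of cycles (segments per process) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *mydata = malloc(NB * SS);           /* this rank's file region */
    memset(mydata, 'A' + rank % 26, NB * SS);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    char *gatherbuf = (rank == 0) ? malloc((size_t)size * SS) : NULL;

    for (int cycle = 0; cycle < NB; cycle++) {
        /* constant per-process contribution -> a plain MPI_Gather works */
        MPI_Gather(mydata + cycle * SS, SS, MPI_BYTE,
                   gatherbuf, SS, MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank == 0)   /* root writes each segment into the owner's
                            region of the file */
            for (int p = 0; p < size; p++)
                MPI_File_write_at(fh,
                    (MPI_Offset)p * NB * SS + (MPI_Offset)cycle * SS,
                    gatherbuf + (size_t)p * SS, SS, MPI_BYTE,
                    MPI_STATUS_IGNORE);
    }
    MPI_File_close(&fh);
    free(mydata); free(gatherbuf);
    MPI_Finalize();
    return 0;
}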
Static Segmentation Algorithm
• This algorithm is a good fit for the following scenarios:
– With huge caches on the I/O nodes, the storage devices are decoupled from the compute cluster and thus show, from the application's perspective, virtually no sensitivity to irregular or strided file access patterns.
– Solid-state drives (SSDs): insensitive to irregular access patterns within the file.
Example: Write Static Segmentation
[Figure: the same three processes and file layout as above (six 1 KB blocks per process). Per process: 6 KB; cycle size: 2 KB, i.e. each process contributes two blocks per cycle. Cycle 1 puts blocks 1, 2, 7, 8, 13, 14 on disk; cycle 2 adds blocks 3, 4, 9, 10, 15, 16; cycle 3 completes all 18 blocks.]
Individual Algorithm
• Avoids communication operations entirely: each process writes its data individually to the hard drive.
• Extended by a scheduling approach that controls the number of processes concurrently performing I/O operations, and thus limits the burden on the meta-data servers of some file systems (see the sketch below).
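A hedged sketch of the scheduled variant: writes proceed in waves so that only a subset of processes performs I/O at any time. MPI_Barrier serves here as a simple stand-in for a real scheduler; NBYTES, WAVES, and the file name are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define NBYTES 4096   /* bytes written per process (assumption) */
#define WAVES  4      /* number of waves; limits concurrent writers */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *mydata = malloc(NBYTES);
    memset(mydata, 'A' + rank % 26, NBYTES);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* schedule: only every WAVES-th process writes in a given wave,
       limiting the number of concurrent requests hitting the
       meta-data server */
    for (int wave = 0; wave < WAVES; wave++) {
        if (rank % WAVES == wave)   /* individual, independent write */
            MPI_File_write_at(fh, (MPI_Offset)rank * NBYTES, mydata,
                              NBYTES, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);   /* wait for the wave to finish */
    }
    MPI_File_close(&fh);
    free(mydata);
    MPI_Finalize();
    return 0;
}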
Test Environment
[Figure: 24 compute nodes (Node 0 … Node 23) running Process 0 … Process 23, plus a front-end node and a meta-data server, connected via InfiniBand (IB) and Gigabit Ethernet (GE). File systems: PVFS2, NFS.]
Test Case
• A simple benchmark in which the processes collectively write max_size bytes of data to the file for a given number of iterations
• We measure the execution time required to write ALL data to the file
• Each test has been executed three times, taking the maximum bandwidth achieved
• The goal is to compare the performance while varying the algorithms, their parameters, and the file system
PVFS2 Results
• Each process executes MPI_File_write_all operations writing 20 MB of data per function call, for a total of 1 GB per process. Thus, the overall file size is 24 GB for the 24-process test cases and 48 GB for the 48-process test cases.
CBS = 20 MB – 24 Processes
[Chart: bandwidth results]
CBS = 20 MB – 48 Processes
[Chart: bandwidth results]
SS = 20 MB – 24 Processes
[Chart: bandwidth results]
NFS Results
• Each process writes 100 MB of data, in order to keep the execution time reasonable
• The segment size is kept constant at 2 MB
• The cycle buffer size is varied among 1 MB, 10 MB, and 20 MB
SS = 2 MB – 24 Processes
[Chart: bandwidth results]
SS = 2 MB – 48 Processes
[Chart: bandwidth results]
Tuning Methodologies
• The results show that many factors affect the performance of collective I/O operations:
– File system
– Network interconnect
– Number of processes
– Size of the file
– Algorithmic parameters (SS, CBS)
• Static tuning: prior to the execution of the application, tune for the best algorithmic/parametric settings for a certain platform
• Dynamic tuning: tune at runtime
Motivation
• MPI I/O is good
– It knows about data types (=> data conversion)
– It can optimize various access patterns in applications
• MPI I/O is bad
– It does not store any information about the data type
• A file written as MPI_INT can be read as MPI_DOUBLE by another application
• No information is stored about whether the data is a two-dimensional array or anything else
Scientific data libraries
• Handle data on a higher level
• Add more information to the data (metadata)
– Size of the data structure
– Information about the numerical format
– Read and write data structures by name
– … or add units to your data
• Two widely used libraries are available
– NetCDF
– HDF-5
HDF-5
• The Hierarchical Data Format (HDF) has been developed since 1988 at NCSA (University of Illinois)
– http://hdf.ncsa.uiuc.edu/HDF5/
• Has gone through a long history of changes; the current version, HDF-5, has been available since 1999
• HDF-5 supports
– Very large files
– Parallel I/O interface
– Fortran, C, Java bindings
HDF-5 dataset
• Multi-dimensional array of basic data elements
• A dataset consists of
– Header + data
• Header consists of
– Name
– Datatype: basic (e.g. H5T_NATIVE_FLOAT) or compound datatypes
– Dataspace: defines the size and shape of a multidimensional array. Dimensions can be fixed or unlimited.
– Storage layout: defines how multidimensional arrays are stored in the file. Can be contiguous or chunked.
Example of an HDF-5 file

HDF5 "tempseries.h5" {
GROUP "/" {
   GROUP "tempseries" {
      DATASET "height" {
         DATATYPE { "H5T_STD_I32BE" }
         DATASPACE { ARRAY ( 4 ) ( 4 ) }
         DATA {
            0, 50, 100, 150
         }
         ATTRIBUTE "units" {
            DATATYPE { "undefined string" }
            DATASPACE { ARRAY ( 0 ) ( 0 ) }
            DATA {
               unable to print
            }
         }
      }
      DATASET "temperature" {
         DATATYPE { "H5T_IEEE_F32BE" }
         DATASPACE { ARRAY ( 3, 8, 4 ) ( H5S_UNLIMITED, 8, 4 ) }
         DATA { … }
Storage layout: contiguous vs. chunked

[Figure: an 8×8 array holding the elements 1–64, shown once in contiguous storage (row by row) and once in chunked storage (four 4×4 chunks, stored one chunk at a time).]

• Advantages and disadvantages of chunking:
– Accessing rows and columns requires the same number of accesses
– Data can be extended in all dimensions
– Efficient storage of sparse arrays
– Can improve caching
HDF-5 groups
• An HDF-5 group is a collection of datasets
– Comparable to a directory in a UNIX-like file system
• HDF-5 naming convention
– All API functions start with H5
– The next character identifies the category of functions
• H5F: functions handling files
• H5G: functions handling groups
• H5D: functions handling datasets
• H5S: functions handling dataspaces
• H5A: functions handling attributes
Typical steps for writing a sequential HDF-5 file

1. Create the file
2. Create a group (opt.)
3. Define a dataspace
4. Define a datatype
5. Create a dataset
6. Add attributes
7. Write data
8. Close all objects

h5file = H5Fcreate(…)
group  = H5Gcreate(h5file, …)
tspace = H5Screate_simple(ndims, dims, maxdims);
ttype  = H5T_IEEE_F32BE;
tset   = H5Dcreate(group, "testset", ttype, tspace, …);
tattr  = H5Acreate(tset, "units", H5T_C_S1, …);
H5Awrite(tattr, H5T_C_S1, "meter");
H5Dwrite(tset, H5T_IEEE_F32BE, …, data);
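A runnable version of these steps might look as follows. This is a sketch assuming HDF5 1.8 or later (where H5Gcreate, H5Dcreate, and H5Acreate take additional property-list arguments); the file, group, dataset, and attribute names follow the slides, and the 2×3 sample data is made up.

#include <hdf5.h>

int main(void)
{
    /* hypothetical 2x3 array of floats to be written */
    float   data[2][3] = {{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}};
    hsize_t dims[2]    = {2, 3};

    /* 1. create the file (truncate if it already exists) */
    hid_t h5file = H5Fcreate("tempseries.h5", H5F_ACC_TRUNC,
                             H5P_DEFAULT, H5P_DEFAULT);

    /* 2. create a group */
    hid_t group = H5Gcreate(h5file, "tempseries", H5P_DEFAULT,
                            H5P_DEFAULT, H5P_DEFAULT);

    /* 3. define a dataspace: fixed 2x3, no unlimited dimensions */
    hid_t tspace = H5Screate_simple(2, dims, NULL);

    /* 4./5. create the dataset with a big-endian 32-bit float file type */
    hid_t tset = H5Dcreate(group, "testset", H5T_IEEE_F32BE, tspace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* 6. add a string attribute "units" with the value "meter" */
    hid_t strtype = H5Tcopy(H5T_C_S1);
    H5Tset_size(strtype, 6);                 /* "meter" plus '\0' */
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t tattr  = H5Acreate(tset, "units", strtype, aspace,
                             H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(tattr, strtype, "meter");

    /* 7. write the data; the memory type is the native float format */
    H5Dwrite(tset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* 8. close all objects */
    H5Aclose(tattr); H5Sclose(aspace); H5Tclose(strtype);
    H5Dclose(tset);  H5Sclose(tspace); H5Gclose(group);
    H5Fclose(h5file);
    return 0;
}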
Reading an HDF-5 file – structure of the file known

1. Open the file
2. Open the group
3. Open each dataset in the group
4. Look up dimensions
5. Read data
6. Read attributes
7. Read comments
8. Close all objects

h5file = H5Fopen(…)
group  = H5Gopen(h5file, "tempseries")
tset   = H5Dopen(group, "temperature");
tspace = H5Dget_space(tset);
H5Sget_simple_extent_dims(tspace, dims, …);
H5Dread(tset, H5T_IEEE_F32BE, H5S_ALL, tspace, …, buffer);
tattr  = H5Aopen_name(tset, "units");
attrtype = H5Aget_type(tattr);
H5Aread(tattr, attrtype, attr);
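For completeness, a runnable sketch of these steps, reading back the file produced by the writing example above (assuming HDF5 1.8+; the attribute buffer size is an illustrative assumption):

#include <hdf5.h>
#include <stdlib.h>

int main(void)
{
    hsize_t dims[2];
    char    attr[16];

    /* 1.-3. open file, group, and dataset; the structure (names from
       the writing example above) is assumed to be known */
    hid_t h5file = H5Fopen("tempseries.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t group  = H5Gopen(h5file, "tempseries", H5P_DEFAULT);
    hid_t tset   = H5Dopen(group, "testset", H5P_DEFAULT);

    /* 4. look up the dimensions from the dataset's dataspace */
    hid_t tspace = H5Dget_space(tset);
    H5Sget_simple_extent_dims(tspace, dims, NULL);

    /* 5. read the data, converting to the native float format */
    float *buffer = malloc(dims[0] * dims[1] * sizeof(float));
    H5Dread(tset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buffer);

    /* 6. read the "units" attribute using its stored datatype */
    hid_t tattr    = H5Aopen_name(tset, "units");
    hid_t attrtype = H5Aget_type(tattr);
    H5Aread(tattr, attrtype, attr);

    /* 8. close all objects */
    H5Tclose(attrtype); H5Aclose(tattr);
    H5Sclose(tspace);   H5Dclose(tset);
    H5Gclose(group);    H5Fclose(h5file);
    free(buffer);
    return 0;
}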
Compound Datatypes
• Abstraction for user structures
– Has a fixed size
– Each member has its own name, datatype, reference, and byte offset

h5type = H5Tcreate(H5T_class_t class, size_t size);
H5Tinsert(h5type, const char *name, size_t offset, hid_t field_id);
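A hedged sketch of building and using a compound type, assuming HDF5 1.8+; the sensor_t structure and the file and dataset names are made up. HOFFSET is HDF5's macro for computing member byte offsets.

#include <hdf5.h>

/* hypothetical user structure; all names are illustrative */
typedef struct {
    int    id;
    double temp;
} sensor_t;

int main(void)
{
    sensor_t data[4] = {{0, 21.5}, {1, 22.0}, {2, 20.9}, {3, 23.1}};
    hsize_t  dim     = 4;

    /* build a compound datatype mirroring the struct layout */
    hid_t h5type = H5Tcreate(H5T_COMPOUND, sizeof(sensor_t));
    H5Tinsert(h5type, "id",   HOFFSET(sensor_t, id),   H5T_NATIVE_INT);
    H5Tinsert(h5type, "temp", HOFFSET(sensor_t, temp), H5T_NATIVE_DOUBLE);

    /* write one 1-D dataset of compound elements */
    hid_t h5file = H5Fcreate("sensors.h5", H5F_ACC_TRUNC,
                             H5P_DEFAULT, H5P_DEFAULT);
    hid_t space  = H5Screate_simple(1, &dim, NULL);
    hid_t dset   = H5Dcreate(h5file, "sensors", h5type, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, h5type, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset); H5Sclose(space); H5Tclose(h5type); H5Fclose(h5file);
    return 0;
}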
Hyperslab
• A hyperslab is a portion of a dataset
– Operator: H5S_SELECT_SET, H5S_SELECT_OR
– Start: array determining the starting coordinates of the hyperslab
– Stride: array indicating which elements along a dimension are to be selected
– Count: array determining how many points or blocks to select in each dimension
– Block: array determining the size of the block selected in each dimension

H5Sselect_hyperslab(hid_t space_id, H5S_seloper_t operator,
                    const hssize_t *start, const hsize_t *stride,
                    const hsize_t *count, const hsize_t *block);
Example using hyperslabs
/* Define hyperslab in the dataset. */
offset[0] = 1; count[0] = NX_SUB;
offset[1] = 2; count[1] = NY_SUB;
status = H5Sselect_hyperslab (dataspace, H5S_SELECT_SET, offset,
NULL, count, NULL);
/*Read data from hyperslab in file into hyperslab in memory */
status = H5Dread (dataset, H5T_NATIVE_INT, memspace, dataspace,
H5P_DEFAULT, data_out);
[Figure: the selected hyperslab in the file dataspace, positioned by offset[0] and offset[1] and spanning count[0] × count[1] elements.]
Examples taken from the HDF-5 webpage
More complex example using hyperslabs

[Figure: two processes each select every second column of the file dataset, starting at the column given by their rank; the selected columns are packed contiguously in each process's memory (dimsmem[0] × dimsmem[1]).]

count[0]  = 1;
count[1]  = dimsmem[1];
block[0]  = dimsfile[0];
block[1]  = 1;
offset[0] = 0;
offset[1] = mpi_rank;
stride[0] = 1;
stride[1] = 2;

For dimension x, you generate count[x] blocks of block[x] elements each, starting at offset[x]; the distance between the starting points of consecutive blocks is stride[x].

Examples taken from the HDF-5 webpage
Parallel I/O with HDF-5
• Relies on MPI I/O
• The program has to use special properties (hints) indicating that parallel I/O should be used during
– File creation and file open
– Data access
• Properties are set through the H5P… functions

/* example for setting file access properties */
fileprops = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fileprops, MPI_COMM_WORLD, MPI_INFO_NULL);
h5file = H5Fcreate("tempseries.h5", …, fileprops);
Parallel data access in HDF-5
• The application has to define a set of interleaved file dataspaces on the processes that will access the file
– A similar technique to setting the file view in MPI I/O
– Usually based on defining hyperslabs
• Data transfer properties have to be set (a complete sketch follows below)

/* example for setting data transfer properties */
xferprops = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xferprops, H5FD_MPIO_COLLECTIVE);
H5Dwrite(tset, …, xferprops);
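Putting the pieces together, a minimal parallel write might look as follows. This is a hedged sketch assuming HDF5 1.8+ built with parallel support; every rank writes one row of a (size × NCOLS) dataset, and the file name, dataset name, and NCOLS are illustrative assumptions.

#include <hdf5.h>
#include <mpi.h>

#define NCOLS 4   /* columns per row (assumption) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float row[NCOLS];
    for (int i = 0; i < NCOLS; i++) row[i] = (float)rank;

    /* file access property: open the file with MPI I/O underneath */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t h5file = H5Fcreate("tempseries.h5", H5F_ACC_TRUNC,
                             H5P_DEFAULT, fapl);

    /* one (size x NCOLS) dataset shared by all processes */
    hsize_t dims[2] = {(hsize_t)size, NCOLS};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(h5file, "temperature", H5T_IEEE_F32BE,
                           filespace, H5P_DEFAULT, H5P_DEFAULT,
                           H5P_DEFAULT);

    /* each rank selects its own row of the file dataspace
       (interleaved file dataspaces via hyperslabs) */
    hsize_t offset[2] = {(hsize_t)rank, 0};
    hsize_t count[2]  = {1, NCOLS};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL,
                        count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* transfer property: perform the write collectively */
    hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, xfer, row);

    H5Pclose(xfer); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(h5file);
    MPI_Finalize();
    return 0;
}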