RFC: Breaking Free from the Collective Requirement for HDF5 Metadata Operations

Posted 12-Jan-2016

Page 1: RFC: Breaking Free from the Collective Requirement for HDF5 Metadata Operations

Page 2

Parallel I/O

• One or more processes accessing data on disk at the same time (usually within an application)

• Used on platforms that access data on parallel file systems (Lustre, GPFS, PVFS2):
– Multiple I/O servers
– Data is striped across servers

• Several parallel programming models are available:
– MPI is the dominant one

Page 3

Message Passing Interface

• MPI: standardized interface for message passing between processes:
– point-to-point communication
– collective communication
– one-sided communication (Remote Memory Access)
– derived datatypes
– process grouping (MPI Communicator)
– parallel I/O standard (MPI-I/O)

Page 4

MPI-I/O

• Standard interface for doing I/O
• Relaxed consistency semantics:
– scope
– synchronization

• Concept of a file view, derived datatypes
• Individual and collective I/O operations
• Blocking & non-blocking I/O operations

Page 5

Example

• 4 processes accessing a 2D matrix stored in row-major format
• Each process can define the portions it will access as its file view (optional)
• Access can be defined at I/O time using derived datatypes (similar to HDF5 file selections)

• Individual I/O: each process posts its access to disk independently

• Collective I/O: all processes participate in the I/O operations; different algorithms under the hood optimize access to disk.

Page 6

Parallel HDF5 (I)

• MPI-I/O offers a limited interface to the user:
– A file is seen as a stream of bytes.
– The user is responsible for structuring the data in the file using MPI derived datatypes, which is very difficult for a scientific application programmer.

• HDF5 offers a higher level of access through different objects and I/O access modes.

• Parallel HDF5 is layered on top of MPI-I/O to hide the cumbersomeness of using MPI-I/O to access structured data.

• The user gets the convenient HDF5 interface, but the HDF5 library still has to provide efficient and optimized access to data on disk.

Page 7

Parallel HDF5 (II)

• Parallel HDF5 still imposes some limitations on the user:
– PHDF5 allows the user to access data on disk individually or collectively; however, operations that modify the metadata itself are required to be collective.

• While this requirement satisfies application needs in some cases, in other cases it adds significant overhead and restrictions on application developers.

• A similar requirement is placed internally on space allocation, where processes need to communicate in a collective manner to be able to allocate space.

Page 8

RFC

• In this RFC, we propose several design options that remove the collective requirement for space allocation and for operations that modify the file metadata.

• Any solution has to achieve the following:
– Allow an independent (but globally visible), scalable "fetch-and-add" operation on the "end of allocated space" (EOA) value in the file (to allow new space in the file to be allocated to an individual process without involving all the other processes).
– Allow [the metadata for] an object to be locked, updated, unlocked, and made visible to other processes without requiring a collective operation (to allow changes to the structure of a file).

Page 9

Use case 1

• A parallel application with n processes, where each process creates a dataset in the root group for its private use.
• This is currently done by having all processes call the dataset creation function n times, despite the fact that each process needs only 1 dataset.

[Figure: root group "/" containing six datasets, P1 through P6, one per process]

Page 10

Use case 2

• A process within a parallel application does independent reads/writes to a dataset while another process comes in and modifies the metadata of that dataset, possibly making the former process's access obsolete.

• Currently, this use case cannot happen, because all processes are required to be present at the time of metadata modification. However, it becomes possible once the collective requirement on metadata modification is removed, and should be taken into consideration by any design option.

Page 11

Use cases 3, 4, 5

• Collective read/write to a dataset, where all processes that opened the file are required to call the function, with or without data to read/write.

• 2 or more processes each creating an attribute on its own dataset. In the current implementation, similar to use case 1, the attribute creation function needs to be collective, even though the attributes are being created on different datasets.

• 2 or more processes creating an attribute on a shared dataset. A solution that makes this operation independent has to take into account that other processes can create an attribute (or any other object) at the same time, and those objects might overlap in the file if they are not serialized.

Page 12

Space Allocation at EOA

• One-sided fetch-and-add problem.
• MPI does not provide a fetch-and-add (FAD) implementation, but does provide one-sided RMA operations (Put, Get, Accumulate).

• We use an algorithm introduced by Bill Gropp in Using MPI-2 to implement an MPI-based FAD operation.

Page 13

MPI FAD

• Create an MPI window on all processes, with the root process exposing a memory buffer where the fetch-and-add will take place.

• All processes wanting to allocate space from the EOA execute the following code section to get the current EOA and increment it atomically:

MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
MPI_Accumulate(increment_by_x, nlevels, MPI_INT, 0, 0, 1, acc_type, MPI_SUM, win);
MPI_Get(get_array, nlevels, MPI_INT, 0, 0, 1, get_type, win);
MPI_Win_unlock(0, win);

• The current EOA for this process is then the increment plus all the values in get_array.
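The semantics of this Gropp-style fetch-and-add can be sketched in plain C as a single-address-space simulation (no MPI; `window`, `local_total`, and `eoa_fetch_and_add` are hypothetical names, not part of the design): each process owns one slot in the root's buffer, the MPI_Accumulate becomes an add to the caller's own slot, and the MPI_Get becomes a read of the other slots.

```c
#define NPROCS 4

/* Hypothetical single-address-space sketch of the fetch-and-add used for
 * EOA allocation.  In the real design each process owns one slot in an
 * MPI window exposed by the root; MPI_Accumulate adds to the caller's
 * slot and MPI_Get reads the other slots, all inside one exclusive
 * lock/unlock epoch.  Here the "window" is just an array and the epoch
 * is implicit (single-threaded simulation). */
static long window[NPROCS];       /* per-process contributions (root's buffer) */
static long local_total[NPROCS];  /* each process's own running contribution   */

/* Reserve `nbytes` of file space for process `rank` and return the
 * offset (old EOA) at which its new allocation starts. */
long eoa_fetch_and_add(int rank, long nbytes)
{
    long others = 0;
    for (int i = 0; i < NPROCS; i++)   /* "MPI_Get" of the other slots   */
        if (i != rank)
            others += window[i];
    long old_eoa = others + local_total[rank];
    window[rank] += nbytes;            /* "MPI_Accumulate" on own slot   */
    local_total[rank] += nbytes;
    return old_eoa;                    /* start offset of the new block  */
}
```

Successive calls hand out non-overlapping regions: each caller's offset is the sum of everything allocated so far, regardless of which rank allocated it.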

Page 14

Performance

• Some of the disadvantages of this approach:
– It needs a set-aside process, or else the root process will not be able to serve the RMA operations promptly.
– We hit performance issues when a large number of processes try to fetch-and-add at the same time repeatedly; however, we don't see that as a common case.

Num_procs       N             N / 2         1
128             1.137938      0.628627      0.015955
256             2.459199      1.314574      0.015274
512             5.186699      2.874677      0.030334
1024            13.985337     6.528552      0.027759
5120            249.922206    59.980488     0.039649
10240           1923.69371    215.603612    0.049559

Page 15

Independent Metadata Operations

• 3 classes of solutions:
– Distributed Lock Manager (DLM)
– Synchronization Step
– Metadata Server (MDS)
  • Enhanced MDS
  • Data Staging

Page 16

DLM

• Use the DLM to serialize access to an HDF5 object by requiring any process that needs access to the object to obtain a lock (shared/exclusive)

• DLM implementations:
– RMA & point-to-point
– Point-to-point non-blocking operations

Page 17

RMA & Point-to-Point (I)

• Allocate an MPI window at the DLM with size = n bits, n being the number of processes.
• If a process k wants a lock, then using one RMA access epoch:
– get the value of the n-1 bits that correspond to the status of the other processes
– switch the kth bit on to indicate that process k has asked for the lock.

• If all n-1 bits are zero, it is safe to acquire the lock; otherwise, the process posts an MPI_Recv from MPI_ANY_SOURCE that blocks until some process tells it that it is safe to acquire the lock.

• To release the lock, another access epoch is done to the root process to switch the kth bit off and determine which process the lock needs to be sent to (if any).

• Shared locks can be implemented by extending this algorithm to use an n×2 bit field.
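The acquire/release decision logic above can be sketched as a single-address-space simulation in plain C (no MPI; `bits`, `lock_request`, and `lock_release` are hypothetical names). The bit field that really lives in an MPI window on the DLM process is just an array here, and each "RMA access epoch" collapses to a read-modify step.

```c
#include <stdbool.h>

#define NPROCS 5

/* bits[i] == 1 means process i has requested (or currently holds) the
 * lock.  In the real design this field lives in an MPI window on the
 * DLM and is read and modified inside one RMA access epoch. */
static int bits[NPROCS];

/* Returns true if process k may take the lock immediately; otherwise k
 * must block (MPI_Recv from MPI_ANY_SOURCE) until the holder hands the
 * lock over. */
bool lock_request(int k)
{
    bool others = false;
    for (int i = 0; i < NPROCS; i++)   /* one epoch: get the other bits ... */
        if (i != k && bits[i])
            others = true;
    bits[k] = 1;                        /* ... and switch the kth bit on     */
    return !others;
}

/* Clears k's bit and returns the rank the lock should be sent to, or -1
 * when no other process is waiting. */
int lock_release(int k)
{
    bits[k] = 0;                        /* second epoch: switch the bit off  */
    for (int i = 0; i < NPROCS; i++)
        if (bits[i])
            return i;                   /* a waiter exists: hand the lock on */
    return -1;
}
```

Replaying the scenario on the next slide: process 2 requests and is granted the lock; 0 and 1 then request and wait; releasing walks the lock to 0, then 1, then back to free.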

Page 18

RMA & Point-to-Point (II)

[Figure: timeline of the 5-bit lock field. Process 2 requests the lock and is granted it (0 0 1 0 0); processes 0 and 1 request the lock and wait (1 1 1 0 0); process 2 releases the lock and sends it to process 0, which receives it (1 1 0 0 0); process 0 releases the lock and sends it to process 1, which receives it (0 1 0 0 0); process 1 releases the lock (0 0 0 0 0).]

Page 19

DLM

• In addition to granting locks, the DLM must also handle metadata changes. The process releasing a lock is responsible for pushing all the changes it made to the metadata to the DLM or to the process the lock is being sent to.

• Once the accumulated changes are large enough, a flush to disk can be initiated from the DLM or from any process that holds the lock on an object.

• Since there are multiple objects that need to be locked, the DLM needs to maintain a queue (the n-bit field) for each object.

• Deadlocks need to be dealt with.

Page 20

Synchronization Step

• Each process modifies its metadata cache independently and logs the changes in a list.

• Whenever a collective operation, or a flush, is issued by the user, the processes enter a sync point and resolve any conflicts that appear in the logs.

• This approach is meant for applications where the user guarantees that the HDF5 use of metadata is simple enough that the resolve step does not encounter unresolvable conflicts.

• The metadata operations are not entirely independent, since a sync step is still required.
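The resolve step can be sketched as follows (a minimal illustration, not HDF5 code; `md_change` and `resolve_logs` are hypothetical names): each process's log is a list of dirtied metadata entries, the gathered logs are scanned at the sync point, and a conflict is flagged when two different processes touched the same object.

```c
#include <stddef.h>

/* Hypothetical log entry: which metadata object a process dirtied. */
struct md_change {
    int object_id;   /* which metadata object was modified */
    int proc;        /* rank that made the change          */
};

/* Scan the gathered logs for conflicts.  Returns the object id of the
 * first conflict found, or -1 when the logs merge cleanly (the case the
 * user is expected to guarantee). */
int resolve_logs(const struct md_change *log, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (log[i].object_id == log[j].object_id &&
                log[i].proc != log[j].proc)
                return log[i].object_id;
    return -1;
}
```

A real resolve step would also merge the non-conflicting entries into the file's metadata; the sketch only shows the conflict-detection decision that makes this design "simple metadata use only".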

Page 21

MDS (I)

• A server opens the file and reads/writes metadata to disk, instead of just managing access to metadata (as the DLM does).

• Clients do not touch the metadata on disk.

• The client, for example, sends a request to the MDS to create a dataset. The MDS creates the dataset and returns it to the client.

• Locking still needs to be done when a client wants to do raw data reads/writes on a mutable object, to make sure that the structure of the object does not change.

Page 22

MDS (II)

• Having 1 MDS is not scalable, so at some point we need more than one.

• To handle hot objects, we can use a DHT (distributed hash table) to replicate the objects across the servers.

• Clients can still cache metadata and bypass the MDS, or keep communication with the MDS very simple in certain situations.
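One way DHT-style placement could work is sketched below (purely illustrative; the RFC does not specify a hash or replication scheme, and `dht_hash`, `dht_locate`, and the constants are hypothetical): an object id is hashed to pick the servers that hold its replicated metadata, so every client computes the same server set without asking anyone.

```c
#define NSERVERS  4   /* number of metadata servers (illustrative)      */
#define NREPLICAS 2   /* replicas per object to spread hot-object load  */

/* Knuth-style multiplicative hash of an object id (illustrative). */
unsigned dht_hash(unsigned object_id)
{
    return object_id * 2654435761u;
}

/* Fills servers[0..NREPLICAS-1] with the ranks of the metadata servers
 * that replicate this object: the hashed home server plus its
 * successors on the ring of NSERVERS servers. */
void dht_locate(unsigned object_id, int servers[NREPLICAS])
{
    unsigned h = dht_hash(object_id);
    for (int i = 0; i < NREPLICAS; i++)
        servers[i] = (int)((h + i) % NSERVERS);
}
```

Because placement is a pure function of the object id, a client can send a read for a hot object to any of its replicas, which is the load-spreading effect the slide is after.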

Page 23

Enhanced MDS (I)

• Objects in HDF5 will be marked as private or public.

• Access to the metadata of public objects has to go through the MDS, whereas access to the metadata of private objects can be done locally.

• If a private object needs to be made public, the process that holds it needs to transfer "ownership" to the MDS; that includes either pushing all the metadata to the MDS, or writing the metadata to disk and informing the MDS about it.
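The private/public distinction amounts to a small ownership state machine, sketched here in plain C (illustrative only; `hobj`, `obj_update_local`, and `obj_make_public` are hypothetical names, not HDF5 API): local updates are legal only while the holding rank owns the object, and publishing pushes the dirty metadata and hands ownership to the MDS, after which local updates must be refused.

```c
/* Who currently owns an object's metadata. */
enum md_owner { OWNER_LOCAL, OWNER_MDS };

struct hobj {
    enum md_owner owner;
    int owner_rank;      /* rank holding the object while private */
    int dirty_entries;   /* metadata not yet pushed to the MDS    */
};

/* Local metadata update: only legal while the object is private and
 * held by the calling rank.  Returns 0 on success, -1 when the update
 * must be routed through the MDS instead. */
int obj_update_local(struct hobj *o, int rank)
{
    if (o->owner != OWNER_LOCAL || o->owner_rank != rank)
        return -1;
    o->dirty_entries++;
    return 0;
}

/* Publish: push all dirty metadata to the MDS (or write it to disk and
 * inform the MDS) and transfer ownership.  Returns 0 on success. */
int obj_make_public(struct hobj *o, int rank)
{
    if (o->owner != OWNER_LOCAL || o->owner_rank != rank)
        return -1;
    o->dirty_entries = 0;    /* "pushing all the metadata to the MDS" */
    o->owner = OWNER_MDS;
    return 0;
}
```

After `obj_make_public` succeeds, every further metadata access (even by the former holder) has to go through the MDS, which is exactly the public-object rule on this slide.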

Page 24

Enhanced MDS (II)

[Figure: two MDS processes, each serving a group of local (client) processes]

Page 25

Data Staging

• Similar to what ADIOS does.
• Set aside processes to handle all HDF5 operations (metadata and raw data).
• Asynchronous I/O operations:
– Must not interfere with the application

Page 26

Progress Function

• An alternative to setting aside processes.
• A library on top of MPI that the user calls to make progress.
• The user is responsible for inserting progress functions in the application code to allow designated "MDS" or "DLM" processes to handle outstanding requests.

Page 27

Recommendation

• As mentioned earlier, this work is still in the design phase.
• While we have a tentative roadmap for how to proceed with this work, things might change for various reasons, such as new input from the HPC community, funding, etc.

• We are looking at proceeding with:
– the Gropp FAD approach for the space allocation problem
– the Enhanced MDS approach for the metadata operations, while moving later towards Data Staging.

• We would like to provide both options to the user: setting aside processes, or calling progress functions in the application.