an efficient and transparent transaction management based on the data workflow of hvem datagrid
Post on 24-Jan-2016
Im Young Jung
Seoul National University
An Efficient and Transparent Transaction Management based on the Data Workflow of HVEM DataGrid
Introduction
Transaction management for safe data update and insertion on an e-Science DataGrid
  Heterogeneous storages are used according to the characteristics and the size of the data
  Based on the workflow, data has a storing precedence across heterogeneous storages within a transaction
In this paper
  An efficient and transparent transaction management on HVEM DataGrid
    Dividing the transaction into sub-transactions according to the transaction states, and classifying them
    Transaction hierarchy and parallelism provide
      efficient and safe upload of large data to HVEM DataGrid
      transparency in the transaction, including simultaneous access to heterogeneous storages
    Automatic garbage collection
HVEM Grid
High Voltage Electron Microscope (HVEM)
  Lets scientists analyze the 3D structure of new materials at micrometer scale
HVEM Grid
  Remote users can perform the same tasks as on-site scientists
    Remote control of the HVEM
    Storing, retrieving and searching data through HVEM DataGrid
    Processing data through HVEM Computational Grid
Designed for biological experiments using HVEM
A logical view of one storage over the DB and the file storage
  The small metadata is stored in the DB
    Information on materials, material handling methods, HVEM experiments, images, experimenters
  The large files are stored in the file storages
    2D or 3D image files, the documents related to HVEM experiments
Internal process to find files
  After finding the logical path of the files in the file storage by searching the DB, users can retrieve the files they want from the file storage
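The two-step lookup described above (search the DB for the logical path, then fetch the file from the file storage) can be sketched as follows. This is only an illustration under stated assumptions: the class, field names and paths are hypothetical, and both storages are in-memory stand-ins rather than the real HVEM DataGrid components.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """Hypothetical metadata row; the field names are illustrative only."""
    image_id: str
    experiment: str
    logical_path: str


class DataGridView:
    """One logical view over a metadata DB and a file storage (in-memory stand-ins)."""

    def __init__(self):
        self.db = {}            # image_id -> ImageRecord (small metadata)
        self.file_storage = {}  # logical_path -> bytes (large files)

    def store(self, record, content):
        self.file_storage[record.logical_path] = content
        self.db[record.image_id] = record

    def find_paths(self, experiment):
        # Step 1: search the DB for the logical paths of the matching files.
        return [r.logical_path for r in self.db.values() if r.experiment == experiment]

    def retrieve(self, experiment):
        # Step 2: fetch each file from the file storage by its logical path.
        return {path: self.file_storage[path] for path in self.find_paths(experiment)}


grid = DataGridView()
grid.store(ImageRecord("img-1", "exp-A", "/hvem/exp-A/img-1.tif"), b"2D image bytes")
grid.store(ImageRecord("img-2", "exp-B", "/hvem/exp-B/img-2.tif"), b"3D image bytes")
files = grid.retrieve("exp-A")
```

The user only calls `retrieve`; the DB search and the file access stay hidden behind the single logical view.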
HVEM DataGrid
HVEM DataGrid
A unified data management
  The storing precedence among data
    When storing all biological information for the images, we should keep the images in HVEM Grid at the same time
  The relational semantics between various data stored in distributed heterogeneous storages
To upload many large files to HVEM DataGrid efficiently and safely
  Upload dependency & serialization
  Ensure the transactions for safe parallel uploads
An efficient and transparent transaction management
Requirements for the transactions on HVEM DataGrid
  Consider the semantics of HVEM DataGrid
    A project is composed of several experiments
    The data for an experiment should be inserted according to its data workflow
    The file and its metadata should be stored to HVEM DataGrid simultaneously. Otherwise, all of them should be deleted
  Support
    the long-lifetime transaction, bounded by the time limit of an experiment or project
    the short-lifetime transaction, which physically stores the data to HVEM DataGrid
  The optimization for the upload of large files to reduce the blocking time should ensure safe transactions
    An asynchronous and parallel upload scheme should protect upload dependency and ensure safe transactions
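The requirement that a file and its metadata are stored together or not at all can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the dictionaries stand in for the DB and the file storage, and the `fail_on` parameter is a hypothetical hook that simulates a failed write.

```python
class StoreError(Exception):
    """Stands in for a failed write to either storage."""


def store_atomically(db, file_storage, key, metadata, content, fail_on=None):
    """Store metadata and file together; if either write fails, delete both.

    fail_on ("db" or "file") is a hypothetical hook used only to simulate a failure.
    """
    try:
        if fail_on == "db":
            raise StoreError("DB write failed")
        db[key] = metadata
        if fail_on == "file":
            raise StoreError("file write failed")
        file_storage[key] = content
    except StoreError:
        # Undo whatever was already written so no half-stored entry remains.
        db.pop(key, None)
        file_storage.pop(key, None)
        return False
    return True


db, file_storage = {}, {}
ok = store_atomically(db, file_storage, "img-1", {"experiment": "exp-A"}, b"image")
failed = store_atomically(db, file_storage, "img-2", {"experiment": "exp-A"}, b"image",
                          fail_on="file")
```

After the second call, neither storage holds an entry for `img-2`, even though its metadata had already been written when the file write failed.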
An efficient and transparent transaction management
Transaction hierarchy
  The transaction units act as checkpoints on incomplete data insertion
    They confine the rollback extent
    When the data for an experiment or a project is not inserted to HVEM DataGrid by its time limit, the experiment or the project should be removed by the rollback of TnE or TnP
  TnS((((1)2)5)2)
    (1) represents the identity of the TnP it belongs to
    The next index '2' indicates the identity of the TnE, and so on
  Transaction levels
    TnP: for a project
    TnE: for an experiment
    TnW: for a group of TnSs
    TnS: for storing data to physical storage
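The nested index can be unpacked mechanically. The sketch below assumes that the four numbers in a label such as TnS((((1)2)5)2) identify, from the innermost outward, the TnP, the TnE, the TnW and the TnS itself; the slide states this only for the first two positions ("and so on"), so the last two are an inference. The function names are hypothetical.

```python
import re

# Assumed order of the nested indices, innermost first.
LEVELS = ("TnP", "TnE", "TnW", "TnS")


def format_tns_index(p, e, w, s):
    """Build a label in the nested form used on the slide, e.g. TnS((((1)2)5)2)."""
    return f"TnS(((({p}){e}){w}){s})"


def parse_tns_index(label):
    """Return the identity at each hierarchy level encoded in a TnS label."""
    ids = [int(n) for n in re.findall(r"\d+", label)]
    # Rebuilding the label verifies the exact nesting, not just the digit count.
    if len(ids) != len(LEVELS) or format_tns_index(*ids) != label:
        raise ValueError(f"not a TnS index: {label!r}")
    return dict(zip(LEVELS, ids))


index = parse_tns_index("TnS((((1)2)5)2)")
```

With the parsed identities, a rollback of a TnE or TnP can select every TnS that shares the same leading indices.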
Support autonomous garbage collection
  It is up to the users to insert or delete data on HVEM DataGrid.
  When they stop inserting experimental data for any reason without deleting the related data, HVEM DataGrid would accumulate a large amount of garbage.
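One way to read the garbage collection described above is as a periodic sweep that rolls back every experiment that is still incomplete after its time limit. The sketch below assumes exactly that; the `Experiment` fields, the numeric deadline and the dictionary storage are hypothetical stand-ins, not the actual HVEM DataGrid data model.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    name: str
    deadline: float                 # time limit of the TnE (hypothetical unit: seconds)
    complete: bool = False
    stored_keys: list = field(default_factory=list)


def collect_garbage(experiments, storage, now):
    """Roll back every incomplete experiment whose time limit has passed."""
    removed = []
    for exp in list(experiments.values()):
        if not exp.complete and now > exp.deadline:
            for key in exp.stored_keys:
                storage.pop(key, None)   # delete the partially inserted data
            del experiments[exp.name]
            removed.append(exp.name)
    return removed


experiments = {
    "exp-A": Experiment("exp-A", deadline=100.0, stored_keys=["a1", "a2"]),
    "exp-B": Experiment("exp-B", deadline=100.0, complete=True, stored_keys=["b1"]),
    "exp-C": Experiment("exp-C", deadline=500.0, stored_keys=["c1"]),
}
storage = {"a1": b"", "a2": b"", "b1": b"", "c1": b""}
removed = collect_garbage(experiments, storage, now=200.0)
```

Only `exp-A` is removed here: `exp-B` is complete and `exp-C` has not yet reached its time limit.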
Parallel Processing
Transaction management scheme
  HVEM DataGrid forks two processes, one connecting to the DB and one to the file storage. When the connections succeed, it takes the next requests, and so on.
  The state change of TnS(((())j)i)
    jSiS -> jSiD (the notification from the DB), jSiF (the notification from the file storage) -> jSiE (both of them arrive): the TnS completes
  On a light failure (LF) due to temporary failures of the network or a server, the transaction is retried a fixed number of times
  When the retries fail, a serious failure (SF) is assumed -> rollback process
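The state changes and the failure handling above can be sketched together. This is a simplified illustration under several assumptions: the state letters S, D, F and E are taken from the suffixes of jSiS, jSiD, jSiF and jSiE; two threads stand in for the two forked processes; and the exception and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class TransientError(Exception):
    """Stands in for a light failure (LF)."""


class SeriousFailure(Exception):
    """Raised when the fixed number of retries is exhausted (SF)."""


def with_retries(action, max_retries):
    # One initial attempt plus max_retries retries.
    for _ in range(max_retries + 1):
        try:
            return action()
        except TransientError:
            continue
    raise SeriousFailure("retries exhausted")


def next_state(state, notification):
    # S: started, D: DB notified, F: file storage notified, E: both arrived.
    transitions = {("S", "db"): "D", ("S", "file"): "F",
                   ("D", "file"): "E", ("F", "db"): "E"}
    return transitions[(state, notification)]


def run_tns(db_store, file_store, rollback, max_retries=3):
    """Run both sub-transactions in parallel; roll back on a serious failure."""
    state, failed = "S", False
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(with_retries, db_store, max_retries): "db",
                   pool.submit(with_retries, file_store, max_retries): "file"}
        for future in as_completed(futures):
            try:
                future.result()
                state = next_state(state, futures[future])
            except SeriousFailure:
                failed = True
    if failed:
        rollback()
        return "rolled back"
    return state


def flaky(fail_times):
    """Hypothetical store action that fails fail_times times before succeeding."""
    calls = {"n": 0}

    def action():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise TransientError("temporary network failure")
    return action


recovered = run_tns(flaky(0), flaky(2), rollback=lambda: None)
rollbacks = []
aborted = run_tns(flaky(0), flaky(10), rollback=lambda: rollbacks.append("TnS"))
```

In the first run the file upload recovers within the retry budget and the TnS reaches state E; in the second the retries are exhausted, so the rollback callback is invoked.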
Evaluation
Analysis
  Transparency
    Through the transaction hierarchy and fine-grained state management, the transaction manager in HVEM DataGrid enables a transparent transaction that uploads the image files to the file storage and stores their metadata in the DB simultaneously.
  Serializability
    Many TnSs are upload-serializable because their state changes are logged through the transaction index.
    To keep the upload dependency, the transaction manager protects the first user entering a TnW.
      If he withdraws the TnW, then another user can initiate the TnW
  Transaction performance
    The transaction scheme supports asynchronism and parallelism
Experiment setting
  Because the sub-transaction time on the DB is negligible compared with that on the file storage due to data size, we only considered the upload time for the image files
  Considering the semantics of the data workflow in HVEM DataGrid
  For an asynchronous file transfer, the request intervals for file transfer are chosen randomly within 50 sec
  The physical locations of the file storages are assumed to be distributed
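The serializability rule in the analysis above (the first user entering a TnW is protected until that user withdraws) behaves like an ownership guard. The sketch below is one possible reading, not the authors' implementation; the class name, method names and identifiers are hypothetical.

```python
import threading


class TnWGuard:
    """Lets only the first user who enters a TnW work in it until that user withdraws."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}  # tnw_id -> user currently holding the TnW

    def enter(self, tnw_id, user):
        with self._lock:
            # setdefault registers the user only if the TnW has no owner yet.
            return self._owners.setdefault(tnw_id, user) == user

    def withdraw(self, tnw_id, user):
        with self._lock:
            if self._owners.get(tnw_id) == user:
                del self._owners[tnw_id]
                return True
            return False


guard = TnWGuard()
first = guard.enter("TnW-5", "alice")     # alice becomes the protected user
second = guard.enter("TnW-5", "bob")      # bob is refused while alice holds it
released = guard.withdraw("TnW-5", "alice")
retry = guard.enter("TnW-5", "bob")       # bob can now initiate the TnW
```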
Overhead
  Log management cost
    The additional cost is for TnP, TnE and TnW; general transaction management already requires the log for TnS
    The log size for TnP, TnE and TnW is smaller than that for TnS because they function as checkpoints rather than real transaction units.
  Rollback cost
    The cascade rollback of the TnSs in a TnW, due to the upload dependency under parallel processing of TnSs
    At an LF, if the retry succeeds, the gain from transaction parallelism can be very large, especially for large file handling
    There are not many SFs or LFs because an e-Science DataGrid is not as heavily used as multimedia storage
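The cascade rollback mentioned above can be sketched under one simplifying assumption: the upload dependency inside a TnW is a linear order, so a serious failure in one TnS invalidates that TnS and every later one. The state names and the function are hypothetical.

```python
def cascade_rollback(states, failed):
    """Roll back the failed TnS and every TnS after it in upload order.

    states maps TnS ids to their state, with the ids inserted in upload order.
    Returns the ids that were rolled back, newest first.
    """
    order = list(states)
    start = order.index(failed)
    rolled_back = []
    for tns_id in reversed(order[start:]):   # undo the newest upload first
        states[tns_id] = "rolled back"
        rolled_back.append(tns_id)
    return rolled_back


# TnS 3 finished in parallel before the failure of TnS 2 was detected.
states = {1: "done", 2: "failed", 3: "done", 4: "uploading"}
undone = cascade_rollback(states, failed=2)
```

TnS 1 is untouched because it does not depend on the failed upload; this is the extent that the TnW checkpoint confines.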
Evaluation
Conclusion
A transaction management on HVEM Grid
  Safety
    Ensure a safe transaction considering the data workflow in HVEM DataGrid
  Efficiency
    Improve the performance to upload large files by asynchronism and parallelism
  Transparency
    Data management across the heterogeneous storages
  Automatic garbage collection
    Reduce garbage