July 13, 2012 BOSC 2012
Using HDF5 To Work With Large Quantities of Rich Biological Data
Dana Robinson (derobins@hdfgroup.org)
The HDF Group
Today's Goal
You should walk away from this talk with a basic understanding of the HDF5 technology stack.
Where is HDF5 used?
What is HDF5?
HDF5 is a highly scalable way to organize and store heterogeneous, multidimensional data of user-defined types.
HDF5 also allows data relationships and context to be stored using annotation and linking.
HDF5
The HDF5 technology suite includes:
• A structured binary file format
• An abstract data model for describing your data
• A data access library, written in C (with bindings for C++, Fortran 95/2003, and Java)
HDF5 has characteristics of …
• PDF: standard exchange format, heterogeneous information
• XML: self-describing, extensible types, rich metadata
• Binary flat file: high-performance
• Databases: subsetting, random access
• Directories and files: hierarchical, collections of related information
Advantages of HDF5
• Platform and architecture-independent
• Scalable in space and time
  • File size only limited by OS and filesystem
  • Data access time (esp. parallel) scales well
• Flexible (user-defined types and organization)
• Files are self-describing
Advantages of HDF5 (2)
• High-performance
• Parallel I/O via MPI-IO
• Supports compression and other filters
• Open source (BSD license)
• The HDF Group (THG) is committed to providing long-term support
HDF5 Data Objects
• Groups
• Datasets
• Datatypes
• Metadata (Attributes)
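The four object kinds listed above can be sketched with h5py, one of the Python bindings mentioned later in this talk. This is a minimal illustration rather than a recommended layout: the file name, group name, dataset name, and attribute values are all made up, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store); the file name is a placeholder.
with h5py.File("objects_demo.h5", "w", driver="core", backing_store=False) as f:
    # Group: a container for other objects, similar to a directory.
    run = f.create_group("run_001")

    # Dataset: a multidimensional array whose elements share one datatype.
    intensities = run.create_dataset(
        "intensities", data=np.arange(12, dtype=np.float32).reshape(3, 4)
    )

    # Attributes: small pieces of metadata attached to a group or dataset.
    intensities.attrs["units"] = "counts"
    run.attrs["sample_name"] = "hypothetical sample A"

    # The shape and datatype are stored in the file alongside the data,
    # so they can be read back without any outside description.
    dataset_shape = intensities.shape
    dataset_dtype = intensities.dtype
    units = intensities.attrs["units"]
    group_members = list(run.keys())
```

Reading `dataset_shape`, `dataset_dtype`, and `units` back from the file is what "self-describing" means in practice.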
Example: LCMS Data
[Diagram of an HDF5 file holding LCMS data. Labels: MS spectra, MS/MS spectra, protein IDs, protein ID type, chromatography parameters, MS parameters, MS/MS parameters, sample name]
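One way the LCMS items in this example might map onto HDF5 objects is sketched below with h5py. The layout is an assumption made for illustration, not the layout from the slide: spectra become datasets, parameters and the sample name become attributes, and the "protein ID type" becomes a compound datatype. Every name, array size, and value here is hypothetical.

```python
import h5py
import numpy as np

# Hypothetical user-defined (compound) type for one protein identification.
protein_id_type = np.dtype([("accession", "S16"), ("score", "<f4")])

# Made-up records, present only so there is something to store.
records = np.array([(b"PROT_A", 0.5), (b"PROT_B", 0.25)], dtype=protein_id_type)

with h5py.File("lcms_demo.h5", "w", driver="core", backing_store=False) as f:
    # File-level metadata stored as attributes on the root group.
    f.attrs["sample_name"] = "placeholder sample"
    f.attrs["chromatography_gradient_min"] = 60.0

    # MS spectra: one row per spectrum, one column per m/z bin (sizes invented).
    ms = f.create_group("ms")
    ms.attrs["scan_rate_hz"] = 2.0
    ms.create_dataset("spectra", shape=(10, 2000), dtype="f4")

    # MS/MS spectra kept in a sibling group with their own parameters.
    msms = f.create_group("msms")
    msms.attrs["collision_energy_ev"] = 35.0
    msms.create_dataset("spectra", shape=(4, 2000), dtype="f4")

    # Protein IDs stored as a 1-D dataset of compound records.
    proteins = f.create_dataset("protein_ids", data=records)

    layout = sorted(f.keys())
    first_accession = proteins[0]["accession"]
    first_score = float(proteins[0]["score"])
```

Keeping related spectra and their parameters in the same group is a design choice; other arrangements are equally valid in HDF5.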
HDF5 Data Access
Unlike many data storage systems, HDF5 has no built-in query engine or indexes.
You will have to write your own data access code, usually using the HDF5 API.
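As a rough idea of what "your own data access code" can look like, the sketch below uses h5py to walk a file and to run a hand-written range query over one dataset. The helper names `list_datasets` and `rows_in_range`, the dataset paths, and the values are all hypothetical; the query is a plain linear scan in memory, since HDF5 provides no index to consult.

```python
import h5py
import numpy as np

def list_datasets(h5file):
    """Walk the hierarchy and return the paths of all datasets, sorted."""
    paths = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            paths.append(name)

    h5file.visititems(visit)
    return sorted(paths)

def rows_in_range(dataset, low, high):
    """Hand-written 'query': return indices whose value lies in [low, high]."""
    values = dataset[...]  # reads the whole dataset into memory
    return np.nonzero((values >= low) & (values <= high))[0]

with h5py.File("access_demo.h5", "w", driver="core", backing_store=False) as f:
    # Intermediate groups ("scans") are created automatically by h5py.
    f.create_dataset("scans/precursor_mz", data=np.array([350.2, 512.7, 780.1, 505.0]))
    f.create_dataset("scans/charge", data=np.array([2, 3, 2, 1]))

    found = list_datasets(f)
    hits = rows_in_range(f["scans/precursor_mz"], 500.0, 600.0)
```

For datasets too large to read at once, the same scan would have to be done block by block, which is exactly the kind of code the application author ends up owning.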
Dataspaces
HDF5 has a rich set of data subsetting functionality.
Example: displaying a thumbnail of a high-resolution image.
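The thumbnail case can be sketched with h5py, where slicing a dataset selects only part of it to read. The image here is synthetic and the sizes and stride are arbitrary choices for the example.

```python
import h5py
import numpy as np

# Synthetic 512 x 512 "image"; each pixel value encodes its own position.
image = np.arange(512 * 512, dtype=np.uint32).reshape(512, 512)

with h5py.File("subset_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("image", data=image)

    # Strided selection: every 64th pixel in each dimension gives an 8 x 8 thumbnail.
    thumbnail = dset[::64, ::64]

    # Rectangular selection: a 10 x 20 window from the full-resolution image.
    window = dset[100:110, 200:220]
```

In both cases the result is an ordinary NumPy array holding only the selected elements, not the full image.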
Filters and Compression
Note that HDF5 data objects are filtered individually, not the entire file!
HDF5 supports data filters, including compression, which transform data as it enters or leaves the file.
[Diagram: compressed data in the file ↔ compression filter ↔ uncompressed data in user's buffer]
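Per-object filtering can be sketched with h5py by storing the same array twice in one file, once unfiltered and once with the gzip filter. The dataset names, chunk shape, and compression level are arbitrary choices for this example, and the all-zero array is deliberately easy to compress.

```python
import h5py
import numpy as np

data = np.zeros((1000, 1000), dtype=np.int32)  # highly compressible test data

with h5py.File("filter_demo.h5", "w", driver="core", backing_store=False) as f:
    # No filter: the dataset is stored as-is.
    raw = f.create_dataset("raw", data=data)

    # gzip filter applied to this dataset only; filtered data is stored in chunks.
    packed = f.create_dataset(
        "packed",
        data=data,
        chunks=(100, 100),
        compression="gzip",
        compression_opts=4,
    )

    raw_filter = raw.compression        # None: no compression filter on this dataset
    packed_filter = packed.compression  # "gzip"

    # Bytes actually allocated in the file for each dataset.
    raw_bytes = raw.id.get_storage_size()
    packed_bytes = packed.id.get_storage_size()

    # Reading decompresses transparently into the caller's buffer.
    roundtrip_ok = np.array_equal(packed[...], data)
```

The two datasets sit side by side in the same file with different filter settings, and the reading code is identical for both.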
Higher Language Bindings
C++, Fortran (95 & 2003), Java, .NET, Python
• C++ & Fortran distributed with library
• Java distributed separately
• .NET distributed separately, not supported by THG (as-is)
• Python (PyTables, h5py) not distributed by THG
NOTE: HDF5 bindings are thin wrappers over the C API.
• There is no object-oriented interface to HDF5
• Not pure Java, .NET, etc.
Questions?
Helpful links
THG: www.hdfgroup.org
Downloads: www.hdfgroup.org/HDF5/release/obtain5.html
Documentation: www.hdfgroup.org/HDF5/doc/index.html
Bioinformatics: www.hdfgroup.org/projects/bioinformatics/
Tutorials: www.hdfgroup.org/HDF5/Tutor/index.html
Contact/help desk: www.hdfgroup.org/about/contact.html