
Post on 15-Jan-2016

NCSA-NARA investigations of HDF5 in support of EXPRESS-Driven data

Mike Folk

The HDF NARA Project

PDES, Inc. Offsite Meeting, September 24-29, 2006

PDES, Inc. Offsite Sept 2006 2

Acknowledgement

This report is based upon work supported by the National Archives and Records Administration (NARA) through the grant NARA NSF 0202 GPG. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NARA.

Participants
• Mike Folk, Vailin Choi, Elena Pourmal – The HDF Group
• Mark Conrad and Bob Chadduck – NARA
• David Price – EuroSTEP
• Keith Hunten – Lockheed-Martin
• Steve Cooper and Denny Moore – Electric Boat
• Others

1. What is HDF5?


HDF5 is

• A file format for managing any kind of data

• Software system to manage data in the format

• Suited especially to large volume or complex data

• Suited for every size and type of system

• Open file format, open software

Definitions
• “HDF” – Hierarchical Data Format
  • Originated in 1988
  • NCSA at University of Illinois at Urbana-Champaign
• “HDF5”
  • Successor to HDF, introduced in 1998

An HDF5 file is a container…

…into which you can put your data objects.

[Figure: a container holding sample data objects, including two palettes and the table below.]

lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6

HDF5 data model
• HDF5 file – container for data objects
• Primary Objects
  • Groups
  • Datasets
• Additional ways to organize data
  • Attributes for metadata
  • Sharable objects
  • Storage and access properties

Everything else is built from these parts.

HDF “groups” for organizing objects in files

[Figure: a file hierarchy with the groups “/” (root) and “/foo”, holding a palette, raster images, a 3-D array, a 2-D array, and a table with columns lat, lon, and temp.]

HDF5 “dataset” for holding the data

• Data
• Metadata
  • Dataspace
    • Rank: 3
    • Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  • Datatype: IEEE 32-bit float
  • Attributes: time = 32.4, pressure = 987, temp = 56
  • Storage info: chunked, compressed
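As a rough illustration of how the parts of a dataset relate, the following Python sketch records the values from the slide in a plain data structure and derives the rank, element count, and raw size from them. It does not use the HDF5 library; the class name `DatasetDescription` and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class DatasetDescription:
    """Hypothetical stand-in for the metadata that accompanies a dataset."""
    dims: tuple                 # dataspace dimensions
    datatype: str               # human-readable datatype name
    element_size_bytes: int     # size of one array element
    attributes: dict = field(default_factory=dict)
    storage: tuple = ()         # storage info such as "chunked"

    @property
    def rank(self) -> int:
        # Rank is the number of dimensions in the dataspace.
        return len(self.dims)

    @property
    def element_count(self) -> int:
        return prod(self.dims)

    @property
    def raw_size_bytes(self) -> int:
        # Size of the uncompressed array data, excluding any metadata.
        return self.element_count * self.element_size_bytes


# Values taken from the slide: a 4 x 5 x 7 array of IEEE 32-bit floats.
example = DatasetDescription(
    dims=(4, 5, 7),
    datatype="IEEE 32-bit float",
    element_size_bytes=4,
    attributes={"time": 32.4, "pressure": 987, "temp": 56},
    storage=("chunked", "compressed"),
)

print(example.rank, example.element_count, example.raw_size_bytes)
```

The dataspace and datatype together determine how many bytes of raw data there are (here 4 × 5 × 7 elements of 4 bytes each), while attributes and storage info describe the data without changing its logical shape.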

Datatypes (array elements)
• Datatype – how to interpret a data element
• Two classes: atomic and compound

Datatypes
• HDF5 atomic types
  • normal integer & float
  • user-definable (e.g. 13-bit integer)
  • fixed length and variable length multiples (e.g. strings)
  • references to objects/dataset regions
  • enumeration - names mapped to integers
  • array
• HDF5 compound types
  • Records with fields – comparable to C structs
  • Members can be atomic or compound types
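To make the comparison with C structs concrete, here is a minimal sketch using Python's standard `struct` module rather than the HDF5 library. The record layout (two 32-bit integers and one 32-bit float, little-endian, unpadded) is an assumption chosen for illustration; the sample rows are the lat/lon/temp table shown earlier in the deck.

```python
import struct

# Hypothetical record layout: two 32-bit integers and one 32-bit float,
# little-endian, with no padding. Comparable to the C struct
#   struct { int32_t lat; int32_t lon; float temp; };
RECORD = struct.Struct("<iif")

rows = [(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)]

# Pack every record back to back, as elements of a one-dimensional array.
packed = b"".join(RECORD.pack(*row) for row in rows)

# Read the records back by stepping through the buffer one record at a time.
unpacked = [RECORD.unpack_from(packed, i * RECORD.size) for i in range(len(rows))]

print(RECORD.size, len(packed))
print([record[:2] for record in unpacked])
```

Each array element is one fixed-size record, so a field of a given row can be located by arithmetic on the record size. The `temp` values come back as 32-bit approximations of the decimal inputs.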

“Groups”
• A mechanism for collections of related objects
• Every file starts with a root group
• Similar to UNIX directories
• Can have attributes

[Figure: a root group “/” with members tom, dick, and harry, and members a, b, and c at the next level.]
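The directory analogy can be sketched with nested Python dictionaries standing in for groups. This is not the HDF5 API: the helper names `lookup` and `walk` are hypothetical, and the assignment of a, b, and c to particular groups is an assumption, since the extracted figure does not preserve it.

```python
# Hypothetical layout: the slide's figure names tom, dick, harry, a, b, c,
# but which group holds a, b, and c is an assumption made here.
root = {
    "tom": {"a": "dataset a", "b": "dataset b"},
    "dick": {"c": "dataset c"},
    "harry": {},
}


def lookup(group: dict, path: str):
    """Resolve a UNIX-style absolute path against nested groups."""
    node = group
    for name in path.strip("/").split("/"):
        if name:  # an empty name means the path was just "/"
            node = node[name]
    return node


def walk(group: dict, prefix: str = "/"):
    """Yield the absolute path of every member, depth first."""
    for name, member in group.items():
        path = prefix + name
        yield path
        if isinstance(member, dict):
            yield from walk(member, path + "/")


print(lookup(root, "/tom/a"))
print(list(walk(root)))
```

Every object is reachable by a path that starts at the root group, which is what makes the structure feel like a file system inside a single file.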

Special Storage Options
• chunked: Better subsetting access time; extendable
• compressed: Improves storage efficiency, transmission speed
• extendable: Arrays can be extended in any direction
• Split file: Metadata in one file, raw data in another.

[Figure: Dataset “Fred” spread over File A and File B, with the metadata for Fred kept separate from the data for Fred.]
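One way to see why chunking can help subsetting is to count how many chunks a rectangular selection overlaps. The sketch below is plain Python arithmetic, not the HDF5 library; the function name `chunks_touched`, the array size, and the chunk shape are all hypothetical.

```python
from itertools import product


def chunks_touched(start, count, chunk_shape):
    """Return the grid coordinates of every chunk a rectangular selection overlaps.

    start, count, and chunk_shape are per-dimension tuples; the selection covers
    indices start[d] .. start[d] + count[d] - 1 in dimension d.
    """
    ranges = []
    for s, n, c in zip(start, count, chunk_shape):
        first = s // c              # chunk holding the first selected index
        last = (s + n - 1) // c     # chunk holding the last selected index
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# Hypothetical 100 x 100 array stored in 10 x 10 chunks (100 chunks in total).
# Select a 5 x 25 block starting at row 12, column 30.
touched = chunks_touched(start=(12, 30), count=(5, 25), chunk_shape=(10, 10))
print(touched)
```

In this example only three of the 100 chunks overlap the selection, so only those chunks would need to be read and, if compressed, decompressed.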


Mesh Example, in HDFView

HDF5 Software

[Figure: diagram relating Tools & Applications, the HDF I/O Library, and the HDF File.]

Features of library
• Ability to create and access complex data structures
• Fast, flexible I/O
• Data transformation and filtering during I/O
• Flexible API for power users
• Compatibility with common data models
  • Able to represent all common data structures
  • Supports key language models – C, Fortran, Java, etc.

Other info
• Library and tools run almost anywhere
• Other software from THG
  • Java viewer
  • Command-line utilities
• Other software
  • Commercial (IDL, Matlab, Labview, etc.)
  • Community (EOS, ASCI, etc.)
  • Integration with other software (SRB, databases, etc.)

Making HDF useful for your application
• There are many ways to organize and access data in HDF5
• How do we apply these capabilities to a particular domain, such as product data?
• We have to decide how we will organize and access our data in a way that best addresses our needs.
• And create data models, APIs and tools as appropriate to support our applications.
• Or adapt existing data models, APIs and tools as appropriate to support our applications.

Sample uses of HDF

1. NASA Earth Observing System (EOS)

[Figure: EOS satellites and their instruments.]
• Aura: TES, HRDLS, MLS, OMI
• Terra: CERES, MISR, MODIS, MOPITT
• Aqua (6/01): CERES, MODIS, AMSR

2. Advanced Simulation & Computing (ASC)

Question: How do we maintain a nuclear stockpile in the absence of testing?

Answer: Very large simulations

ASC Data requirements
• Large datasets (> a terabyte)
• Fast I/O on massive parallel systems
• Complex data and extensive metadata
• Availability on leading edge systems

3. Bioinformatics -- Managing genomic data

caacaagccaaaactcgtacaa
Cgagatatctcttggaaaaact
gctcacaatattgacgtacaag
gttgttcatgaaactttcggta
Acaatcgttgacattgcgacct
aatacagcccagcaagcagaat

DNA sequencing workflows are complex
• Diverse formats
• Highly redundant data
• Multiple levels of information
• Complex associations
• Repeated file processing
• Non-scalable storage
• Lack of persistence


HDF5 as binary exchange format for bioinformatics

4. Flight test data


Boeing flight test

HDF role in the Software Stack

[Figure: the HDF5 software stack, from applications down to storage.]
• Apps: simulation, visualization, remote sensing…
  • Examples: Thermonuclear simulations, Product modeling, Data mining tools, Visualization tools, Climate models
• Common application-specific data models, each with an app-specific API or GUI
  • BioHDF, SAF, HDF-Packet, HDF-EOS, Matlab
  • Labels shown alongside: LANL; LLNL, SNL; Grids; COTS; NASA
• HDF5 data model & API
• HDF5 serial & parallel I/O
• HDF5 virtual file layer (I/O drivers)
  • MPI I/O, Split Files, Stdio, Custom, Stream
• HDF5 format
• Storage
  • File
  • File on parallel file system
  • Split metadata and raw data files
  • User-defined device
  • Across the network or to/from another application or library
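The idea behind a virtual file layer, in which the same library code can target different kinds of storage, can be sketched as follows. This is a toy model in plain Python, not the HDF5 driver interface; `Driver`, `SingleBufferDriver`, `SplitDriver`, and `save_dataset` are hypothetical names, and in-memory buffers stand in for files.

```python
from typing import Protocol


class Driver(Protocol):
    """Hypothetical driver interface: where bytes end up is the driver's business."""
    def write(self, kind: str, payload: bytes) -> None: ...
    def total_bytes(self) -> int: ...


class SingleBufferDriver:
    """Keeps metadata and raw data together, like one ordinary file."""
    def __init__(self) -> None:
        self.buffer = bytearray()

    def write(self, kind: str, payload: bytes) -> None:
        self.buffer.extend(payload)

    def total_bytes(self) -> int:
        return len(self.buffer)


class SplitDriver:
    """Routes metadata and raw data to separate buffers, like split files."""
    def __init__(self) -> None:
        self.meta = bytearray()
        self.raw = bytearray()

    def write(self, kind: str, payload: bytes) -> None:
        target = self.meta if kind == "meta" else self.raw
        target.extend(payload)

    def total_bytes(self) -> int:
        return len(self.meta) + len(self.raw)


def save_dataset(driver: Driver, header: bytes, values: bytes) -> None:
    """Library-level code is written once and never names a concrete driver."""
    driver.write("meta", header)
    driver.write("raw", values)


header = b"rank=3;dims=4,5,7"   # made-up metadata encoding
values = bytes(560)             # 140 zero-valued 4-byte elements
single, split = SingleBufferDriver(), SplitDriver()
save_dataset(single, header, values)
save_dataset(split, header, values)
print(single.total_bytes(), split.total_bytes(), len(split.meta), len(split.raw))
```

Both drivers store the same bytes; only the placement differs. That separation is what lets layers above the driver stay unchanged when the storage target changes.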

2. Why is there interest in HDF5 for product data? (Courtesy of David Price, EuroSTEP)

Needs
• STEP and related models exist using EXPRESS
• ASCII, XML STEP formats defined, software developed
• But ASCII/XML don’t adapt well for highly voluminous, complex data
  • Finite element analysis
  • Computational fluid dynamics
  • Heterogeneous product data

EuroSTEP project
• VIVACE: “Value Improvement through a Virtual Aeronautical Collaborative Enterprise”
• Deliverable: EXPRESS-driven Large Volume Binary Data Representation

Survey of State of the Art
• Candidates
  • ASN.1 : Abstract Syntax Notation 1
  • HDF5 : Hierarchical Data Format
  • XML/Binary
  • CGNS : CFD General Notation System
  • SDAI implementation by LKSoft
• Found HDF5 most suitable for very large scientific datasets and complex relationships

Goal: Create open-source toolkit mapping EXPRESS to HDF5

[Figure: the HDF5 software stack with product data added. New elements on this slide: Product model Applications, STEP data models, and STEP-HDF5.]
• Apps: simulation, visualization, remote sensing…
  • Examples: Thermonuclear simulations, Product modeling, Data mining tools, Visualization tools, Climate models
• Product model Applications
  • Examples: Thermonuclear simulations, Product modeling, Data mining tools, Visualization tools
• Common application-specific data models, each with appl-specific APIs
  • BioHDF, SAF, HDF-Packet, HDF-EOS, Matlab
  • STEP data models
  • STEP-HDF5
  • Labels shown alongside: LANL; LLNL, SNL; Grids; COTS; NASA
• HDF5 data model & API
• HDF5 serial & parallel I/O
• HDF5 virtual file layer (I/O drivers)
  • MPI I/O, Split Files, Stdio, Custom, Stream
• HDF5 format
• Storage
  • File
  • File on parallel file system
  • Split metadata and raw data files
  • User-defined device
  • Across the network or to/from another application or library

NARA-sponsored work

NCSA-THG NARA Research
• Investigate the viability of scientific data formats, such as HDF5, for long-term preservation of engineering data in the federal archives

Heterogeneous data aggregation, with HDF5
• Goal: Using NARA’s TWR collection, investigate the possibilities and limitations of using HDF5 as a container for archiving heterogeneous collections of records, with special attention to STEP data.

Activities
• Use files, datatypes, structures in NARA TWR collection – STEP files, photos, schematics, etc.
• Map these to HDF5 objects and structures, exploiting features of HDF5
• Assess benefits and costs in terms of storage efficiency and accessibility
• Investigate use of HDF5 as container for collection

Relationship with EuroSTEP, Electric Boat, et al.
• Working together to develop mappings from EXPRESS to HDF5
• Sharing data for testing
• Periodic meetings to share information and coordinate research
• Some involvement with standardization

Investigating I/O efficiency and size
• Explore different datatypes and storage options for b-spline surface models (later: finite element models)
• Two types of data – b-splines themselves and cartesian points
• Variables
  • Different HDF5 datatypes
  • Dataset compression
  • Use of extra indexes in HDF5 for fast access

Some results
• Small files
  • HDF5 not appreciably better than STEP, sometimes worse
• Large files
  • Compression always made HDF5 files smaller
  • Even without compression, HDF5 storage better
  • Indexing approach also tended to save space
• Lessons
  • HDF5 can provide very efficient storage for cartesian points
  • Choice of data types and data storage is important
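The general reason binary storage can be compact for cartesian points is easy to demonstrate with the Python standard library. The sketch below does not reproduce the project's measurements and uses neither a STEP toolkit nor HDF5: the grid of points is made-up test data, the text records are only loosely modeled on a STEP exchange file, and zlib stands in for a compression filter.

```python
import struct
import zlib

# Hypothetical test data: 1000 points on a regular 10 x 10 x 10 grid.
points = [(0.5 * i, 0.5 * j, 0.5 * k)
          for i in range(10) for j in range(10) for k in range(10)]

# Text form, one record per line, loosely modeled on a STEP exchange file.
text = "".join(
    f"#{n}=CARTESIAN_POINT('',({x},{y},{z}));\n"
    for n, (x, y, z) in enumerate(points, start=1)
).encode("ascii")

# Binary form: three 64-bit floats per point, packed back to back.
binary = b"".join(struct.pack("<3d", *p) for p in points)

# Compressed binary form, using zlib as a stand-in for a compression filter.
compressed = zlib.compress(binary)

print(len(text), len(binary), len(compressed))
```

The binary form takes a fixed 24 bytes per point, less than the text records, and on this very regular grid the compressed form is smaller still. Real models with irregular coordinates would compress less well, so the numbers printed here say nothing about the sizes observed in the study.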

HDF5 as container

HDFView Demo


Thank you

HDF Information
• HDF Information Center
  • http://hdfgroup.org/
• HDF Help email address
  • help@hdfgroup.org
• HDF users mailing list
  • hdfnews@hdfgroup.org
