NCSA-NARA investigations of HDF5 in support of EXPRESS-Driven data

Mike Folk
The HDF NARA Project
PDES, Inc. Offsite Meeting
September 24-29, 2006



PDES, Inc. Offsite Sept 2006 2

Acknowledgement

This report is based upon work supported by the National Archives and Records Administration (NARA) through the grant NARA NSF 0202 GPG. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NARA.

Participants
• Mike Folk, Vailin Choi, Elena Pourmal – The HDF Group
• Mark Conrad and Bob Chadduck – NARA
• David Price – EuroSTEP
• Keith Hunten – Lockheed-Martin
• Steve Cooper and Denny Moore – Electric Boat
• Others


1. What is HDF5?


HDF5 is

• A file format for managing any kind of data

• Software system to manage data in the format

• Suited especially to large volume or complex data

• Suited for every size and type of system

• Open file format, open software

Definitions
• “HDF” – Hierarchical Data Format
  • Originated in 1988
  • NCSA at University of Illinois at Urbana-Champaign
• “HDF5”
  • Successor to HDF, introduced in 1998

An HDF5 file is a container…

…into which you can put your data objects.

[Figure: example objects inside the container, including a palette and the table below]

lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6

HDF5 data model
• HDF5 file – container for data objects
• Primary Objects
  • Groups
  • Datasets
• Additional ways to organize data
  • Attributes for metadata
  • Sharable objects
  • Storage and access properties

Everything else is built from these parts.

HDF “groups” for organizing objects in files

[Figure: a file with the groups “/” (root) and “/foo”, holding a palette, raster images, a 2-D array, a 3-D array, and a table with columns lat | lon | temp]

HDF5 “dataset” for holding the data

[Figure: a dataset shown as Data plus Metadata]
• Data
• Metadata
  • Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  • Datatype: IEEE 32-bit float
  • Attributes: time = 32.4, pressure = 987, temp = 56
  • Storage info: chunked, compressed
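As a rough check on what the dataspace and datatype in this example imply, the following sketch computes the element count and uncompressed size of the dataset. It is plain Python and does not use the HDF5 library. The rank, the dimensions, and the 32-bit float type come from the slide; the variable names are illustrative.

```python
import math
import struct

# Values taken from the slide: dimensions 4 x 5 x 7, IEEE 32-bit float elements.
dims = (4, 5, 7)
element_size = struct.calcsize("<f")  # size in bytes of one IEEE 754 single

rank = len(dims)                      # number of dimensions
n_elements = math.prod(dims)          # total number of array elements
raw_bytes = n_elements * element_size # uncompressed size of the raw data

print(rank, n_elements, raw_bytes)
```

Chunking and compression change how these bytes are laid out on disk, not how many elements the dataspace describes.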

Datatypes (array elements)
• Datatype – how to interpret a data element
• Two classes: atomic and compound

Datatypes
• HDF5 atomic types
  • normal integer & float
  • user-definable (e.g. 13-bit integer)
  • fixed length and variable length multiples (e.g. strings)
  • references to objects/dataset regions
  • enumeration - names mapped to integers
  • array
• HDF5 compound types
  • Records with fields – comparable to C structs
  • Members can be atomic or compound types
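The comparison of compound types to C structs can be illustrated without HDF5 itself. The sketch below uses Python's standard `struct` module to lay out the lat/lon/temp rows from the earlier table slide as fixed-size records. The field types (two 32-bit integers and one 32-bit float) and the little-endian, unpadded layout are assumptions of this example; the slides do not specify them.

```python
import struct

# One record per table row: lat (int32), lon (int32), temp (float32).
# "<" means little-endian with no padding; this layout is assumed, not from the slides.
RECORD = struct.Struct("<iif")

rows = [(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)]
packed = b"".join(RECORD.pack(*row) for row in rows)

# Fixed-size records allow direct access by byte offset: read the second row.
lat, lon, temp = RECORD.unpack_from(packed, 1 * RECORD.size)

print(RECORD.size, len(packed), lat, lon, round(temp, 1))
```

Because every record has the same size, the n-th record starts at byte offset `n * RECORD.size`, which is the property that makes arrays of records efficient to subset.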

“Groups”
• A mechanism for collections of related objects
• Every file starts with a root group
• Similar to UNIX directories
• Can have attributes

[Figure: tree rooted at “/” with nodes tom, dick, harry, a, b, c]
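To make the directory analogy concrete, here is a small in-memory model of groups, datasets, and attributes with path-based lookup. It is a sketch in plain Python, not the HDF5 API: the names `Group`, `Dataset`, and `lookup` are hypothetical, and placing `a` under `tom` is an assumption, since the slide does not say which objects belong to which group.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    shape: tuple
    dtype: str
    attrs: dict = field(default_factory=dict)  # metadata attached to the dataset


@dataclass
class Group:
    members: dict = field(default_factory=dict)  # name -> Group or Dataset
    attrs: dict = field(default_factory=dict)    # groups can carry attributes too


def lookup(root, path):
    """Walk an absolute path such as "/tom/a" one component at a time."""
    node = root
    for name in path.strip("/").split("/"):
        if name:  # skip the empty component produced by the path "/"
            node = node.members[name]
    return node


# Hypothetical layout for illustration only.
root = Group(attrs={"title": "example file"})
root.members["tom"] = Group()
root.members["tom"].members["a"] = Dataset(
    shape=(4, 5, 7), dtype="float32", attrs={"temp": 56}
)

found = lookup(root, "/tom/a")
print(found.shape, found.attrs["temp"])
```

The root group plays the role of the file system root: every object is reachable from it by an absolute path.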

Special Storage Options
• chunked: better subsetting access time; extendable
• compressed: improves storage efficiency, transmission speed
• extendable: arrays can be extended in any direction
• split file: metadata in one file, raw data in another

[Figure: dataset “Fred” split across File A and File B, with the metadata for Fred and the data for Fred stored separately]
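One way to see why chunking helps subsetting is to count how many chunks a rectangular selection overlaps. The sketch below does that arithmetic in plain Python; it does not call the HDF5 library. The function name `chunks_touched`, the 100 x 100 chunk shape, and the selection are all hypothetical values chosen for illustration.

```python
from itertools import product


def chunks_touched(start, count, chunk_shape):
    """Return grid coordinates of every chunk overlapped by a selection.

    start, count: per-dimension offset and extent of the rectangular selection.
    chunk_shape:  per-dimension chunk size.
    """
    ranges = []
    for s, c, k in zip(start, count, chunk_shape):
        first = s // k             # chunk holding the first selected index
        last = (s + c - 1) // k    # chunk holding the last selected index
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# A 50 x 50 block starting at row 120, column 260, with 100 x 100 chunks.
touched = chunks_touched(start=(120, 260), count=(50, 50), chunk_shape=(100, 100))
print(touched)
```

Only the overlapped chunks need to be read (and, if compressed, decompressed), regardless of how large the whole array is.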


Mesh Example, in HDFView

HDF5 Software

[Figure: Tools & Applications, HDF I/O Library, HDF File]

Features of library
• Ability to create and access complex data structures
• Fast, flexible I/O
• Data transformation and filtering during I/O
• Flexible API for power users
• Compatibility with common data models
  • Able to represent all common data structures
  • Supports key language models – C, Fortran, Java, etc.

Other info
• Library and tools run almost anywhere
• Other software from THG
  • Java viewer
  • Command-line utilities
• Other software
  • Commercial (IDL, Matlab, Labview, etc.)
  • Community (EOS, ASCI, etc.)
  • Integration with other software (SRB, databases, etc.)

Making HDF useful for your application
• There are many ways to organize and access data in HDF5
• How do we apply these capabilities to a particular domain, such as product data?
• We have to decide how we will organize and access our data in a way that best addresses our needs.
• And create data models, APIs and tools as appropriate to support our applications.
• Or adapt existing data models, APIs and tools as appropriate to support our applications.


Sample uses of HDF

1. NASA Earth Observing System (EOS)

[Figure: EOS satellites and their instruments]
• Terra: CERES, MISR, MODIS, MOPITT
• Aqua (6/01): CERES, MODIS, AMSR
• Aura: TES, HIRDLS, MLS, OMI

2. Advanced Simulation & Computing (ASC)

Question: How do we maintain a nuclear stockpile in the absence of testing?

Answer: Very large simulations

ASC Data requirements
• Large datasets (> a terabyte)
• Fast I/O on massive parallel systems
• Complex data and extensive metadata
• Availability on leading edge systems

3. Bioinformatics

Managing genomic data

[Figure: background of DNA sequence text]

DNA sequencing workflows are complex
• Diverse formats
• Highly redundant data
• Multiple levels of information
• Complex associations
• Repeated file processing
• Non-scalable storage
• Lack of persistence


HDF5 as binary exchange format for bioinformatics


4. Flight test data


Boeing flight test


HDF role in the Software Stack

[Figure: the HDF5 software stack, listed from applications down to storage]
• Apps: simulation, visualization, remote sensing…
  Examples: thermonuclear simulations, product modeling, data mining tools, visualization tools, climate models
• Common application-specific data models, with app-specific API or GUI: BioHDF, SAF, HDF-Packet, HDF-EOS, Matlab (labels: LANL; LLNL, SNL; Grids; COTS; NASA)
• HDF5 data model & API
• HDF5 serial & parallel I/O
• HDF5 virtual file layer (I/O drivers): MPI I/O, Split Files, Stdio, Custom, Stream
• HDF5 format
• Storage: file; file on parallel file system; split metadata and raw data files; user-defined device; across the network or to/from another application or library

2. Why is there interest in HDF5 for product data?

(Courtesy of David Price, EuroSTEP)

Needs
• STEP and related models exist using EXPRESS
• ASCII, XML STEP formats defined, software developed
• But ASCII/XML don’t adapt well for highly voluminous, complex data
  • Finite element analysis
  • Computational fluid dynamics
  • Heterogeneous product data

EuroSTEP project
• VIVACE: “Value Improvement through a Virtual Aeronautical Collaborative Enterprise”
• Deliverable: EXPRESS-driven Large Volume Binary Data Representation

Survey of State of the Art
• Candidates
  • ASN.1 : Abstract Syntax Notation 1
  • HDF5 : Hierarchical Data Format
  • XML/Binary
  • CGNS : CFD General Notation System
  • SDAI implementation by LKSoft
• Found HDF5 most suitable for very large scientific datasets and complex relationships

Goal: Create open-source toolkit mapping EXPRESS to HDF5

[Figure: the same HDF5 software stack as before, with three additions for product data: Product model Applications, STEP data models, and STEP-HDF5]


NARA-sponsored work

NCSA-THG NARA Research
• Investigate the viability of scientific data formats, such as HDF5, for long-term preservation of engineering data in the federal archives

Heterogeneous data aggregation, with HDF5
• Goal: Using NARA’s TWR collection, investigate the possibilities and limitations of using HDF5 as a container for archiving heterogeneous collections of records, with special attention to STEP data.

Activities
• Use files, datatypes, structures in NARA TWR collection – STEP files, photos, schematics, etc.
• Map these to HDF5 objects and structures, exploiting features of HDF5
• Assess benefits and costs in terms of storage efficiency and accessibility
• Investigate use of HDF5 as container for collection

Relationship with EuroSTEP, Electric Boat, et al.
• Working together to develop mappings from EXPRESS to HDF5
• Sharing data for testing
• Periodic meetings to share information and coordinate research
• Some involvement with standardization

Investigating I/O efficiency and size
• Explore different datatypes and storage options for b-spline surface models (later: finite element models)
• Two types of data – b-splines themselves and cartesian points
• Variables
  • Different HDF5 datatypes
  • Dataset compression
  • Use of extra indexes in HDF5 for fast access

Some results
• Small files
  • HDF5 not appreciably better than STEP, sometimes worse
• Large files
  • Compression always made HDF5 files smaller
  • Even without compression, HDF5 storage better
  • Indexing approach also tended to save space
• Lessons
  • HDF5 can provide very efficient storage for cartesian points
  • Choice of data types and data storage is important
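The effect behind these results can be approximated with a toy comparison of text and binary encodings of cartesian points. This is only a sketch under stated assumptions, not the project's test procedure: the points are a made-up regular grid, the text lines imitate STEP clear-text instance lines, the binary form is three 64-bit floats per point, and generic `zlib` compression stands in for a dataset compression filter. Actual sizes depend heavily on the data and on how numbers are formatted.

```python
import struct
import zlib

# Hypothetical sample data: a 40 x 25 grid of points on the plane z = 0.
points = [(float(i), float(j), 0.0) for i in range(40) for j in range(25)]

# Text encoding, one instance line per point (illustrative formatting only).
text = "".join(
    f"#{n}=CARTESIAN_POINT('',({x},{y},{z}));\n"
    for n, (x, y, z) in enumerate(points, start=1)
).encode("ascii")

# Binary encoding: three little-endian 64-bit floats per point, 24 bytes each.
binary = b"".join(struct.pack("<3d", *p) for p in points)

# Generic deflate compression of the binary form.
compressed = zlib.compress(binary, 6)

print(len(text), len(binary), len(compressed))
```

For this regular grid the binary form is smaller than the text form and compresses further; irregular real-world coordinates would compress less well.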


HDF5 as container

HDFView Demo


Thank you

HDF Information
• HDF Information Center
  • http://hdfgroup.org/
• HDF Help email address
  • [email protected]/
• HDF users mailing list
  • [email protected]/