
Unidata's Common Data Model and THREDDS Data Server

John Caron and Ben Domenico

Draft last modified: December 23, 2006

<< This is Ben's attempt to put John's January 2006 ESIP Federation presentation into HTML>>

Overview

This document describes the Unidata Common Data Model (CDM). It begins with a set of definitions and descriptions of the underlying netCDF, HDF, and OPeNDAP technologies. Those descriptions are followed by an outline of the CDM as an effort to fuse the best characteristics of the existing data models into a single model that is more powerful than each of the others but maintains the fundamental simplicity and ease of use of the original netCDF. Subsequent sections describe the fundamental scientific data types, the coordinate system layer, and data access methods. This is followed by a description of the netCDF Markup Language (ncML), a dialect of XML that can be used to represent and augment netCDF metadata. It is also the mechanism for netCDF dataset aggregation. Status updates and examples are provided where appropriate.

NetCDF-3

NetCDF-3 is a machine and operating system independent file format for “self-describing” scientific data. It provides for efficient subsetting of multidimensional arrays and is in very widespread use, with more than 20,000 downloads last year.

netCDF is available in two forms:

a C library with Fortran, C++, Perl, IDL, MatLab, Python, and Ruby interfaces
an independent Java library, which serves as the prototyping area for new functionality

HDF-5

HDF-5 is also a machine and OS independent file format for “self-describing” scientific data. It evolved from HDF-4 but has some fundamental differences. HDF-5 is available as a C library with Fortran, Java, and PyTables interfaces. A few of the special features of HDF are parallel I/O, chunked storage, compression filters, and many data types. It was originally developed at NCSA, but is now independently developed and maintained.

Several specialized forms of HDF are in use, e.g., HDF-EOS, HDF5-EOS, which are standard formats for EOSDIS, ASCI, and NPOESS.

NetCDF-4

NetCDF-4 is the result of a project funded by NASA to create a new version of netCDF using the HDF5 file format. The goal has been to “extend and merge” netCDF and HDF5, so as to take advantage of the widespread use and simplicity of netCDF while maintaining the generality and performance of HDF5.

NetCDF-Java 2.2 (nj22)

NetCDF-Java 2.2 is a 100% Java library which includes a prototype implementation of the CDM. This netCDF API supports access to several file formats:

General: NetCDF, HDF5, OPeNDAP
Grids: GRIB1, GRIB2
Radar: NEXRAD, NIDS, DORADE
Satellite: DMSP, GINI

It also provides access to THREDDS catalogs.

OPeNDAP

OPeNDAP is a client-server protocol for scientific data access. The package includes:

a C++ client and server
Java client and server libraries

The current version, 2.0, is a NASA ESE standard. The OPeNDAP team are working on a new 4.0 protocol spec.

THREDDS

THREDDS development was originally funded by the NSF National Science Digital Library initiative to facilitate “discovery and use of scientific data.” It provides middleware between data providers and users. The middleware is built around Dataset Inventory Catalogs, which are implemented in XML. Within the Unidata community, THREDDS is seen as a data "pull" alternative to the well-established subscription-based "push" technology of the Unidata Internet Data Distribution (IDD) system, which uses the Local Data Manager (LDM). The THREDDS project is now supported partially (and somewhat tenuously) by Unidata core funding.

What's a Data Model?

In a philosophical sense, a data model is a way of thinking about scientific data. It’s an abstraction. Some of these data model abstractions have been incorporated into systems for storing and accessing scientific data. Where the data models differ significantly, it can be challenging to make the data systems interoperate with one another, which in turn can stifle interdisciplinary research by hindering integrated analysis and viewing of multiple datasets from different domains. In computer terms, a data model can be thought of as equivalent to an abstract object model in Object Oriented Programming, in that an Abstract Data Model describes data objects and what methods you can use on them.

What Forms Do Data Models Take?

An abstract data model can be instantiated in several forms, for example:

An API is the interface to the Data Model for a specific programming language
A file format is a way to persist the objects in the Data Model
A data access protocol plays the role of a file format

The Abstract Data Model, on the other hand, removes the details of any particular API and the persistence format in which the datasets are actually stored.

The Objective

A central goal of the CDM effort is to create a Common Data Access Model from NetCDF, HDF5, and OPeNDAP.

Existing Data Models <<John needs to check the text descriptions of these data models.>>

NetCDF-3

The netCDF-3 data model shown in the Unified Modeling Language (UML) diagram below is fairly simple. A dataset has dimensions, variables, and attributes. Attributes can be global or apply to individual variables. There is a very limited set of low level data types.

netCDF-3 Data Model UML Diagram
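The dataset, dimension, variable, and attribute relationships described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class names (Nc3Dataset, Nc3Dimension, Nc3Variable) are hypothetical and are not the netCDF-Java API, and attribute values are simplified to strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative classes only; these are NOT the netCDF-Java API.

/** A named axis length that variables can share. */
class Nc3Dimension {
    final String name;
    final int length;

    Nc3Dimension(String name, int length) {
        this.name = name;
        this.length = length;
    }
}

/** A typed multidimensional array with its own attributes. */
class Nc3Variable {
    final String name;
    final String dataType; // one of a small set of primitive types, e.g. "float"
    final List<Nc3Dimension> dimensions = new ArrayList<Nc3Dimension>();
    final Map<String, String> attributes = new LinkedHashMap<String, String>();

    Nc3Variable(String name, String dataType) {
        this.name = name;
        this.dataType = dataType;
    }

    /** Number of elements: the product of the dimension lengths. */
    long size() {
        long n = 1;
        for (Nc3Dimension d : dimensions) {
            n *= d.length;
        }
        return n;
    }
}

/** A dataset holds dimensions, variables, and global attributes. */
class Nc3Dataset {
    final Map<String, Nc3Dimension> dimensions = new LinkedHashMap<String, Nc3Dimension>();
    final Map<String, Nc3Variable> variables = new LinkedHashMap<String, Nc3Variable>();
    final Map<String, String> globalAttributes = new LinkedHashMap<String, String>();

    Nc3Dimension addDimension(String name, int length) {
        Nc3Dimension d = new Nc3Dimension(name, length);
        dimensions.put(name, d);
        return d;
    }

    /** Variables refer to dimensions already defined in the dataset. */
    Nc3Variable addVariable(String name, String dataType, String... dimNames) {
        Nc3Variable v = new Nc3Variable(name, dataType);
        for (String dimName : dimNames) {
            Nc3Dimension d = dimensions.get(dimName);
            if (d == null) {
                throw new IllegalArgumentException("Unknown dimension: " + dimName);
            }
            v.dimensions.add(d);
        }
        variables.put(name, v);
        return v;
    }
}
```

A variable declared over time, lat, and lon dimensions then reports a size equal to the product of the three lengths, and its "units" attribute lives on the variable rather than on the dataset.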

OPeNDAP

The OPeNDAP data model has many things in common with netCDF, but it has a richer set of low level data types and includes structures, sequences, and grids.

OPeNDAP (DAP-2) Data Model UML Diagram

HDF-5

HDF-5 has a much richer set of low level data types and includes the key feature of a group of variables. As with OPeNDAP, HDF-5 includes structures.

HDF-5 Data Model UML Diagram

Common Data (Access) Model

At the data access level, the CDM maintains as much as possible of the elegance of the netCDF-3 interface, but adds important features from OPeNDAP and HDF, most notably:

more low level data types -- including "string"
structures
groups

Common Data Model (data access layer) UML Diagram

Creating a Common Data Model for netCDF, OPeNDAP, HDF

As noted at the outset, the CDM is an effort to fuse the best characteristics of the existing data models into a single model that is more powerful than each of the others but maintains the fundamental simplicity and ease of use of the original netCDF. The resulting CDM consists of several layers. The top layer provides interfaces to a set of scientific data types. The middle layer provides access to coordinate system information. At the bottom lies the actual data access layer.

Common Data Model Layers

Coordinate Systems Layer

The netCDF, OPeNDAP, and HDF data models do not have integrated coordinate systems, so georeferencing is not a part of the API. As a consequence, the coordinate system information is inferred. In the best case, the files conform to a set of established conventions (e.g., CF-1, COARDS). << Need help from John here.>> In contrast, GRIB, HDF-EOS, other specialized formats. However, in the CDM, the coordinate system information must be handled in a general way. The approach is shown in the following diagram.

CDM Coordinate System UML Diagram

Scientific Data Types

The top layer of the CDM carries the semantics. In its initial form, it is based on dataset types familiar to the Unidata community. In concept, it is designed to scale to large, multifile collections and will eventually support "specialized queries," but at present the APIs are still evolving.

<< Need John's notes for the following>> How are data points connected? Space, Time Corresponding “standard” NetCDF file conventions

Point Data

Point Data Examples

PointObsDataSet Methods

// returns a Collection of StructureData
Collection getData(LatLonRect boundingBox, Date start, Date end);
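A query of this shape, a bounding box plus a time window, amounts to filtering a collection of observations. The following is a minimal sketch of that idea, not the nj22 implementation: LatLonBox, PointObservation, and PointObsStore are hypothetical stand-ins for LatLonRect, StructureData, and the dataset class, the time window is assumed to be inclusive at both ends, and the box test ignores boxes that cross the date line.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical stand-in for LatLonRect; does not handle date-line crossing.
class LatLonBox {
    final double latMin, latMax, lonMin, lonMax;

    LatLonBox(double latMin, double latMax, double lonMin, double lonMax) {
        this.latMin = latMin;
        this.latMax = latMax;
        this.lonMin = lonMin;
        this.lonMax = lonMax;
    }

    boolean contains(double lat, double lon) {
        return lat >= latMin && lat <= latMax && lon >= lonMin && lon <= lonMax;
    }
}

// Hypothetical stand-in for one StructureData record.
class PointObservation {
    final double lat, lon;
    final Date time;
    final double value;

    PointObservation(double lat, double lon, Date time, double value) {
        this.lat = lat;
        this.lon = lon;
        this.time = time;
        this.value = value;
    }
}

class PointObsStore {
    private final List<PointObservation> observations = new ArrayList<PointObservation>();

    void add(PointObservation obs) {
        observations.add(obs);
    }

    /** Returns observations inside the box whose time lies in [start, end]. */
    List<PointObservation> getData(LatLonBox boundingBox, Date start, Date end) {
        List<PointObservation> result = new ArrayList<PointObservation>();
        for (PointObservation obs : observations) {
            boolean inTime = !obs.time.before(start) && !obs.time.after(end);
            if (inTime && boundingBox.contains(obs.lat, obs.lon)) {
                result.add(obs);
            }
        }
        return result;
    }
}
```

A linear scan is enough to show the semantics; a server holding large collections would presumably index by space and time instead.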

Station Data

Station Data Illustrations

StationObs Methods

// returns a List of Station
List getStations();

// returns a List of StructureData
List getData(Station s, Date start, Date end);

Trajectory Data

Trajectory Data Visualization

TrajectoryObs Methods

int getNumPoints();
StructureData getData(int point);

Radial Data

Radial Data Illustrations

Radial Methods

interface Radial {
    int getNumGates();
    float getData(int gate);

    float getStartingGate();
    float getGateSize();
    float getElevation();
    float getAzimuth();
    double getTime();
}
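To show how the per-ray quantities in an interface like this could be combined to locate a gate in space, here is a small sketch. The semantics are assumptions, not something the text above specifies: the starting gate is taken as the distance to the first gate, the gate size as the spacing between gates (both in meters), azimuth as degrees clockwise from north, and elevation as degrees above the horizon. The geometry is flat-earth and ignores beam refraction, and the RadialGeometry class is hypothetical.

```java
// Hypothetical helper; the meaning of each input is an assumption (see lead-in).
class RadialGeometry {

    /** Distance along the beam to the given gate index. */
    static double rangeToGate(double startingGate, double gateSize, int gate) {
        return startingGate + gate * gateSize;
    }

    /**
     * Offsets {east, north, up} of a point on the beam relative to the radar,
     * using a flat-earth approximation with no refraction correction.
     */
    static double[] gateOffset(double range, double azimuthDeg, double elevationDeg) {
        double az = Math.toRadians(azimuthDeg);
        double el = Math.toRadians(elevationDeg);
        double horizontal = range * Math.cos(el); // projection onto the ground plane
        return new double[] {
            horizontal * Math.sin(az), // east
            horizontal * Math.cos(az), // north
            range * Math.sin(el)       // up
        };
    }
}
```

Real radar processing would account for Earth curvature and refraction, so this should be read only as an illustration of why each ray carries its own azimuth, elevation, and gate geometry.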

Grid Data

Grid Data Illustrations

Grid Methods

interface GridCoordSys {
    CoordinateAxis getTaxis();
    CoordinateAxis getXaxis();
    CoordinateAxis getYaxis();
    CoordinateAxis getZaxis();
    Projection getProjection();
}

Array getDataCube(Range time, Range z, Range y, Range x);
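A call such as getDataCube selects an index range along each of the four axes. The sketch below shows what that subsetting could look like on a grid stored as a flat, row-major (time, z, y, x) array. The storage layout, the inclusive IndexRange class, and the GridSubsetter helper are all assumptions made for illustration; they are not the nj22 Range or Array classes.

```java
// Hypothetical inclusive index range; not the nj22 Range class.
class IndexRange {
    final int first;
    final int last; // inclusive

    IndexRange(int first, int last) {
        if (first < 0 || last < first) {
            throw new IllegalArgumentException("Invalid range " + first + ".." + last);
        }
        this.first = first;
        this.last = last;
    }

    int length() {
        return last - first + 1;
    }
}

class GridSubsetter {

    /**
     * Copies the selected hyperslab out of a flat row-major array whose
     * shape is {nt, nz, ny, nx}. The result is also row-major.
     */
    static float[] getDataCube(float[] data, int[] shape,
                               IndexRange t, IndexRange z, IndexRange y, IndexRange x) {
        int nz = shape[1], ny = shape[2], nx = shape[3];
        if (t.last >= shape[0] || z.last >= nz || y.last >= ny || x.last >= nx) {
            throw new IllegalArgumentException("Range exceeds grid shape");
        }
        float[] out = new float[t.length() * z.length() * y.length() * x.length()];
        int k = 0;
        for (int ti = t.first; ti <= t.last; ti++) {
            for (int zi = z.first; zi <= z.last; zi++) {
                for (int yi = y.first; yi <= y.last; yi++) {
                    for (int xi = x.first; xi <= x.last; xi++) {
                        // Row-major offset of element (ti, zi, yi, xi).
                        out[k++] = data[((ti * nz + zi) * ny + yi) * nx + xi];
                    }
                }
            }
        }
        return out;
    }
}
```

Selecting a single time step is just a range with first equal to last, which is one reason a range-per-axis signature covers both slices and full cubes.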

Image Visualization

Swath Illustration

Standardizing NetCDF Formats

For gridded datasets, the CF-1 Convention is the accepted standard. However, CF-1 still needs improvements to handle the output of some of the newer forecast models such as the WRF and to deal with GIS information -- especially in cases where the datum describing the exact shape of the Earth is required. <<Ben needs to get this section right>>

For radar data, a “Radar Exchange Format” is in use and evolving within the radar community (led by NCAR ATD). Unidata is working on a set of Observation Dataset Conventions for Point Observation Datasets.

CDM Implementations NetCDF-4

NetCDF-4 C Library

NetCDF-4 Status

NetCDF 4.0 Beta, which implements the CDM access layer, is essentially complete, but waiting for HDF5 release 1.8 to finalize the file format. In the meantime, release 4.1 will add Coordinate Systems and a future 4.? will merge OPeNDAP access (pending funding).

NetCDF-Java 2.2 A prototype implementation of the CDM is available as a part of netCDF-Java 2.2 (nj22). It currently supports several file formats:

General: NetCDF, HDF5, OPeNDAP
Grids: GRIB1, GRIB2
Radar: NEXRAD, NIDS, DORADE
Satellite: DMSP, GINI

In addition, it provides access to THREDDS catalogs and implements NcML.

As a reminder, the CDM layer diagram is shown below followed by a depiction of the architecture as implemented in nj22.

Common Data Model Layers

NetCDF Java Version 2 Architecture

NetCDF Java 2.2 Status

At present, the Data Access layer is available at Beta quality, but it is also waiting for the HDF5 release to finish the NetCDF-4 component. However, users can still safely commit to the API at this time. The Coordinate Systems layer is at an "early Beta" stage, so it is a little rougher than the Data Access layer. The CDM staff is still finishing documentation and runtime pluggability. The Data Types layer is still very much under development. It is at the Alpha stage, where there is still experimentation with the APIs.

NetCDF Markup Language (ncML)

NcML is an XML representation of netCDF metadata. It can be thought of as similar to the output of "ncdump -h." In the other direction, it can also be used as a way to specify a new netCDF file to be created. In that sense, it is like ncgen. It can also be used to modify existing datasets or expand on the metadata associated with them: it can add, delete, or rename elements of netCDF files, such as attributes and variables. In addition, one can use ncML to create logical sections of existing variables and to create unions and aggregations of multiple existing datasets.

NcML Example

<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/schemas/netcdf/ncml-2.2"
        location="/data/nids/N0R_20041119_2147">
  <attribute name="DataType" value="Radar" />
  <remove type="attribute" name="password" />
  <variable name="Reflectivity" orgName="R34768">
    <attribute name="units" value="dBZ" />
  </variable>
</netcdf>
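Because NcML is ordinary XML, a document like the one above can be inspected with any XML parser. The following sketch uses only the DOM parser from the Java standard library to read the wrapped dataset location and to list which variables are renamed via orgName. It is not how netCDF-Java processes NcML; the NcmlInspector class and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of netCDF-Java.
class NcmlInspector {

    // The NcML example from the text, with straight quotes.
    static final String EXAMPLE =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<netcdf xmlns=\"http://www.unidata.ucar.edu/schemas/netcdf/ncml-2.2\"\n"
        + "        location=\"/data/nids/N0R_20041119_2147\">\n"
        + "  <attribute name=\"DataType\" value=\"Radar\" />\n"
        + "  <remove type=\"attribute\" name=\"password\" />\n"
        + "  <variable name=\"Reflectivity\" orgName=\"R34768\">\n"
        + "    <attribute name=\"units\" value=\"dBZ\" />\n"
        + "  </variable>\n"
        + "</netcdf>\n";

    /** Parses an NcML string into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse NcML", e);
        }
    }

    /** The dataset that the NcML wraps, from the root element's location attribute. */
    static String datasetLocation(Document doc) {
        return doc.getDocumentElement().getAttribute("location");
    }

    /** Maps each new variable name to the original name it replaces. */
    static Map<String, String> renamedVariables(Document doc) {
        Map<String, String> renames = new LinkedHashMap<String, String>();
        NodeList vars = doc.getElementsByTagNameNS("*", "variable");
        for (int i = 0; i < vars.getLength(); i++) {
            Element v = (Element) vars.item(i);
            if (v.hasAttribute("orgName")) {
                renames.put(v.getAttribute("name"), v.getAttribute("orgName"));
            }
        }
        return renames;
    }
}
```

For the example document, the location is the NIDS file path and the only rename maps "Reflectivity" to the original name "R34768".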

NcML Aggregation

<<Need John's notes in this area.>>

Union

Join (Existing)

Join (New)

Forecast Model Run

NcML Aggregation Example

The following example illustrates the use of ncML to aggregate a time series of Pressure and Temperature data using a "join" over multiple files in the directory "C:/data/goes/" with suffix ".gini".

<netcdf xmlns="http://www.unidata.ucar.edu/schemas/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinNew">
    <variableAgg name="Temperature"/>
    <variableAgg name="Pressure"/>
    <scan location="C:/data/goes/" suffix=".gini"/>
  </aggregation>
</netcdf>
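Conceptually, a "joinNew" aggregation stacks the same variable from each file along a new outer dimension (here, time), so a two-dimensional field per file becomes a three-dimensional variable in the aggregated dataset. The sketch below illustrates only that stacking step on in-memory arrays; it does not read GINI files or use the netCDF-Java aggregation code, and the JoinNewSketch class is hypothetical.

```java
import java.util.List;

// Hypothetical illustration of joinNew semantics on plain arrays.
class JoinNewSketch {

    /**
     * Stacks one (y, x) slice per file into a (time, y, x) array.
     * All slices must have the same shape, as the aggregated variable
     * needs consistent inner dimensions.
     */
    static float[][][] joinNew(List<float[][]> perFileSlices) {
        if (perFileSlices.isEmpty()) {
            throw new IllegalArgumentException("Nothing to aggregate");
        }
        int ny = perFileSlices.get(0).length;
        int nx = ny == 0 ? 0 : perFileSlices.get(0)[0].length;
        float[][][] out = new float[perFileSlices.size()][ny][];
        for (int t = 0; t < perFileSlices.size(); t++) {
            float[][] slice = perFileSlices.get(t);
            if (slice.length != ny) {
                throw new IllegalArgumentException("Slice " + t + " has a different y size");
            }
            for (int y = 0; y < ny; y++) {
                if (slice[y].length != nx) {
                    throw new IllegalArgumentException("Slice " + t + " has a different x size");
                }
                out[t][y] = slice[y].clone(); // copy so the result owns its data
            }
        }
        return out;
    }
}
```

A "joinExisting" aggregation differs in that the files already contain the join dimension and are concatenated along it rather than stacked along a new one.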

The THREDDS Data Server (TDS)

The THREDDS (THematic Real-time Environmental Distributed Data Services) Data Server integrates data access with THREDDS catalogs and services. The TDS is written as a Tomcat servlet in 100% Java and is available as a single WAR file. Client applications access server catalog information from a THREDDS catalog contained in an XML file on the server. The TDS accesses data through the netCDF Java 2.2 library. It serves data to client applications via a number of alternative protocols:

OPeNDAP
HTTP Server
OGC Web Coverage Server (gridded)

The schematic architecture is shown in the diagram below.

THREDDS Data Server (TDS) Schematic

TDS as a Gateway

Because the TDS uses the netCDF Java library to access data, it can access data via the OPeNDAP protocol. Hence the data may actually be stored on a remote machine in a format accessible via OPeNDAP. The netCDF Java library takes care of the needed transformations for serving the data to the WCS client application. In other words, one TDS host can provide WCS access to datasets stored on a number of distributed OPeNDAP servers. This service is shown in the diagram below.

TDS as a Gateway between WCS Client and remote OPeNDAP Servers
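From the client's side, the gateway looks like any other WCS endpoint: the client sends a GetCoverage request and never sees the OPeNDAP hop behind it. As a rough sketch, the following builds such a request URL using the key-value parameter names of WCS 1.0.0. The endpoint URL, coverage name, and parameter values in the usage note are hypothetical, and a given server may require additional parameters (for example grid size or resolution) that are omitted here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; parameter names follow the WCS 1.0.0 key-value encoding.
class WcsRequestSketch {

    /** Builds a GetCoverage URL for the given endpoint and selection. */
    static String getCoverageUrl(String endpoint, String coverage, String crs,
                                 String bbox, String time, String format) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("SERVICE", "WCS");
        params.put("VERSION", "1.0.0");
        params.put("REQUEST", "GetCoverage");
        params.put("COVERAGE", coverage);
        params.put("CRS", crs);
        params.put("BBOX", bbox);   // minx,miny,maxx,maxy
        params.put("TIME", time);
        params.put("FORMAT", format);

        StringBuilder url = new StringBuilder(endpoint).append('?');
        boolean first = true;
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(p.getKey()).append('=').append(encode(p.getValue()));
            first = false;
        }
        return url.toString();
    }

    /** Percent-encodes a query value as UTF-8. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

For example, getCoverageUrl("http://example.org/thredds/wcs/some/dataset", "Temperature", "OGC:CRS84", "-110.0,30.0,-100.0,40.0", "2006-01-15T12:00:00Z", "GeoTIFF") yields a single URL that a WCS client could issue, with the reserved characters in each value percent-encoded.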

The TDS and ncML The netCDF Markup Language (ncML) can enhance the TDS capabilities. A TDS can be configured to serve datasets “wrapped” by ncML. However, the client application sees only the protocol interface (OPeNDAP or WCS), not the NcML. With this approach, the server is able to “fix” metadata problems or augment metadata -- without changes to the data files themselves. One important function of the ncML approach is to implement aggregation of data contained in multiple files on the TDS. In that regard, the combination of TDS with ncML replaces the old OPeNDAP/DODS “Aggregation Server.”

TDS and ncML

TDS and Standard Protocols

The GALEON (Geo-interface for Air, Land, Earth, Ocean NetCDF) Interoperability Experiment within the Open Geospatial Consortium (OGC) is an effort to determine the suitability and effectiveness of the OGC Web Coverage Service (WCS) interface to datasets available on servers within the "Fluid Earth Sciences (FES)" community -- mainly atmospheric sciences and oceanography. Most of those datasets are available via netCDF, HDF, and OPeNDAP interfaces and protocols. Thus the TDS utilizing the CDM is an ideal combination of tools for exposing many of these datasets via the WCS interface.

The WCS protocol interface to underlying TDS components is shown in the diagram below.

WCS Protocol Gateway Interface to TDS Components

It should be noted that geoTIFF and GML (Geography Markup Language) are among the five binary encoding formats currently recognized in the WCS specification while netCDF is not. Among the recommendations stemming from GALEON is that netCDF be added to the list as a sixth WCS encoding format. Another key effort within GALEON is the development of an "applications profile schema" called ncML-GML which is an XML dialect which combines GML specifications for coordinate system metadata with ncML constructs for the scientific metadata.

An early version of ncML-GML is described in Design and implementation of netCDF markup language (NcML) and its GML-based extension (NcML-GML), Computers & Geosciences, Volume 31, Issue 9, November 2005, Pages 1104-1118 (http://www.sciencedirect.com/science/article/B6V7D-4GHSGN4-2/2/6bc151125c99352396f3aa7c630919e4). More information about GALEON, including project status updates, is available in a GALEON wiki: http://galeon-wcs.jot.com/WikiHome. A second GML application profile called the Climate Science Modelling Language (CSML, http://ndg.nerc.ac.uk/csml/) is under development at the Natural Environment Research Council (NERC) in the UK.

TDS and Digital Libraries

Combined with THREDDS catalogs and ncML, the TDS provides a framework to add metadata on the server that can be incorporated into Digital Libraries and other Discovery Services such as the NASA GCMD (Global Change Master Directory) and the NCAR CDP (Community Data Portal). This can be accomplished by hand for metadata descriptions at the collection level, where entire classes of data encompassing many datasets are cataloged. Alternatively, it can be implemented at a much finer granularity (inventory level) using tools for automatic extraction of metadata from the datasets themselves. This metadata can be made available for harvesting by Digital Libraries, or the TDS can be configured to send records to existing discovery centers. Note that there is no search system built in as part of the current TDS. For now, the assumption is that search systems will be provided by the discovery centers.

The diagram below shows the role of the TDS in providing search metadata from distributed data servers to Digital Libraries and other discovery centers.

The TDS, Digital Libraries, and Other Discovery Centers

Future Plans

NetCDF-Java

The immediate goals for the netCDF Java effort are to get the APIs stable, write the documentation, and provide for runtime pluggability. As netCDF-4 is completed, it will be incorporated into the Java API. Beyond that, efforts to fully integrate HDF4, HDF-EOS, and BUFR will require additional funding.

NetCDF-4 C Library At this time, it is not completely clear which netCDF Java capabilities will be incorporated in the netCDF-4 C library. For example, the Scientific DataTypes are simply too immature to port. Moreover it is uncertain how and whether ncML will become part of the C Library. Java on the server

TDS The most immediate TDS task is to complete the aggregation functionality. Future functionality and datasets to be addressed in the TDS will be driven largely by the needs of the Unidata community as evidenced in their use of the prototype TDS running on the server known as "motherlode" where the data from Unidata's real-time push data delivery system (IDD) are made available via TDS.

Additional enhancements under consideration at this time include:

Pluggable authorization
Access control by dataset
Improved performance
Services

Coordinate System Verifier (e.g. CF-1)

Data access Subset and get netcdf file

References Unidata Glossary http://www.unidata.ucar.edu/publications/acronyms/glossary.html

Document listing netCDF, CF Conventions, ncML, ncML-GML, OGC, ISO web pages http://www.unidata.ucar.edu/projects/THREDDS/GALEON/netcdfAndCFwebpages.html

netCDF: http://www.unidata.ucar.edu/software/netcdf/

The NetCDF Users' Guide: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf.html

NetCDF Java: http://www.unidata.ucar.edu/software/netcdf-java/

Common Data Model: http://www.unidata.ucar.edu/software/netcdf/CDM/index.html

Climate and Forecast (CF) Metadata: http://www.cgd.ucar.edu/cms/eaton/cf-metadata/

CF standard name table: http://www.cgd.ucar.edu/cms/eaton/cf-metadata/standard_name.html

Standard Units: http://www.unidata.ucar.edu/software/udunits/

BADC Datasets: CF conventions: http://badc.nerc.ac.uk/help/formats/netcdf/index_cf.html

NetCDF Markup Language (ncML): http://www.unidata.ucar.edu/software/netcdf/ncml/

NcML Coordinate System Extension (NcML-CS): http://www.unidata.ucar.edu/software/netcdf-java/CoordinateAttributes3.html

Design and implementation of netCDF markup language (NcML) and its GML-based extension (NcML-GML), Computers & Geosciences, Volume 31, Issue 9, November 2005, Pages 1104-1118. http://www.sciencedirect.com/science/article/B6V7D-4GHSGN4-2/2/6bc151125c99352396f3aa7c630919e4

NcML Geography Markup Language (NcML - GML): http://www.gmldays.com/gml2005/presentations/ncML-GML%20v.0.3.2,%20Ben%20Domenico.pdf

http://www.unidata.ucar.edu/projects/THREDDS/

http://hdf.ncsa.uiuc.edu/

http://www.unidata.ucar.edu/

http://galeon-wcs.jot.com/WikiHome

Climate Science Modeling Language (CSML): http://ndg.nerc.ac.uk/csml/

http://www.nerc.ac.uk/

https://cdp.ucar.edu/

http://gcmd.gsfc.nasa.gov/

Unification of the Georeferencing Systems of GIS Spatial Data Infrastructure http://www.gisdevelopment.net/application/miscellaneous/me05_017.htm


Unidata is a member of the UCAR Office of Programs, is managed by the University Corporation for Atmospheric Research, and is sponsored by the National Science Foundation. P.O. Box 3000 • Boulder, CO 80307-3000 USA • Tel: 303-497-8643 • Fax: 303-497-8690