The Federated Data System DataFed:
Experiences in Data Homogenization and Networking
R.B. Husar, K. Hoijarvi, S. R. Falke, E. M. Robinson, Washington University, St. Louis
G. Leptoukh, NASA GSFC
Spring AGU, May 29, 2008, Ft. Lauderdale
DataFed in a Nutshell:
A Federation of autonomous, distributed data providers
Performs non-intrusive wrapping of data into web services
Provides service-based analysis services and tools
General Experience with DataFed:
It is an agile virtual data system that can deliver info products to diverse users
Third-party mediation can homogenize distributed data on the fly
Since 2005, DataFed has been used by EPA and in research
DataFed Motivated by GEOSS
DataFed development is guided by the meme of GEOSS
Five practices for agile, seamless data federation:
1. Space-Time Query for standardized access to all data (WCS)
2. Data Wrappers for turning heterogeneous data into web services
3. Data Mediators for transforming data into ‘Views’
4. Mashups for connecting autonomous applications
5. DataSpaces for shared metadata by the users, for the users
Parameter-Space-Time Query Using OGC WCS Data Access Protocol
Regardless of the data location, data type and format,
• the parameter-space-time query is the same
• the return is in user selectable format from the offerings
Grid: Coverage=THEEDDS.T&BBOX=-126,24,-65,52,0,0&TIME=2002-07-07/2002-07-07&FORMAT=NetCDF
Image: Coverage=SEAW.Refl&BBOX=-126,24,-65,52,0,0&TIME=2002-07-07/2002-07-07&FORMAT=GeoTIFF
Station Data: Coverage=SURF.Bext&BBOX=-126,24,-65,52,0,0&TIME=2002-07-07/2002-07-07&FORMAT=NetCDF-table
Query parts: Parameter (Coverage), Bounding Box (BBOX), Time Range (TIME), Output Format (FORMAT)
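As a rough illustration of why the query stays the same across data types, the sketch below assembles the three example queries from one function. The endpoint `http://example.org/wcs` and the helper name `build_query` are assumptions made for illustration; the slides give no server address. A real OGC WCS request would also carry service and request parameters that the slide omits.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the slides do not give a server address.
BASE_URL = "http://example.org/wcs"


def build_query(coverage, bbox, time_start, time_end, fmt):
    """Assemble a parameter-space-time query in the style shown on the slide."""
    params = {
        "Coverage": coverage,                       # parameter
        "BBOX": ",".join(str(v) for v in bbox),     # bounding box
        "TIME": f"{time_start}/{time_end}",         # time range
        "FORMAT": fmt,                              # output format
    }
    # safe=",/" keeps commas and the slash readable instead of percent-encoding them
    return f"{BASE_URL}?{urlencode(params, safe=',/')}"


BBOX = (-126, 24, -65, 52, 0, 0)
DAY = "2002-07-07"

# Only the coverage name and the requested format change between data types.
grid_url = build_query("THEEDDS.T", BBOX, DAY, DAY, "NetCDF")
image_url = build_query("SEAW.Refl", BBOX, DAY, DAY, "GeoTIFF")
station_url = build_query("SURF.Bext", BBOX, DAY, DAY, "NetCDF-table")
```

The bounding box and time range are shared by all three calls, which is the point of the slide: the caller does not need to know where or how each dataset is stored.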
DataFed wrappers are non-intrusive, third party
Third Party Data Wrappers
Heterogeneous input data >>> Homogeneous (WCS) Query
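One way to picture a third-party wrapper is as a thin adapter that leaves the provider's data untouched and exposes a common query method. The sketch below is only an illustration under assumed conditions: the class names, the CSV column layout, and the grid layout are all hypothetical and are not taken from DataFed.

```python
import csv
import io


class CsvStationWrapper:
    """Wraps station data kept as CSV text (assumed columns: lon,lat,date,value)."""

    def __init__(self, text):
        self._rows = list(csv.DictReader(io.StringIO(text)))

    def query(self, bbox, date):
        west, south, east, north = bbox
        return [
            {"lon": float(r["lon"]), "lat": float(r["lat"]),
             "date": r["date"], "value": float(r["value"])}
            for r in self._rows
            if r["date"] == date
            and west <= float(r["lon"]) <= east
            and south <= float(r["lat"]) <= north
        ]


class GridWrapper:
    """Wraps gridded fields stored as {date: 2-D list} with a known origin and cell size."""

    def __init__(self, fields, west, south, step):
        self._fields, self._west, self._south, self._step = fields, west, south, step

    def query(self, bbox, date):
        west, south, east, north = bbox
        out = []
        for j, row in enumerate(self._fields.get(date, [])):
            lat = self._south + j * self._step
            for i, value in enumerate(row):
                lon = self._west + i * self._step
                if west <= lon <= east and south <= lat <= north:
                    out.append({"lon": lon, "lat": lat, "date": date, "value": value})
        return out


# Two very different sources answer the same space-time query.
stations = CsvStationWrapper(
    "lon,lat,date,value\n-90.2,38.6,2002-07-07,0.12\n-10.0,50.0,2002-07-07,0.30\n"
)
grid = GridWrapper({"2002-07-07": [[1.0, 2.0], [3.0, 4.0]]},
                   west=-100.0, south=30.0, step=50.0)
box = (-126, 24, -65, 52)
station_hits = stations.query(box, "2002-07-07")
grid_hits = grid.query(box, "2002-07-07")
```

Both wrappers return records with the same keys, so downstream code can treat them alike without knowing the original storage format.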
Mediated User-Data Interface
Mediator turns data into Views
Mediated Integration is a flexible design pattern for System of Systems
Client-Server design is demanding:
User carries the burden of integration
[Diagram labels: Query, Data Views, SOAP, RDF, Mashup Workflow]
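To make the idea of a mediator that turns data into Views a little more concrete, the sketch below derives two views from one set of homogenized records. The record layout, the sample values, and the function names `map_view` and `time_view` are assumptions chosen for illustration, not DataFed's actual interfaces.

```python
# Hypothetical homogenized records, as a mediator might receive them from wrappers.
SAMPLE = [
    {"lon": -90.2, "lat": 38.6, "date": "2002-07-08", "value": 0.15},
    {"lon": -90.2, "lat": 38.6, "date": "2002-07-07", "value": 0.12},
    {"lon": -87.6, "lat": 41.9, "date": "2002-07-07", "value": 0.20},
]


def map_view(records, date):
    """Spatial slice: (lon, lat, value) for every record on one date."""
    return [(r["lon"], r["lat"], r["value"]) for r in records if r["date"] == date]


def time_view(records, lon, lat):
    """Temporal slice: (date, value) pairs at one location, sorted by date."""
    return sorted(
        (r["date"], r["value"])
        for r in records
        if r["lon"] == lon and r["lat"] == lat
    )


day_map = map_view(SAMPLE, "2002-07-07")
series = time_view(SAMPLE, -90.2, 38.6)
```

The user asks for a view and the mediator does the slicing, which is the contrast the slide draws with a client-server design where the user carries the burden of integration.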
Mashups: Loose Coupling of Autonomous Applications
DataFed - Wiki - GoogleEarth
DataSpaces for Datasets
[Figure: GEOSS Core Service Offerors and Users. Components: GEOSS Comp. Registry, GEOSS Clearinghouse, Standards; SIF Registry, Community AQ Portal, Community AQ Catalog, Community DataSpaces, Services, Service Workflow. Actors: Service Offeror, Catalog User, Data Analyst, Decision Maker, Policy Analyst. Relation labels: registers, publishes, references, provides, catalog list, searches, harvests, extracts, invokes, find, links to, composes, visualizes, views, report, reports to, informs.]
Adapted from Percivall, Feb 2008 by R. Husar, March 2008
Shared Metadata by the Users, for the Users
Wiki ‘DataSpaces’
Creating and Sharing Metadata
Community Catalog - Find Dataset
Describe Dataset
Discuss Dataset
ESIP Communal Wiki
• Semantic Wiki: Structured (RDF) and Unstructured Content
• Open, Standard Metadata - RDF
• Ready for Export/Harvesting by Registries, Catalogs
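As an illustration of what export-ready RDF metadata might look like, the sketch below writes a dataset description as N-Triples using Dublin Core terms. The dataset URI, the sample values, and the function names are assumptions; the slides do not say which vocabulary or serialization the wiki uses.

```python
# Dublin Core terms namespace (a real, widely used metadata vocabulary).
DCTERMS = "http://purl.org/dc/terms/"


def literal(text):
    """Quote a string as an N-Triples literal, escaping backslash, quote and newline."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def describe_dataset(uri, title, description, identifier):
    """Return one N-Triples line per metadata field for the given dataset URI."""
    pairs = [
        ("title", title),
        ("description", description),
        ("identifier", identifier),
    ]
    return "\n".join(f"<{uri}> <{DCTERMS}{p}> {literal(v)} ." for p, v in pairs)


# Hypothetical dataset URI and descriptive text, for illustration only.
triples = describe_dataset(
    "http://example.org/dataset/SURF.Bext",
    "Example station dataset",
    'Community-edited description with a "quoted" note',
    "SURF.Bext",
)
```

Because each line is a self-contained subject, predicate, object statement, a registry or catalog can harvest the output without knowing anything about the wiki page it came from.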
Sharing Best Practices: GEO Best Practice Wiki
Developments and Challenges:
Favorable Engineering Developments:
• A core network for Air Quality data sharing is emerging
• Standards are available for sharing previously unstructured data
• Third-party mediation can homogenize the distributed data
• Agile SOA-based systems can deliver info products to diverse users
• Since 2005, one such IS, DataFed, has been used by EPA and in research
However:
• Service interfaces are still uneven; networks are still fragile
• The utility of social networking in science is not understood
• Users cannot provide feedback to upstream providers
• Many cultural, legal and other barriers hamper progress
ESIP Coordination Application