The NERC DataGrid
Post on 19-Jan-2016
Bryan Lawrence, BADC
David Boyd, Deputy Director, CLRC e-Science centre
Kerstin Kleese, DL: Climate Database Expert
Roy Lowry, BODC: Marine Database Expert
Dean Williams, PCMDI: ESG Principal Investigator
Bob Drach, PCMDI: ESG Metadata Architecture
Mike Fiorino, PCMDI: Meteorologist
Acronym Summary:
PCMDI: Program for Climate Model Diagnosis and Intercomparison
(US Department of Energy, Lawrence Livermore National Laboratory)
ESG: Earth System Grid
(US Grid Project: NCAR, Argonne, PCMDI, USC …)
Outline
• Motivation
• The Earth System Grid
  – definitions of “portals” and applications
  – ontologies
• Relations with other NERC e-science programmes
• Architecture
  – querying
  – software stack
• Initial steps and project management
• Connectivity with other grid projects
• Success and failure
• Summary of what we are doing and the road to the future
The BADC – part of NCAS!
The Role: Key words: Curation and Facilitation!
http://www.badc.rl.ac.uk
Just under half of BADC users are NOT atmospheric scientists:
Motivation – Town meeting 2001
E-science should be involved with:
• delivering an enhanced metadata record of archived data.
• 'dictionary' building.
• building systems to translate data and link databases.
• integrating computer and natural science communities.
• the ability to generate a single query across multiple datasets (in different catalogues) returning both metadata and data.
• the ability to acquire large datasets in near real time (NRT).
• the automatic production of metadata, both by models and, where possible, by observing systems.
Summary from two of the four working groups!
Relevant to many stakeholders
Energy
Water Management
Food Chain
Health
WeatherRisk
(Slide from Julia Slingo’s introduction to CGAM as part of NCAS)
Motivation
Page 22:
NERC will … ensure that Earth system science is underpinned by e-science investments to enable access, manipulation … of data from diverse sources.
The Data Use Chain
Discovery
Authentication
Authorisation
Extraction
Sub-Sampling
Regridding
Processing Display
Delivery
Formatting
Time-line
NERC Metadata Gateway - SST
• Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!
• And if I want to compare data from different locations?
- multiple logins
- multiple formats
- discovery?
Searching: need comprehensive metadata!
A priori would any user know to look in the COAPEC data set?
Earth system-science means we have to remove these boundaries!
• detailed file level metadata isn’t visible, and so data mining applications impossible.
- need ontologies to help queries match actual data descriptions.
NB: Dynamic catalogues!
What is an Ontology?
An ontology defines the terms used to describe and represent an area of knowledge by specifying the following kinds of concepts:
• Classes (general things) in the many domains of interest
• The relationships that can exist among things
• The properties (or attributes) those things may have
Ontologies are usually expressed in a logic-based language, so that detailed, accurate, consistent, sound, and meaningful distinctions can be made among the classes, properties, and relations.
Ontology Example:
An example of part of an ontology defined using OIL (e.g. see "OIL in a Nutshell", D. Fensel et al.):
ontology-definitions
  slot-def eats
    inverse is-eaten-by
  slot-def has-part
    inverse is-part-of
    properties transitive
  class-def animal
  class-def plant
    subclass-of NOT animal
  class-def tree
    subclass-of plant
  class-def branch
    slot-constraint is-part-of
      has-value tree
  class-def leaf
    slot-constraint is-part-of
      has-value branch
  class-def defined carnivore
    subclass-of animal
    slot-constraint eats
      value-type animal
  class-def defined herbivore
    subclass-of animal
    slot-constraint eats
      value-type plant OR (slot-constraint is-part-of has-value plant)
  class-def giraffe
    subclass-of animal
    slot-constraint eats
      value-type leaf
  class-def lion
    subclass-of animal
    slot-constraint eats
      value-type herbivore
With current funding, the NDG does not aim to build a formal ontology, but we do aim to begin building a thesaurus that can form the basis of one, and we hope to spin off a project to build one and integrate it into the NDG.
Relationships
Classes
Properties
(OIL: Ontology Inference Layer)
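The classes, relationships and properties in the OIL fragment can be mimicked in a few lines of Python to show what a query engine gains from them. This is only an illustrative sketch: the dictionaries and the helper names (is_a, part_of, is_plant_material) are assumptions introduced here, not part of OIL or of any NDG software.

```python
# Illustrative sketch only: a tiny hand-rolled model of the OIL example.
# SUBCLASS_OF, PART_OF and the helper functions are hypothetical names.

SUBCLASS_OF = {
    "tree": "plant",
    "giraffe": "animal",
    "lion": "animal",
}

# Direct is-part-of links; the slot is declared transitive in the OIL text.
PART_OF = {
    "leaf": "branch",
    "branch": "tree",
}


def is_a(cls, ancestor):
    """Follow subclass-of links upwards from cls looking for ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def part_of(thing, kind):
    """True if thing is (transitively) part of something that is a `kind`."""
    whole = PART_OF.get(thing)
    while whole is not None:
        if is_a(whole, kind):
            return True
        whole = PART_OF.get(whole)
    return False


def is_plant_material(thing):
    """A plant, or a part of a plant: what the herbivore definition allows."""
    return is_a(thing, "plant") or part_of(thing, "plant")


# A giraffe eats only leaves; a leaf is part of a branch, which is part of a tree.
giraffe_is_herbivore = is_a("giraffe", "animal") and is_plant_material("leaf")
```

The point of the sketch is that nobody stated "a giraffe is a herbivore"; it follows from the definitions, which is the kind of inference that could let a query match data described in different terms.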
ESG: Example of a Web-based Data Portal
ESG will provide support for:
• large but simple data sets,
• limited metadata, but not searchable.
NDG will provide support for
•Small-but-complex datasets.
•Data-mining (searchable metadata).
NDG is complementary to ESG!
Live Access Server (1)
… we will keep the basic structure, but gradually replace components.
Live Access Server (2)
Data Request Structure:
ESG: Example of a Client Application
We will:
• Provide Python-based classes for our observational data to complement the access to 3D gridded data.
• Provide a web services wrapper so that other grid applications can access NDG data.
Applications and Portals
[Diagram: NDG applications and portals. Labels recoverable from the figure:
• NERC Grid: BADC NDG Wrapper, BODC NDG Wrapper, Group NDG Wrapper; Online Data; XML database; tape robot; Research Group Data Sources; Satellite; Supercomputer; Software Agent; Grid User.
• Wider Internet: NDG Web Portal; XML database; ESG (& other) Applications; Internet User; Internet Link.]
Relationship to GODIVA (Haines et al.)
(Grid for Ocean Diagnostics, Interactive Visualisation and Analysis)
[Figure: Architecture of the GODIVA Grid]
NDG will:
• improve data discovery tools for GODIVA (even for their own datasets).
• provide metadata creation tools for GODIVA participants.
• provide access to data held outside GODIVA participants.
The GODIVA team have already discovered issues with the XML database interface they are going to use.
ClimatePrediction.com
[Diagram labels: Scientific investigators; Participants & policy-makers; Summary statistics; 100Tb of key output at 10-20 sites; 1Pb total output on 1M participants’ PCs; ESG-II/NERC DataGrid; GridFTP; HTTP (DODS URL); Live Access Server; HTTP; Datamining; Peer-to-peer; visualisation; Conventional FTP/HTTP; Obs]
CP.COM will need the NDG to make best use of observational data in evaluating their parameter space.
Mining on the Grid
[Diagram labels: three Grid Mining Agents, each listed with a Grid Processor; Satellite Data Archive X; Satellite Data Archive Y]
From Hinke’s NASA IPG presentation at CEOS, Rome, May 2002
Data mining: Grid Miner Architecture
[Diagram labels: IPG Mining Agents; IPG Processors; Mining Daemon; Control Database; Mining Operations Repository; Mining Config Info; Data Archive X; Satellite Data Archive Y]
From Hinke’s NASA IPG presentation at CEOS, Rome, May 2002
The devil is in the detail: how does the data mining agent get at the data?
Need data mining clients – objects which can read specific datatypes and present themselves to agents!
Finding data: Querying!
• Requires databases of metadata & querying those databases.
• Each part of the NDG will have an internal metadata catalogue (&/or database), and data (either in flat files or the database).
  – so the querying strategy must support centralised querying on partially indexed data, followed (if necessary) by distributed querying, which may or may not need mapping into a local database schema.
  – In the grid environment the indexes themselves will be replicated, and some data may also be replicated.
• Major NDG design issue: developing appropriate data models, database schema and indexing strategies!
  – This is not a generic problem, it will be specific to our datatypes.
  – Technology needs to be public domain (i.e. free) for uptake!
  – NDG approach to database technology will be developed in conjunction with DBTF.
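The two-step strategy described above (a centralised query against a partial index, followed where necessary by distributed queries at each site) can be sketched as follows. Everything here is hypothetical: the site names are reused only as labels, and the index contents, record fields and function names are invented for illustration and do not describe any actual NDG schema.

```python
# Sketch only: central partial index first, then per-site catalogue queries.
# All names and the sample records are hypothetical.

# Central index: holds only coarse, dataset-level keywords per site.
CENTRAL_INDEX = {
    "BADC": {"temperature", "wind"},
    "BODC": {"temperature", "salinity"},
}

# Each site's own catalogue holds the finer-grained records.
SITE_CATALOGUES = {
    "BADC": [
        {"dataset": "analysis-A", "parameter": "temperature", "year": 2000},
        {"dataset": "analysis-A", "parameter": "wind", "year": 2001},
    ],
    "BODC": [
        {"dataset": "cruise-B", "parameter": "temperature", "year": 2001},
        {"dataset": "cruise-B", "parameter": "salinity", "year": 2001},
    ],
}


def candidate_sites(parameter):
    """Centralised step: which sites might hold this parameter?"""
    return sorted(
        site for site, params in CENTRAL_INDEX.items() if parameter in params
    )


def query_site(site, parameter, year):
    """Distributed step: ask one site's catalogue for matching records."""
    return [
        rec
        for rec in SITE_CATALOGUES[site]
        if rec["parameter"] == parameter and rec["year"] == year
    ]


def federated_query(parameter, year):
    """Run the central lookup, then collate the returns from each site."""
    results = []
    for site in candidate_sites(parameter):
        for rec in query_site(site, parameter, year):
            results.append({"site": site, **rec})
    return results
```

In this toy version the central index can only say "this site may have something"; the year filter is applied at the site, which is where a mapping into a local database schema would sit.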
Query Pathway; software components NERC DataGrid
[Flowchart: query pathway. Recoverable labels, in approximate flow order:
Layers: Application Level; New Data Interfaces and Services; Existing and Required Grid Middleware; Existing Data and Services.
Discovery and Extraction Path:
• User Query Interfaces: NERC, international, generic
• Query Distributor (Check Authentication)
• Query Handler; Parallel Queries; Collate Multiple Returns
• "Dataset" Catalogue Search (Check Authorisation against "Looking"); Reformat Metadata
• Response: DataSet Metadata
• User Assessment; if inadequate, Generate Expansion Query (e.g. time and space)
• Query Distributor (Check Authorisation against "Locating")
• Query Handler; Granule Catalogue Search; Return Satisfactory Granule Metadata
• Potentially Interesting Data Exists? Continue to Extraction? (Yes / No: exit or return to previous step at this level)
• Check Authorisation for "Extraction" (OK for extraction / Not OK)
• Define Requirements for Sub-Sampling and Reformatting
• Extract Data File; Sub-Sample and Reformat
• Network Path and Cache Identification; Deliver Data to Processor(s) (and cache)
• User Processing, Display and/or Visualisation
Other paths: Data Extraction Path for Known Datasets; New Model and Data Ingestion and Metadata Creation Interfaces; Data Path into Archive; Data and Metadata Archives.]
BNL V1.01 - 12/01
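The pathway checks authorisation at three points, labelled "Looking", "Locating" and "Extraction". A minimal sketch of that gating is given below. It assumes, purely for illustration, that the three levels are nested (permission to extract implies permission to locate and to look); the slide does not say so, and USER_LEVELS, authorised and run_pathway are hypothetical names.

```python
# Sketch only: the three authorisation checks named in the diagram,
# modelled as ordered levels. The nesting of levels is an assumption.

STAGES = ["Looking", "Locating", "Extraction"]

USER_LEVELS = {
    "alice": "Extraction",  # may search, locate granules and extract data
    "bob": "Looking",       # may only search dataset-level metadata
}


def authorised(user, stage):
    """True if the user's granted level reaches the requested stage."""
    granted = USER_LEVELS.get(user)
    if granted is None:
        return False
    return STAGES.index(granted) >= STAGES.index(stage)


def run_pathway(user):
    """Walk the stages in order and report how far the user gets."""
    completed = []
    for stage in STAGES:
        if not authorised(user, stage):
            break
        completed.append(stage)
    return completed
```

A user who fails a check simply stops at the previous stage, which corresponds to the "exit or return to previous step at this level" branch in the flowchart.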
NERC DataGrid Information Structure
[Diagram labels: Data Centre Data; RDBMS; RDBMS Ingestor; Data Files; Structured Ingestor (e.g. cdscan); Raw Data; Raw Ingestor (e.g. for PP & Grib data); Descriptor files; Docs; Documentation Ingestor; XML Ingestor; Catalogue XML; Granule Catalog; Data Catalogue database; Library Catalogue database; DMS (Data Manipulation System); DSS (Distributed Search System); Python API; Web Interface; GUI Interface]
Key: PCMDI Components; NDG Components; Joint Interfaces; Information Structure; Existing Components
Simplified Software Stack
Key point: make use of existing technology, allow component replacement with time!
Achievable by: interface definition and integration.
Note: Any application will be able to access our data services via the OGSA wrapper in the middleware.
[Diagram key: Existing ESG tools (but ANY application will be able to call NDG services); Globus Middleware Layer; New NDG Components; NDG Enhancements to existing ESG Components; Existing Data]
Software stack
NERC DataGrid Software Stack
[Diagram labels, roughly from top of stack to bottom: GUI Application; Web Client; Web Service/OGSA wrapper; NERC DataGrid API (Python); Query Handler; Processing Options (Python Packages); Data Access Instantiation; XML Parsing; Access & Authorisation; Object Orientated Class Definitions; XML Schema; XML Data I/O; XML Descriptor Files; DataBase API Libraries; Data File API Libraries; Data Files, Databases; Network Transport Layer - Globus GridFTP/DODS]
NDG: Ingestion Tasks
NERC DataGrid: BADC Data Ingestion (BNL 03/01/02)
Raw Data Input:
- dataset documentation
- binary data files
- possibly doc files with individual data files
Phase One: Produce "Self Describing Data" (e.g. NetCDF).
Phase Two: Generate XML Metadata (for the Granule Catalog, DataSet Catalog and Library Catalog).
Phase Three: Ingest Metadata into catalogues, and relocate files.
Normally desirable to directly ingest data already in self-describing format (along with additional documentation)!
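Phase Two (generating XML metadata from self-describing data) might look something like the sketch below, which uses only the Python standard library. The element and attribute names (granule, variable, time) and the input dictionary are assumptions made for illustration; the slides do not define the NDG catalogue schema, and a real ingestor would read the description from the data file itself.

```python
import xml.etree.ElementTree as ET


def granule_xml(description):
    """Build a granule-level XML record from a dict describing one data file.

    The element names used here are hypothetical, not an NDG schema.
    """
    granule = ET.Element("granule", attrib={"file": description["file"]})
    for name, units in description["variables"].items():
        ET.SubElement(granule, "variable", attrib={"name": name, "units": units})
    time = ET.SubElement(granule, "time")
    time.set("start", description["start"])
    time.set("end", description["end"])
    # encoding="unicode" makes tostring return a str rather than bytes.
    return ET.tostring(granule, encoding="unicode")


# Hypothetical description, standing in for what a self-describing file provides.
example = {
    "file": "example.nc",
    "variables": {"sst": "K"},
    "start": "2001-01-01",
    "end": "2001-12-31",
}
record = granule_xml(example)
```

The resulting string is what Phase Three would hand to a catalogue ingestor.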
Draft Project Schedule
Phase One Delivery
Metadata Gateway
[Diagram: Existing NERC Metadata Gateway (BADC perspective), SJP 12/06/01, BNL 02/01/02. Labels: User Web Browser (browse www.nmp.rl.ac.uk); Metadata Gateway (zgate); zserver and other zserver, reached over Z39.50; isite index; BADC SGML Metadata; other SGML Metadata; BADC Metadata INGRES Catalogue; BADC Metadata Dynamic HTML Dataset Pages; BADC Data, Docs & Static Web Pages; User FTP Interface; hosts badc.rl.ac.uk and tornado.badc.rl.ac.uk; "returns link to HTML pages".]
NB: All metadata is at the dataset collection level. No info for individual data files or fields! No actual data is returned!
NERC DataGrid: Phase One Architecture
[Diagram labels: User Web Browser; NERC DataGrid File Request Manager; PCMDI CDMS (LDAP registry); Live Access Server; BADC Disk Farms; BADC Tapes; New BADC Storage Environment; SRB MCAT; UM Data Files held in Uni Res. Grps; Ingres Metadata DB; Web Server Perl Scripts; Existing BADC Technology; NERC Metadata Gateway; dataflow pathway; registry pathway; "Replace with Globus Giggle?"]
Datasets supported at phase one will be existing 3D data such as ECMWF and Met Office UM analyses at the BADC, and UM simulation data in university groups.
Phase one depends on the integration of existing technologies:
- SRB
- LDAP
- CDAT/CDMS
- XML cataloguing
- Live Access Server
- Cookies, and Unix authentication
- wrapping Z39.50 in WSDL (Zoom)?
along with a new request manager.
Next steps include:
• Replacing the transport layers in the metadata gateway with SOAP
• Replacing the SGML in the metadata gateway with XML
… etc
Connectivity? Evolution! Innovation?
Plagiarism: Copying from one person
Research: Copying from many people
… we can’t afford to be too innovative!
[Diagram: the NERC DataGrid and related activities. Labels: ESG II; EU DataGrid WP9; UK DataBase Task Force; ClimatePrediction.com; U.S. Thredds/NOMADS; Digital Libraries (Zoom); Ontologies (NeSC, MyGrid); QinetiQ; CEOS; BNSC; CLRC e-science Data Portal; BADC; BODC; NEODC; Other DDC-CEH; PARADISE; GODIVA; ? Future ? Other Programmes]
Indicators of Success
Finding and making use of data:
• Possible to find, reformat, and visualise disparate datasets from disparate organisations within one application.
• No longer necessary to rely on personal contacts to locate and acquire data of interest if it’s held in the BADC/BODC.
• Key requirement for interdisciplinarity; the ability to test data comparison ideas without learning foreign formats and establishing personal relationships every time.
• Other NERC designated data centres implementing NDG.
Take up by community:
• NDG software (but not necessarily graphics tools) in use in GODIVA project and in wider UK university community (including data repositories in research groups).
• Earth System Grid uses NDG components.
Risks Of Failure
• Someone else does it first – unlikely!
• Performance too slow for users!
  – More cache and replication
  – Improve database performance (UK DBTF!)
  – Data-compression layer for XML
  – Reduce scope and search depth (don’t want to do this!)
• Globus 3 (OGSA) delivery heavily delayed
  – Web services implementation + Globus2 + datagrid service registry
• Availability of people with appropriate skills
  – re-deploy existing staff where possible
  – Schedule begins with three months training.
• ESG-II architecture delayed or incompatible with UK architecture
  – Close relationship with PCMDI means we will be able to proceed effectively anyway.
NDG expected evolution
[Diagram labels, with steps numbered 1 to 6 in the figure: XML Catalogue Server; Catalogue Client; Computation; Graphics (based on LAS); Local Catalogue; Catalogue Ingestor; Python API; Evolving to OGSA; Docs; Data Repositories (data files at NERC DDC and other, e.g. PML/ESSC); Satellite; At USER Institution]
Beyond the next three years: The NDG and earth systems science
Extension to the other NERC data centres, requires:
– online (or near-line) data.
– appropriate ingestion tools, appropriate mappings between discipline-specific metadata and generic metadata.
– GRID enabling data centres.
– Decisions about policy and access.
The NERC DataGrid
Bryan Lawrence, BADC
David Boyd, CLRC E-science
Kerstin Kleese, CLRC E-science
Roy Lowry, BODC
Dean Williams, PCMDI
Bob Drach, PCMDI
Mike Fiorino, PCMDI
Project Management
• Weekly workgroup meetings (teleconference and physical).
• Milestoning code and documentation reviews at quarterly intervals.
• Quarterly liaison with both US colleagues and other NERC projects (GODIVA, ClimatePrediction.com etc).
• Bi-Annual target-reprofiling.
• Professional project management at the code level:
  – Both RAL SSTD and RAL e-Science have considerable experience managing and delivering large software projects.
• Two key tenets of management philosophy:
  – Build early, build often.
  – Evolve from a working system.
The NDG: What will we do?
Key components: BADC/BODC
• Project Management.
• Ingestion tools for station data, Oracle database data, and other (e.g. PP - includes tools based on ESML and Marine XML).
• Format conversion tools within CDAT.
• Ingestion! Migrate NERC Metadata gateway to WSDL/SOAP (Zoom?).
Key components: CLRC e-science
• Globus Installation at all sites.
• Functional decomposition and interface definitions.
• Search database schema; search software Python API, wrappers.
• Database Population. Logical to Physical File Manager.
• Amalgamating search API into LAS (or successor), VCDAT, metadata gateway.
• Add data retrieval interfaces into metadata gateway.