The MashMyData project – Combining and comparing environmental science data on the web

Background and Overview

Environmental scientists use highly diverse sources of data, including in situ measurements, remotely-sensed information and the results of numerical simulations. The ability to access, visualize, combine and compare these datasets is at the core of scientific investigation (model validation, quality controlling of observations, data assimilation, ground-truthing of satellite datasets, etc.), but such tasks have hitherto been very difficult or impossible due to a fundamental lack of harmonization of data products. As a result, much valuable data remains underused.

Currently, there are many web portals available which focus on visualizing environmental data (e.g. the Reading e-Science Centre’s “Godiva2” portal [1]), but for comparing datasets and carrying out analyses users turn to desktop applications such as Matlab and IDL. These are powerful, but require users to get to grips with scripting and technical details. Until now, a number of factors have inhibited the transfer of basic data analysis and comparison from desktop applications to web portals. These include the size of the datasets involved, their diversity of format and location, and security constraints.

MashMyData is a one-year project under the NERC ‘Technology Proof of Concept’ programme which aims to address these needs by creating a system that allows environmental scientists to compare and combine diverse datasets over the web without the need to understand the low-level technical details of the data’s format or physical location. Users will be able to upload their own data and compare it with professionally-curated datasets held in data centres, with data privacy respected at all levels.

Technical challenges

The technical challenges involved in the MashMyData project overlap considerably with important challenges in the wider e-Science community. They are key problems for the environmental informatics community in particular, and their solutions will be widely applicable in future. They can be broken down as follows:

Dealing with data diversity. Data resides in different formats in different physical locations, accessed via different web service protocols. Furthermore, users can upload data in a variety of formats. It is very important to avoid writing specific data processing and visualization code for each individual dataset. The various datasets must therefore be exposed to the rest of the system in a consistent fashion. We have developed a Java implementation of the data model defined by the Climate Science Modelling Language (CSML), which applies international standards to describe a very large proportion of environmental science data (http://ndg.nerc.ac.uk/csml/). The key is that CSML uses a small number of “feature types” to model a large number of datasets. (Feature types are based on the data’s geometry and include grids, vertical profiles, timeseries, trajectories and points.) All visualization and analysis routines then operate upon these feature types, without knowledge of how or where the underlying data are stored.
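
The listing below is a minimal Java sketch of the feature-type abstraction just described. The interface and method names are illustrative assumptions, not the actual CSML Java API; the point is that analysis code is written once against geometry-based feature types and never sees file formats or physical locations.

    // Illustrative sketch only: these interfaces are hypothetical, not the CSML API.
    import java.util.List;

    /** A feature hides the format and physical location of the underlying data. */
    interface Feature {
        String getName();
        List<String> getParameterNames();   // e.g. "sea_water_temperature"
    }

    /** A gridded feature (e.g. model output) exposes values on a lon/lat grid. */
    interface GridFeature extends Feature {
        double readValue(String parameter, double lon, double lat, String isoTime);
    }

    /** A point timeseries feature (e.g. an in-situ instrument at a fixed site). */
    interface PointSeriesFeature extends Feature {
        double getLongitude();
        double getLatitude();
        List<Double> readTimeseries(String parameter, String isoStart, String isoEnd);
    }

    /** Analysis code works only against the feature types above, regardless of
     *  whether the data sit in a local file or behind an OPeNDAP or WPS service. */
    class FeatureComparison {
        static double modelMinusObservation(GridFeature model, PointSeriesFeature obs,
                                            String parameter, String isoTime) {
            double modelValue = model.readValue(parameter, obs.getLongitude(),
                                                obs.getLatitude(), isoTime);
            double obsValue = obs.readTimeseries(parameter, isoTime, isoTime).get(0);
            return modelValue - obsValue;
        }
    }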

Data collocation is a key requirement, and it is very difficult to achieve without standard means of representing geography and time. This is where adoption of open GIS standards really pays off for scientists. By providing an attractive tool that allows comparison of diverse data, we further encourage scientists to adopt these standards.

Accessing secure data, and the delegation problem. The MashMyData server accesses and processes data on behalf of the logged-on user. This requires that the user be able to delegate his or her authority to the web portal server. We are exploiting the considerable work done by the NERC DataGrid (NDG) team at CEDA on secure data access services. The NDG team has developed and deployed solutions with the functionality required in MashMyData, and is in the process of extending these to interact with authentication and authorisation paradigms from 1) the U.S. Earth System Grid (www.earthsystemgrid.org), and 2) the UK Shibboleth identity providers. This ensures future compatibility with many secure data systems.
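
The sketch below illustrates the delegation pattern in the abstract: the portal obtains a short-lived credential that lets it act for the user, and presents that credential (rather than its own identity) to the secure data service. This is a conceptual sketch only; the class names and flow are assumptions for illustration and do not describe the NDG/CEDA security implementation.

    // Conceptual sketch of delegation: all names here are hypothetical and
    // do NOT describe the actual NDG/CEDA security services.
    interface CredentialService {
        /** Issue a short-lived credential allowing the portal to act for this user. */
        String issueDelegatedToken(String userId, String portalId, int lifetimeSeconds);
    }

    class SecureDataClient {
        private final String delegatedToken;

        SecureDataClient(String delegatedToken) {
            this.delegatedToken = delegatedToken;
        }

        /** The portal presents the user's delegated credential, so the remote
         *  data service enforces the user's own access rights, not the portal's. */
        String buildRequestHeader() {
            return "Authorization: Bearer " + delegatedToken;   // illustrative only
        }
    }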

Performing calculations remotely in a way that scales. In order to ensure future scalability, and to avoid large data transfers where possible, MashMyData demonstrates the processing of data on remote compute servers that are close to the data stores. We have employed the OGC Web Processing Service (WPS) as the interface to the remote compute servers. There is much current community interest in the use of WPS for this purpose, although the technology has rarely been employed in the environmental sciences. This work builds upon previous CEDA experience with the Defra-sponsored UK Climate Impacts Programme. One can’t always move large datasets wholesale, and it is not always possible to move the computation to the data – hence data subsetting services such as OPeNDAP are also important in the project.
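
As a rough illustration of how a client drives such a remote computation, the snippet below issues a WPS 1.0.0 Execute request using the standard key-value-pair encoding. The endpoint URL, process identifier and input names are placeholders rather than the actual CEDA service; only the WPS request parameters themselves come from the OGC specification.

    // Minimal sketch of invoking a remote computation via an OGC WPS 1.0.0
    // Execute request (KVP encoding). Endpoint, process identifier and inputs
    // are placeholders, not the actual CEDA service.
    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class WpsExecuteExample {
        public static void main(String[] args) throws Exception {
            String endpoint = "https://example.org/wps";                  // placeholder
            String process  = "SubsetAndRegrid";                          // hypothetical process
            String inputs   = "dataset=sst_model;bbox=-10,45,5,60;time=2010-06-01";

            String url = endpoint
                    + "?service=WPS&version=1.0.0&request=Execute"
                    + "&identifier=" + URLEncoder.encode(process, StandardCharsets.UTF_8)
                    + "&datainputs=" + URLEncoder.encode(inputs, StandardCharsets.UTF_8)
                    + "&storeExecuteResponse=true&status=true";           // run asynchronously

            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());

            // The XML ExecuteResponse contains a status location that the portal
            // can poll until the computation, running close to the data, finishes.
            System.out.println(response.body());
        }
    }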

[1] Jon Blower, Keith Haines, Adit Santokhee, Chunlei Liu, “Godiva2: Interactive visualization of environmental data on the web”, Phil. Trans. Roy. Soc. A, 367, 1035–1039, 2009.


Enabling traceability and reproducibility. These are key requirements in the scientific domain, and are often poorly considered in e-Science projects. It is extremely important for scientists to be able to trace the flow of data throughout their work and to be able to reproduce results and workflows. This is exactly the sort of consideration which has gone into the design of Newcastle University’s ‘e-Science Central’ software, which we are employing in MashMyData.

Prototype solution

In providing an initial prototype solution as part of this NERC proof-of-concept project, we are tackling two specific use cases, which are described in the following section. One of the main achievements of the project is to effectively integrate a number of disparate web services and technologies into one coherent framework. The details of this framework are invisible to the user, who simply sees the web portal through which they carry out operations and view the results. The project team view this model of service integration through standards-based technology as extremely important to the future of e-Science on the web. The web services and technologies which we are harnessing include:

• Newcastle University’s e-Science Central software (upload, data storage, workflows)
• University of Liege’s DIVA-on-web service (interpolating geospatial point data)
• Reading e-Science Centre’s ncWMS/Godiva2 Web Map Service (displaying gridded geospatial NetCDF data)
• The Centre for Environmental Data Archival (CEDA) Web Processing Service (number crunching for compute-intensive workflows)
• A PostgreSQL/PostGIS database, used to store all the metadata. This allows for rapid processing of the datasets in determining overlaps in space and time during mash-ups (a minimal query sketch follows this list).
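
As an illustration of that overlap determination, the sketch below queries a hypothetical PostGIS metadata table (dataset_metadata, with a geometry column footprint and start_time/end_time columns) for datasets whose spatial footprints and time ranges intersect those of a chosen dataset. The schema, connection details and dataset identifiers are assumptions for illustration; only the general approach and the standard PostGIS ST_Intersects function are taken as given.

    // Minimal sketch of a spatio-temporal overlap check against a hypothetical
    // PostgreSQL/PostGIS metadata table; not the actual MashMyData schema.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class OverlapQueryExample {
        public static void main(String[] args) throws Exception {
            String sql =
                "SELECT b.dataset_id " +
                "FROM dataset_metadata a, dataset_metadata b " +
                "WHERE a.dataset_id = ? " +
                "  AND b.dataset_id <> a.dataset_id " +
                "  AND ST_Intersects(a.footprint, b.footprint) " +   // spatial overlap
                "  AND a.start_time <= b.end_time " +                // temporal overlap
                "  AND b.start_time <= a.end_time";

            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/mashmydata", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, "sst_model");   // find datasets overlapping this one
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println("Overlapping dataset: " + rs.getString(1));
                    }
                }
            }
        }
    }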

The relationship between these components is illustrated in Figure 1, which shows the project architecture. The user’s requests are handled by the MashMyData web application, which acts as a custom interface to e-Science Central. The latter is used to coordinate the various other tools and services.

Use cases

The project has two test cases in the environmental sciences, chosen to clearly demonstrate the usefulness of the project outcomes to this community. Much consideration is also given to a more general environmental science usage scenario, in order to ensure that the project provides solid standards-based foundations for future work in this area. The project team deemed this extremely important in order to avoid the final project outcome being ‘only’ a proof-of-concept demonstrator which could not be readily extended.

The first test case concerns the matching and intercomparison between ocean temperature proxy data, obtained from the study of coccolithophore remains on the sea floor, and direct measurements of ocean temperature from in-situ instruments. This important new research has the potential to link fossil fuel emissions to ocean temperature changes via coccolithophore physiology. The second test case is in atmospheric research being undertaken in the newly-formed National Centre for Earth Observation (NCEO). We are collaborating with scientists at the University of Reading who are working on the quantification of precipitation forecast uncertainty in southern England using ensemble techniques. This research is essential for improving prediction of rainfall, including extreme events such as that which caused the Boscastle floods of 2004.

The system works in the following way from the user’s perspective:

1. User logs onto the web portal (using their account if they have one, or as a guest).
2. User opts to load a dataset. They are given a choice of a) available community datasets, b) any secure datasets to which they have access rights, and c) their own datasets which they have uploaded.
3. The dataset is loaded in the map panel. This shows a colour-contoured image in the case of gridded datasets (e.g. ocean model), or a selection of markers in the case of a point-based dataset.
4. The user may click the dataset to query it, e.g. click the model image to reveal the value (or a timeseries, if a date range has been selected) at that point, or click a marker to reveal the value (or timeseries) from that observation.
5. A second layer may be added by the user; they are offered a choice of available data products as in step 2.
6. Now if the user queries a point in space or a marker as per step 4, they are shown a pair of values, or a timeseries containing the values of both datasets.
7. The user may click a button labelled ‘MASH!’, in which case they are offered a number of possible mash-up workflows depending on the nature of the layers which are currently loaded.
8. The mash-up workflow results are returned. This could be a model layer showing the difference between, or average of, two other layers, or a scatter plot of values from dataset one against dataset two, or an interpolated field based on one or more sets of point observations (a minimal sketch of such a layer-difference computation follows this list).
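
As an illustration of the simplest of these mash-ups, the sketch below computes the cell-wise difference of two gridded layers that are assumed to have already been collocated onto a common grid. This is a standalone illustration under that assumption; in the real system such a step would run within an e-Science Central / WPS workflow rather than in portal code.

    // Minimal sketch of one possible mash-up: the cell-wise difference of two
    // gridded layers already collocated onto a common grid (assumed here).
    public class LayerDifferenceExample {

        /** Returns layerA - layerB, propagating NaN where either layer has no data. */
        static double[][] difference(double[][] layerA, double[][] layerB) {
            double[][] result = new double[layerA.length][];
            for (int j = 0; j < layerA.length; j++) {
                result[j] = new double[layerA[j].length];
                for (int i = 0; i < layerA[j].length; i++) {
                    double a = layerA[j][i];
                    double b = layerB[j][i];
                    result[j][i] = (Double.isNaN(a) || Double.isNaN(b)) ? Double.NaN : a - b;
                }
            }
            return result;
        }

        public static void main(String[] args) {
            // Toy 2x2 grids: a "model" layer and an "observation" layer.
            double[][] model = {{15.2, 15.4}, {16.0, Double.NaN}};
            double[][] obs   = {{15.0, 15.5}, {15.8, 16.1}};
            double[][] diff  = difference(model, obs);
            // Prints the difference grid; NaN marks cells with missing data.
            System.out.println(java.util.Arrays.deepToString(diff));
        }
    }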


Fig. 1. MashMyData architecture.