using e-infrastructures for biodiversity conservation - module 4
Using e-Infrastructures for Biodiversity Conservation
Gianpaolo Coro ISTI-CNR, Pisa, Italy
Module 4 - Outline
1. Data processing requirements by communities of practice
2. The D4Science Statistical Manager
3. Ecological modelling
D4Science
D4Science is both a Data and a Computational e-Infrastructure
• Used by several Projects: i-Marine, EUBrazil OpenBio, ENVRI;
• Implements the notion of e-Infrastructure as-a-Service: it offers on demand access to data management services and computational facilities;
• Hosts several VREs for Fisheries Managers, Biologists, Statisticians…and Students.
D4Science - Resources
A large set of connected Biodiversity and Taxonomic Datasets
A Network to distribute and access Geospatial Data
A Distributed Storage System to store datasets and documents
A Social Network to share opinions and useful news
Algorithms for Biology-related experiments
Data Processing
1. Data processing requirements by communities of practice
2. The D4Science Statistical Manager
3. Ecological modelling
Some interests by communities of practice in Computational Statistics:
1. Repetition and validation of experiments
2. Exploitation of algorithms in several contexts
3. Hide the complexity of the calculations
4. Facilitate the management and the publication of the algorithms
Issues
…practically speaking, they search for:
1. Modular and pluggable solutions
2. Access by means of standard protocols
3. Hiding the complexity of parallel processing
4. Hiding the complexity of software management and provisioning
5. Active contribution with new algorithms and use cases
Issues
1. Data processing requirements by communities of practice
2. The D4Science Statistical Manager
3. Ecological modelling
The Statistical Manager is a set of web services that aim to:
• Help scientists in computational statistics experiments
• Supply ready-made, state-of-the-art algorithms as-a-Service
• Perform calculations using Map-Reduce in a way that is transparent to users
• Share input, results, parameters and comments with colleagues by means of Virtual Research Environments in the D4Science e-Infrastructure
Statistical Manager – Users’ View
[Figure: Statistical Manager, D4Science Computational Facilities, Sharing, Setup and execution]
Open Platform Approach
[Figure: External Computing Facility, OGC WPS Interface]
People can contribute with:
• R scripts
• Java programs
• Linux programs
• OGC-WPS services
The Statistical Manager allows developers to:
• Develop distributed computations in an easy way (Statistical Manager Framework)
• Parallelize R scripts, possibly without changing the code
• Automatically produce a User Interface to perform experiments
• Reuse models and best practices developed by the community
• Connect external computational facilities via the OGC WPS standard
Statistical Manager – Developers’ View
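As an illustration of the OGC WPS connection, here is a minimal sketch of how a client could build WPS 1.0.0 key-value requests. The endpoint URL, the process identifier and the input names are hypothetical placeholders, not actual D4Science addresses.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the WPS address of the facility you use.
WPS_ENDPOINT = "https://example.org/wps"


def wps_url(endpoint, request, identifier=None, inputs=None):
    """Build a WPS 1.0.0 key-value-pair request URL."""
    params = {"service": "WPS", "request": request}
    if request != "GetCapabilities":
        # DescribeProcess and Execute carry an explicit version.
        params["version"] = "1.0.0"
    if identifier is not None:
        params["identifier"] = identifier
    if inputs:
        # Execute inputs are encoded as name=value pairs separated by ';'.
        params["datainputs"] = ";".join(f"{k}={v}" for k, v in inputs.items())
    return endpoint + "?" + urlencode(params)


capabilities_url = wps_url(WPS_ENDPOINT, "GetCapabilities")
execute_url = wps_url(
    WPS_ENDPOINT,
    "Execute",
    identifier="org.example.DBSCAN",           # hypothetical process name
    inputs={"Epsilon": 10, "MinPoints": 2},    # hypothetical input names
)
print(capabilities_url)
print(execute_url)
```

A GetCapabilities call lists the available processes, DescribeProcess returns the inputs of one process, and Execute runs it.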
Architecture
Internal Work
The Context: Resources and Sharing
Statistical Manager - Interface
Experiment Execution
Computations Check
Summary of the Input, Output and Parameters of the experiment
Data Space - Sharing and Import
100 Hosted Algorithms
Numbers
FishBase (US, CA, TW)
Geomar
Naturhistoriska riksmuseet
Agrocampus
Anonymous Individuals
INRA
King Abdullah University of Science and Technology
ISTI
Users
Metric: 2013 / 2014
Avg Users per month: 200 / 20100
Number of Algorithms: 50 / 100
Number of contributing Organizations providing algorithms: 2 (CNR, Geomar) / 7 (CNR, Geomar, FIN, FAO, T2, IRD, Agrocampus)
Publications: 8 / 13
Sum Impact Factor: 2.66 / 12.17
2012
1. L. Candela, G. Coro, P. Pagano, "Supporting Tabular Data Characterization in a Large Scale Data Infrastructure by Lexical Matching Techniques", In M. Agosti et al. (Eds.): IRCDL 2012, Communications in Computer and Information Science Volume 354, pp. 21–32. Springer, Heidelberg (2012).
2013
2. R. Froese, J. Thorson, R. B. Reyes Jr. A Bayesian approach for estimating length-weight relationships in fishes. Journal of Applied Ichthyology. Volume 30, Issue 1, pages 78–85, 2013
3. G. Coro, P. Pagano, A. Ellenbroek, "Combining Simulated Expert Knowledge with Neural Networks to Produce Ecological Niche Models for Latimeria chalumnae", Ecological Modelling, DOI 10.1016/j.ecolmodel.2013.08.005, Ed. Elsevier.
4. G. Coro, L. Fortunati, P. Pagano. Deriving Fishing Monthly Effort and Caught Species from Vessel Trajectories. Oceans 2013, Proceedings of MTS/IEEE.
5. P. Pagano, G. Coro, D. Castelli, L. Candela, F. Sinibaldi, A. Manzi. Cloud Computing for Ecological Modeling in the D4Science Infrastructure. Proceedings of EGI Community Forum 2013.
6. D. Castelli, P. Pagano, G. Coro, F. Sinibaldi, "Modellazione della Nicchia Ecologica di Specie Marine (Marine Species Ecological Niche Modelling)". In "Le Tecnologie del CNR per il Mare" (CNR Marine Technologies) pp. 140, Ed. CNR (Roma, Italy).
7. D. Castelli, P. Pagano, G. Coro, "Variazioni Climatiche ed Effetto sulle Specie Marine (Climate Changes and Effect on Marine Species)". In "Le Tecnologie del CNR per il Mare" (CNR Marine Technologies) pp. 139, Ed. CNR (Roma, Italy).
8. D. Castelli, P. Pagano, G. Coro, "Elaborazione di Dati Trasmessi da Pescherecci (Processing of fishing vessel transmitted information)". In "Le Tecnologie del CNR per il Mare" (CNR Marine Technologies). pp. 133, Ed. CNR (Roma, Italy).
9. G. Coro, P. Pagano, A. Ellenbroek. Automatic Procedures to Assist in Manual Review of Marine Species Distribution Maps. To be published in M. Tomassini et al. (Eds.): International Conference on Adaptive and Natural Computing Algorithms (ICANNGA'13), Springer, Heidelberg (2013).
10. Candela L., Castelli D., Coro G., Pagano P., Sinibaldi F. Species distribution modeling in the cloud. In: Concurrency and Computation-Practice & Experience, Geoffrey C. Fox, David W. Walker (eds.). Wiley,
11. Appeltans W., Pissierssens P., Coro G., Italiano A., Pagano P., Ellenbroek A., Webb T. Trendylyzer: a long-term trend analysis on biogeographic data. In: Bollettino di Geofisica Teorica e Applicata: an International Journal of Earth Sciences, vol. 54 (Suppl.) pp. 203 - 205. Supplement: IMDIS 2013 - International Conference on Marine Data and Information Systems, 23-25 September, Lucca (Italy). OGS - Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, 2013.
12. Coro G., Gioia A., Pagano P., Candela L. A service for statistical analysis of marine data in a distributed e-infrastructure. In: Bollettino di Geofisica Teorica e Applicata: an International Journal of Earth Sciences, vol. 54 (Suppl.) pp. 68 - 70. Supplement: IMDIS 2013 - International Conference on Marine Data and Information Systems, 23-25 September, Lucca (Italy). OGS - Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, 2013.
13. Castelli D., Pagano P., Candela L., Coro G. The iMarine data bonanza: improving data discovery and management through a hybrid data infrastructure. In: Bollettino di Geofisica Teorica e Applicata: an International Journal of Earth Sciences, vol. 54 (Suppl.) pp. 105 - 107. Supplement: IMDIS 2013 - International Conference on Marine Data and Information Systems, 23-25 September, Lucca (Italy). OGS - Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, 2013.
14. Coro G. A Lightweight Guide on Gibbs Sampling and JAGS. Technical report, 2013.
15. Vanden Berghe E., Bailly N., Aldemita C., Fiorellato F., Coro G., Ellenbroek A., Pagano P. BiOnym - a flexible workflow approach to taxon name matching. In: TDWG 2013 - Taxonomic Database Working Group 2013 (Firenze, 28-31 October 2013).
16. Coro G., Pagano P., Candela L. Providing Statistical Algorithms as-a-Service. In: TDWG 2013 - Taxonomic Database Working Group 2013 (Firenze, 28-31 October 2013).
2014
17. Candela L., Castelli D., Coro G., De Faveri F., Italiano A., Lelii L., Mangiacrapa F., Marioli V., Pagano P. Integrating Species Occurrence Databases to Facilitate Data Analysis. Approved for the Ecological Informatics Journal, Elsevier 2014.
18. Froese R., Coro G., Kleisner K., Demirel N. Revisiting Safe Biological Limits in Fisheries. Submitted to the Fish and Fisheries Journal, Wiley 2014.
19. Coro G., Candela L., Pagano P., Italiano A., Liccardo L. Parallelising the Execution of Native Data Mining Algorithms for Computational Biology. Submitted to Concurrency and Computation-Practice & Experience, Wiley 2014.
20. Coro G., Pagano P., Ellenbroek A. Comparing Heterogeneous Distribution Maps for Marine Species. Submitted to GIScience & Remote Sensing, Taylor & Francis 2014.
2015
21. G. Coro, C. Magliozzi, A. Ellenbroek, P. Pagano, Improving data quality to build a robust distribution model for Architeuthis dux, Ecological Modelling, Volume 305, 10 June 2015, Pages 29-39, ISSN 0304-3800
22. G. Coro, C. Magliozzi, E. Vanden Berghe, N. Bailly, A. Ellenbroek, P. Pagano, Estimating absence locations of marine species from data of scientific surveys
23. R. Froese, N. Demirel, G. Coro, K. Kleisner, H. Winker, Estimating Fisheries Reference Points from Catch and Resilience
24. E. Vanden Berghe, N. Bailly, G. Coro, F. Fiorellato, C. Aldemita, A. Ellenbroek, P. Pagano. Retrieving taxa names from large biodiversity data collections using a flexible matching workflow
25. G. Coro, C. Magliozzi, A. Ellenbroek, K. Kaschner, P. Pagano. Automatic classification of climate change effects on marine species distributions in 2050 using the AquaMaps model
26. E. Trumpy, G. Coro, A. Manzella, P. Pagano, D. Castelli, P. Calcagno, A. Nador, T. Bragasson, S. Grellet. Building a European Geothermal Information Network using a
Publications around the Statistical Manager
1. Data processing requirements by communities of practice
2. The D4Science Statistical Manager
3. Ecological modelling
Niche Modelling
Scope:
• characterize the environmental conditions that are suitable for the species to subsist;
• identify where suitable environment is distributed in geographical space;
• estimate the actual and potential geographic distributions of a species.
Actual distribution: areas that are truly occupied by the species
Fundamental niche: the full range of abiotic conditions within which the species is viable
Potential distribution: areas with abiotic conditions that fall within the fundamental niche
Niche Modelling and Absence and Presence Points
Approaches:
Mechanistic models: incorporate physiological limits in a species' tolerance to environmental conditions;
Correlative models: automatically estimate the environmental conditions that are suitable for a species by relying on examples.
Presence points: occurrence records, i.e. places where the species has been observed in its habitat
Absence points: locations where the environment is considered unsuitable for the species. In many cases, absence points must be simulated (pseudo-absence points), because reliable data are rare.
Examples: Potential Distributions of the Coelacanth
Presence-only: MaxEnt
Presence-only: GARP
Expert (semi-Mechanistic): AquaMaps
Presence/Absence: Artificial Neural Networks
Comparison between several approaches estimating the potential distribution of the Coelacanth.
The best approach depends on the quality of the data. Thus, cleaning operations are very important!
C-squares (concise spatial query and representation system):
• A system of geocodes that provides a basis for simple spatial indexing of geographic features
• Devised by Tony Rees of CSIRO Marine and Atmospheric Research
• A compact encoding of Latitude and Longitude and Resolution
Example:
C-square code: 3414:227:3
Resolution: 0.5°
N,S,W,E limits: -42.5,-43.0,147.0,147.5
A useful converter: http://www.marine.csiro.au/marq/csq_builder.init
C-square codes
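To make the encoding concrete, here is a minimal sketch that converts between coordinates and 0.5° c-square codes. The function names are illustrative; the digit layout follows the published c-squares notation and reproduces the example code 3414:227:3.

```python
def encode_csquare(lat, lon):
    """Return the 0.5-degree c-square code of the cell containing (lat, lon)."""
    # Global quadrant digit: 1 = NE, 3 = SE, 5 = SW, 7 = NW.
    if lat >= 0:
        quadrant = 1 if lon >= 0 else 7
    else:
        quadrant = 3 if lon >= 0 else 5
    # Work on absolute values; keep the poles and the antimeridian in the grid.
    alat = min(abs(lat), 89.999999)
    alon = min(abs(lon), 179.999999)
    lat_tens, lat_units = divmod(int(alat), 10)
    lon_tens, lon_units = divmod(int(alon), 10)
    # Intermediate quadrant digit: 1..4, from low/high latitude and longitude.
    five_deg = 2 * (lat_units >= 5) + (lon_units >= 5) + 1
    lat_half = int(alat * 10) % 10 >= 5
    lon_half = int(alon * 10) % 10 >= 5
    half_deg = 2 * lat_half + lon_half + 1
    return (f"{quadrant}{lat_tens}{lon_tens:02d}:"
            f"{five_deg}{lat_units}{lon_units}:{half_deg}")


def decode_csquare(code):
    """Return the (north, south, west, east) limits of a 0.5-degree code."""
    first, second, third = code.split(":")
    quadrant = int(first[0])
    lat_tens, lon_tens = int(first[1]), int(first[2:])
    lat_units, lon_units = int(second[1]), int(second[2])
    half = int(third) - 1
    lat_lo = lat_tens * 10 + lat_units + 0.5 * (half // 2)
    lon_lo = lon_tens * 10 + lon_units + 0.5 * (half % 2)
    lat_a, lat_b = lat_lo, lat_lo + 0.5
    lon_a, lon_b = lon_lo, lon_lo + 0.5
    if quadrant in (3, 5):  # southern hemisphere: latitudes are negative
        lat_a, lat_b = -lat_b, -lat_a
    if quadrant in (5, 7):  # western hemisphere: longitudes are negative
        lon_a, lon_b = -lon_b, -lon_a
    return (lat_b, lat_a, lon_a, lon_b)


print(encode_csquare(-42.75, 147.25))
print(decode_csquare("3414:227:3"))
```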
Contains information on:
a) cell codes
b) statistical cell properties (center, limits, and area);
c) membership in relevant areas (FAO areas, EEZs or LMEs);
d) physical attributes (depth, salinity or temperature);
e) biological properties (e.g. primary production).
Data gathered from:
Sea Around Us Project
CSIRO
Kansas Geological Survey
Compiled by: Kristin Kaschner & Jonathan Ready
HCAF (Half-degree Cells Authority File)
Contains information used for describing the environmental tolerance and preference of a species:
• distribution using FAO areas and bounding box
• range of values per environmental parameter (min., preferred min., preferred max., max.)
HSPEN (Half-degree Species Environmental Envelope)
Online experiment: the i-Marine Filtering Facilities
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
A Niche model relying on expert knowledge
Contains the assignment of a species to a half-degree cell and the corresponding probability of occurrence of the species in that cell;
The assignment probability is the product of the probabilities computed for each of the environmental parameters (SST, salinity, prim. prod., sea ice concentration, distance to land).
HSPEC (Half-degree Species Assignment)
AquaMaps
Gadus morhua
A Presence-only species model that relies on expert knowledge about the species habitat
• AquaMaps Suitable: estimates the Potential Distribution
• AquaMaps Native: estimates the Actual Distribution
• Maps have 0.5 degrees resolution;
• Expert knowledge is used in modelling the habitat parameters;
• AquaMaps adopts mechanistic assumptions combined with an automatic estimation of parameter values.
• “good cells” - within bounding box or known FAO areas
• a minimum of 10 “good cells” is needed for extracting parameters
Bounding box or FAO area limits serve as independent verification of the validity of occurrence records.
AquaMaps – Good Cells
Taken from: http://www.aquamaps.org/main/presentations/Part%20II%20-%20AquaMaps%20behind%20the%20scene.pdf
Global grid of 259,200 half degree cells
Good cells are used to derive the range of environmental parameters within the species’ native range.
AquaMaps – Extracting Environmental Parameters
Taken from: http://www.aquamaps.org/main/presentations/AquaMaps_General0908.pdf
• Depth ranges: typically from literature; depth estimate based on habitat description
• Min = 25th percentile - 1.5 * interquartile or absolute minimum in extracted data (whichever is greater)
• Max = 75th percentile + 1.5 * interquartile or absolute maximum in extracted data (whichever is greater)
• PrefMin = 10th percentile of observed variation in an environmental parameter
• PrefMax = 90th percentile of observed variation in an environmental parameter
• Surface values for species with min depth ≤ 200m
• Bottom values for species with min depth > 200m
The environmental envelopes describe tolerances of a species with respect to each environmental parameter.
AquaMaps – Environmental Envelopes
Taken from: http://www.aquamaps.org/main/presentations/AquaMaps_General0908.pdf
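The extraction rules above can be sketched as follows. This is an illustration, not the AquaMaps code: it follows the bullet wording literally (including "whichever is greater" for both bounds, which should be checked against the AquaMaps documentation), the percentile interpolation method is an assumption, and the function names are hypothetical.

```python
def percentile(sorted_values, p):
    """Percentile p (0-100) of a sorted list, with linear interpolation."""
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def extract_envelope(good_cell_values):
    """Derive (min, pref_min, pref_max, max) for one environmental parameter.

    good_cell_values: parameter values observed in the species' "good cells".
    """
    if len(good_cell_values) < 10:
        raise ValueError("at least 10 good cells are needed")
    v = sorted(good_cell_values)
    q25, q75 = percentile(v, 25), percentile(v, 75)
    iqr = q75 - q25
    return {
        # As worded on the slide: the greater of the two candidates.
        "min": max(q25 - 1.5 * iqr, v[0]),
        "pref_min": percentile(v, 10),
        "pref_max": percentile(v, 90),
        "max": max(q75 + 1.5 * iqr, v[-1]),
    }


# Hypothetical sea surface temperatures (degrees C) from 11 good cells.
sst_envelope = extract_envelope([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(sst_envelope)
```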
[Figure: environmental envelope. X axis: Predictor, marked at Min, Preferred min, Preferred max and Max. Y axis: Relative probability of occurrence, with peak value PMax.]
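$P_c = P_{\mathrm{bathymetry},c} \times P_{\mathrm{SST},c} \times P_{\mathrm{salinity},c} \times P_{\mathrm{chl\,a},c} \times P_{\mathrm{IceDist},c} \times P_{\mathrm{LandDist},c}$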
Probabilities of species occurrence are generated by matching the species environmental envelope against local environmental conditions to determine relative suitability of a given area.
Probability of Occurrence
AquaMaps – Environmental Envelopes
Taken from: http://www.aquamaps.org/main/presentations/AquaMaps_General0908.pdf
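A minimal sketch of how the envelope and the product formula could combine into a cell probability. It assumes PMax = 1 and linear ramps between Min and Preferred min and between Preferred max and Max, which is one reading of the envelope figure. The envelope values and parameter names below are made up for illustration.

```python
from math import prod


def relative_probability(value, envelope):
    """Envelope response for one parameter.

    envelope: (min, pref_min, pref_max, max). Returns 0 outside [min, max],
    1 inside the preferred range, and a linear ramp in between (assumption).
    """
    mn, pref_mn, pref_mx, mx = envelope
    if value < mn or value > mx:
        return 0.0
    if value < pref_mn:
        return (value - mn) / (pref_mn - mn)
    if value > pref_mx:
        return (mx - value) / (mx - pref_mx)
    return 1.0


def cell_probability(cell, envelopes):
    """Multiply the per-parameter probabilities for one half-degree cell."""
    return prod(relative_probability(cell[name], env)
                for name, env in envelopes.items())


# Hypothetical envelopes for two of the parameters.
envelopes = {
    "sst": (0.0, 5.0, 15.0, 20.0),
    "salinity": (30.0, 33.0, 36.0, 38.0),
}
p_suitable = cell_probability({"sst": 2.5, "salinity": 34.0}, envelopes)
p_too_warm = cell_probability({"sst": 25.0, "salinity": 34.0}, envelopes)
print(p_suitable, p_too_warm)
```

Because the terms are multiplied, a single parameter outside its tolerance range sets the whole cell probability to zero.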
The probability is calculated for each 0.5° cell in the oceans.
A color is associated with each probability value.
AquaMaps – Probability
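$P_c = P_{\mathrm{bathymetry},c} \times P_{\mathrm{SST},c} \times P_{\mathrm{salinity},c} \times P_{\mathrm{chl\,a},c} \times P_{\mathrm{IceDist},c} \times P_{\mathrm{LandDist},c}$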
Online experiment: AquaMaps
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
What if Expert Knowledge was missing?
Artificial Neural Network
Presence/Absence Points examples
Probability (1/ 0)
• Learns from positive (presence) and negative (absence) examples (training mode);
• Adapts the network weights to produce the correct outputs on the examples;
• Produces probability values for new input (test mode).
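The training and test modes can be illustrated with the smallest possible network, a single sigmoid neuron trained by gradient descent. This is a deliberately simplified sketch, not the network used on the platform; the two normalized features and all example values are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Test mode: probability of presence for one feature vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(examples, epochs=2000, learning_rate=0.5):
    """Training mode: adapt the weights to reproduce the example labels."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical examples: (normalized temperature, normalized depth) -> 1/0.
EXAMPLES = [
    ((0.8, 0.2), 1), ((0.9, 0.1), 1), ((0.7, 0.3), 1),  # presence points
    ((0.2, 0.8), 0), ((0.1, 0.9), 0), ((0.3, 0.7), 0),  # absence points
]
weights, bias = train(EXAMPLES)
print(predict(weights, bias, (0.85, 0.15)))  # close to 1: suitable
print(predict(weights, bias, (0.15, 0.85)))  # close to 0: unsuitable
```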
Artificial Neural Networks Maps
Examples and Exercises: AquaMaps - Neural Networks
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
Climate change analysis
• HCAF Scenarios can be simulated by means of interpolation.
• Interpolation produces half-degree values between a start and an end date
• Once new HCAFs are available we can produce an HSPEC for each HCAF
Simulation of HCAF Scenarios
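A minimal sketch of the scenario simulation, assuming linear interpolation (the slides do not state which interpolation is used). The cell code, parameter name, years and values are hypothetical.

```python
def interpolate_hcaf(start, end, start_year, end_year, year):
    """Return an HCAF-like table for `year` between two known scenarios.

    start, end: {cell_code: {parameter: value}} for the start and end dates.
    Values are interpolated linearly (assumption) cell by cell.
    """
    t = (year - start_year) / (end_year - start_year)
    return {
        cell: {name: value + t * (end[cell][name] - value)
               for name, value in params.items()}
        for cell, params in start.items()
    }


hcaf_start = {"3414:227:3": {"sst": 12.0}}
hcaf_end = {"3414:227:3": {"sst": 14.0}}
hcaf_mid = interpolate_hcaf(hcaf_start, hcaf_end, 2010, 2050, 2030)
print(hcaf_mid)
```

Each interpolated HCAF can then be fed to the niche model to produce one HSPEC per date.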
Climate Change Effects on Species
Estimated impact of climate change over 20 years on 11549 species.
Bioclimate HSpec
Overall occupancy in time
Online experiment: BioClimate Analysis
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
Grouping the occurrence points and the environmental features of different species
• Group points by spatial distance or density
• Detect outliers
Occurrence Points Clustering
DBScan acts on the point density
Parameters:
• Epsilon = 10
• Min Points = 2
Outliers
Density Clustering
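To show how density clustering flags outliers, here is a compact pure-Python DBSCAN sketch using the two parameters from the slide. It is not the platform's implementation: plain Euclidean distance on the coordinates is an assumption, and the points are invented.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep expanding
    return labels


# Hypothetical occurrence points (longitude, latitude).
occurrences = [(0, 0), (1, 1), (2, 0), (50, 50), (51, 51), (200, 200)]
labels = dbscan(occurrences, eps=10, min_pts=2)
print(labels)
```

Points labelled -1 belong to no dense region and are reported as outliers. Distance-based methods such as KMeans assign every point to a cluster, which would explain the "No Outliers Detected!" results.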
XMeans
K = [20,30]
Min Points = 2
MaxIter = 1000
KMeans
K = 24
Min Points = 2
MaxIter = 1000
MaxOptSteps = 1000
No Outliers Detected!
No Outliers Detected!
Distance Clustering
Online experiment: Clustering
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
Discovering similarities among habitats
Similarity between habitats
Habitat Representativeness Score:
• Measures the degree to which sampled habitats are representative of a certain area of study;
• Has been used for assessing the minimum number of surveys on a study area that are needed to cover a good heterogeneity of species habitat variables.
Can be used to:
• Measure the similarity between the environmental features of two areas;
• Assess the quality of models and environmental features.
HRS = 10.6
Habitat Representativeness Score
A+P HRS 10.58
P HRS 10.61
Habitat Representativeness Score
Absence
Presence
The HRS is too high -> all the maps can be unreliable and need expert validation
HRS is in [0;2] for each feature
The overall HRS is the sum of the HRSs of the environmental features
Habitat Representativeness Score for each Feature
HRS 10.58
mean depth in t.c. 1.90
max depth in t.c. 0.87
min depth in t.c. 0.04
mean annual sea surface temp 1.19
mean annual sea bottom temp 1.59
mean salinity in t.c. 1.23
mean bottom salinity in t.c. 0.44
mean primary production 0.61
annual ice concentration 0.71
distance from land 0.46
ocean area in t.c. 1.54
Presence, Absence
HRS 10.61
mean depth in t.c. 1.92
max depth in t.c. 0.86
min depth in t.c. 0.04
mean annual sea surface temp 1.13
mean annual sea bottom temp 1.56
mean salinity in t.c. 1.29
mean bottom salinity in t.c. 0.34
mean primary production 0.64
annual ice concentration 0.78
distance from land 0.49
ocean area in t.c. 1.55
The most representative feature is the minimum depth in a cell of 0.5 degrees
Presence only
Even in this case the most representative feature is the minimum depth in a cell of 0.5 degrees
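As a worked check of the arithmetic, the sketch below sums the per-feature scores of the Presence, Absence case and picks the feature with the lowest score. Reading "lowest score" as "most representative" is inferred from the slide's statement about minimum depth.

```python
# Per-feature HRS values from the Presence, Absence case.
hrs_by_feature = {
    "mean depth in t.c.": 1.90,
    "max depth in t.c.": 0.87,
    "min depth in t.c.": 0.04,
    "mean annual sea surface temp": 1.19,
    "mean annual sea bottom temp": 1.59,
    "mean salinity in t.c.": 1.23,
    "mean bottom salinity in t.c.": 0.44,
    "mean primary production": 0.61,
    "annual ice concentration": 0.71,
    "distance from land": 0.46,
    "ocean area in t.c.": 1.54,
}

# Each feature contributes a score in [0, 2]; the overall HRS is their sum.
overall_hrs = sum(hrs_by_feature.values())
most_representative = min(hrs_by_feature, key=hrs_by_feature.get)

print(f"Overall HRS: {overall_hrs:.2f}")
print(f"Most representative feature: {most_representative}")
```

With 11 features the overall score ranges from 0 to 22, so a total of 10.58 sits near the middle of the scale.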
Online experiment: Habitat Representativeness Score
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
Retrieving taxonomic information for a set of species
BiOnym
A workflow approach to taxon name matching.
Accounts for:
• Variations in the spelling and interpretation of taxonomic names
• Combination of data from different sources
• Harmonization and reconciliation of Taxa names
[Figure: BiOnym workflow. Raw Input String (e.g. Gadus morua Lineus 1758), Preprocessing and Parsing, Taxon Matcher 1, Taxon Matcher 2, …, Taxon Matcher n, PostProcessing, Correct Transcriptions (e.g. Gadus morhua (Linnaeus, 1758)). Reference Sources: ASFIS, FISHBASE, WoRMS, Other in DwC-A.]
[Figure: matcher steps GSAy, GSAY, GSrAy, GSrAY, GSA, with labels: Complete match, Parentheses issue, Gender agreement issues, Gender agreement and parentheses issues, Year issues]
Step Rate
GSAy 950
GSAY 940
GSrAy 930
GSrAY 920
GSA 910
GSrA 900
GSY 890
GSrY 880
SAy 870
SAY 860
SrAy 850
SrAY 840
GAy 830
GAY 820
…
Matcher Example - GSAy
[Figure: matcher steps GSY, GS, SrAy, GAY, Rest, with labels: Author issues, misspelling or wrong; Homonyms; Other combinations; Taxamatch; Visual check]
Step Rate
GSY 950
GSAY 940
GSrAy 930
GSrAY 920
GSA 910
GSrA 900
GSY 890
GSrY 880
SAy 870
SAY 860
SrAy 850
SrAY 840
GAy 830
GAY 820
…
Matcher Example - GSAy
BiOnym - Output
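The idea of a chain of matchers can be sketched as below. This is not BiOnym's code: the fuzzy step uses Python's standard difflib as a stand-in for the real matchers, the rule that the first matcher with a hit wins is an assumption, and the tiny reference list is invented.

```python
from difflib import get_close_matches

# Hypothetical reference source; real ones would be ASFIS, FishBase, WoRMS...
REFERENCE = ["Gadus morhua", "Gadus macrocephalus", "Latimeria chalumnae"]


def exact_matcher(name, reference):
    """Case-insensitive exact match."""
    return [r for r in reference if r.lower() == name.lower()]


def fuzzy_matcher(name, reference, cutoff=0.8):
    """Approximate match on spelling; best candidates first."""
    return get_close_matches(name, reference, n=3, cutoff=cutoff)


def match_taxon(name, reference, matchers):
    """Run the matchers in order and return the first non-empty result."""
    for matcher in matchers:
        hits = matcher(name, reference)
        if hits:
            return hits
    return []


MATCHERS = [exact_matcher, fuzzy_matcher]
print(match_taxon("Gadus morua", REFERENCE, MATCHERS))   # misspelled input
print(match_taxon("gadus morhua", REFERENCE, MATCHERS))  # exact, other case
```

Running the cheap, strict matcher first and the permissive one last keeps false positives low for names that are already correct.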
Online experiment: BiOnym
https://i-marine.d4science.org/group/biodiversitylab/processing-tools