informatics and computational challenges for satellite monitoring of global biodiversity mark...

Informatics and Computational Challenges for Satellite Monitoring of Global Biodiversity. Mark Schildhauer (NCEAS, UCSB) and Ryan Pavlick (NASA/JPL). NASA/NCEAS Workshop,

TRANSCRIPT

  • Slide 1
  • Informatics and Computational Challenges for Satellite Monitoring of Global Biodiversity. Mark Schildhauer (NCEAS, UCSB) and Ryan Pavlick (NASA/JPL). NASA/NCEAS Workshop, Dec 10, 2014
  • Slide 2
  • Analytical challenges. Ecology and biodiversity sciences are inherently multi-disciplinary (bio + earth). Critical, societally relevant environmental questions are typically not local but regional, if not global, in scale. Analyses become far more robust and efficient with faster access to a wider range and larger volumes of DATA. From: Halpern et al., A Global Map of Human Impact on Marine Ecosystems, Science, 15 February 2008, DOI: 10.1126/science.1149345
  • Slide 3
  • Good news: more and more data. There is a growing deluge of environmental data to assist in these investigations.
  • Slide 4
  • Fundamental problem: Big Data. Ecological/biodiversity data are Big Data: globally distributed, voluminous, highly heterogeneous in structure and content, and rapidly growing! I.e., the 3 Vs: Volume, Variety, Velocity.
  • Slide 5
  • Informatics challenges. Discovering and integrating data across scales: micro (and nano) to global. Aligning heterogeneous schema and themes: land-use/land-cover, geology, soils, atmosphere, hydrology, oceanography; genes to ecosystems; human sciences (culture & traditions, demographics, economics, governance). Dealing with volume, TB to PB ++: satellite images; sensors (aerial and ground-based); observational data. Access and storage: even GBs are problems at the desktop!!! Documenting effects of climate change on forest composition. Large amounts of relevant data: e.g., over 25,000 data sets are available in the Knowledge Network for Biocomplexity repository (KNB, http://knb.ecoinormatic.org).
  • Slide 6
  • Environmental data: the status quo. Distributed: stewarded by many groups and individuals. Under-documented: sparsely and inconsistently documented; jargon and acronyms; critical details about data in natural language (journals, white papers). Inaccessible: varying degrees and mechanisms of presentation via FTP, Web, etc. Heterogeneous: broad range of relevant topics (semantics), lots of different data formats (structure), data access protocols (syntax), data models, etc.
  • Slide 7
  • Data collected by thousands of trained field scientists provide invaluable on-the-ground, in situ information: fine-grain detail on biodiversity, but with highly idiosyncratic approaches to methods and naming of measurements. Also, there is the "long tail of dark data"* in ecology/biodiversity sciences. * Heidorn, P. Bryan. 2008. DOI: 10.1353/lib.0.0036
  • Slide 8
  • Ground-truthing & observational data: AGGREGATORS are KEY. Plant occurrences and vegetation plots: BIEN, Turboveg, sPlot, CTFS, GBIF, natural history museum collections, Map of Life. Plant functional traits (PFT): TRY, BIEN.
  • Slide 9
  • Ground-truthing & observational data: AGGREGATORS are KEY. Sensor data: NEON. Genomic data: iPlant. Remote sensing and global climate data: NASA DAACs, IPCC... others... and MANY independent, dark-tail, in situ data sets.
  • Slide 10
  • Several existing resources. eScience 2010
  • Slide 11
  • Geospatial data: need better discovery of, access to, and integration of remote-sensing data with ground-truthing and observational (in situ) data!!
  • Slide 12
  • Informatics challenges: Preservation; Discovery/Integration; Attribution.
  • Slide 13
  • For Preservation: ARCHIVES. Archives should be permanent, reliable, powerful, comprehensive, AND useful (usable). NSF DataNet program: data stewardship and interoperability; exploring models for sustainability; federating major earth science data archives; distributed framework (shared responsibility); API (new groups can participate, and are welcome!); data, metadata, ontologies.
  • Slide 14
  • For Discovery, Integration: SHARED KNOWLEDGE MODELS. Consistency and rigor in terminology. Standardized protocols and methods when possible. Semantics approaches: ontologies for terminologies; ontologies to describe data schemas; machine-assisted discovery, reasoning, integration.
  • Slide 15
  • Environmental data: the status quo. Under-documented: sparsely and inconsistently documented; jargon and acronyms; critical details about data in natural language (journals, white papers). Measurements: MAT, MAR, LL, LMA, LNA, PET, PLNTHT, VPD, VSWIR. Techniques: SMLR, PSLR. Projects and models: LOPEX, ACCP, PROSPECT. Instruments: AVIRIS, CASI, HyspIRI.
  • Slide 16
  • Semantics approaches to support machine-processing of data. Advance consistency and rigor in terminology and data descriptions. Standardize protocols and methods when possible. Development tasks: ontologies for domains; ontologies to describe data schemas; mechanisms to bind data with knowledge models; machine-assisted discovery, reasoning, integration; observational data model as foundational template.
  • Slide 17
  • Metadata-based data integration. Metadata standards are a step in the right direction. Expose data in a standard schema for transfer: Dublin Core; ISO 19115 (geospatial metadata) and OGC; Darwin Core (biodiversity specimen metadata); EML (Ecological Metadata Language); GeoSciML. Can map one format to another to resolve minor differences (but this gets arduous). And these still allow for terminological inconsistencies, and don't support hierarchy, synonymy, or complex relationships well.
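The format-to-format mapping mentioned on this slide can be sketched as a simple field crosswalk. This is only an illustration under stated assumptions: the left-hand column names (`species`, `lat`, `lon`, `date`) and the sample record are invented, while the right-hand names are Darwin Core terms. The function name `to_darwin_core` is hypothetical and not part of any library.

```python
# Hypothetical crosswalk from one lab's column names to Darwin Core terms.
# The keys are invented for illustration; the values are Darwin Core terms.
LOCAL_TO_DWC = {
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "eventDate",
}


def to_darwin_core(record, crosswalk=LOCAL_TO_DWC):
    """Rename known fields; return (mapped, unmapped) dictionaries."""
    mapped, unmapped = {}, {}
    for key, value in record.items():
        if key in crosswalk:
            mapped[crosswalk[key]] = value
        else:
            # Fields without a crosswalk entry are kept aside, not dropped.
            unmapped[key] = value
    return mapped, unmapped


# An invented field record; "PLNTHT" has no entry in the crosswalk.
record = {
    "species": "Quercus agrifolia",
    "lat": 34.41,
    "lon": -119.85,
    "date": "2014-12-10",
    "PLNTHT": 12.5,
}
mapped, unmapped = to_darwin_core(record)
print(mapped)
print(unmapped)
```

The unmapped `PLNTHT` field shows why this gets arduous: every local acronym needs a hand-written entry, and a flat rename says nothing about synonyms or hierarchy.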
  • Slide 18
  • Semantic data integration. W3C semantic standards (RDF, OWL) provide greater expressivity, formalization, enhanced search, and reasoning: class/subclass subsumption; axioms/properties: reflexivity, transitivity, domain/range.
  • Slide 19
  • Simple Darwin Core (2013). [Diagram: classes dwc:Occurrence, dwc:Event, dcterms:Location, dwc:Identification, dwc:Taxon, dwc:MaterialSample; relations detected_during, to_taxon, happened_at, basis_for, documented_by, derived_from.] Can formalize in RDF: leads to greater clarity of how concepts are related; conversion to triple format can enable basic graph traversal.
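A minimal sketch of what "conversion to triple format can enable basic graph traversal" might look like, using plain Python tuples rather than an RDF library. The instance names (`occ1`, `event1`, and so on) are invented, and the direction of each relation is an assumed reading of the slide's diagram, not something the transcript confirms.

```python
from collections import deque

# Toy (subject, predicate, object) triples. Instance names are invented and
# the edge directions are an assumption about the slide's diagram.
TRIPLES = [
    ("occ1", "rdf:type", "dwc:Occurrence"),
    ("occ1", "detected_during", "event1"),
    ("event1", "happened_at", "loc1"),
    ("occ1", "basis_for", "ident1"),
    ("ident1", "to_taxon", "taxon1"),
    ("sample1", "derived_from", "occ1"),
]


def reachable(start, triples):
    """Breadth-first traversal following triples from subject to object."""
    outgoing = {}
    for subject, predicate, obj in triples:
        outgoing.setdefault(subject, []).append((predicate, obj))

    seen = {start}
    queue = deque([start])
    edges = []
    while queue:
        node = queue.popleft()
        for predicate, target in outgoing.get(node, []):
            edges.append((node, predicate, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen - {start}, edges


nodes, edges = reachable("occ1", TRIPLES)
for edge in edges:
    print(edge)
```

Starting from the occurrence, the traversal reaches the event, its location, the identification, and the taxon. `sample1` is not reached because the sketch only follows edges in the subject-to-object direction.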
  • Slide 20
  • RDFS-based inferencing. Example class: ENVO: Tropical Broadleaf Forest Biome. BENEFITS: enhanced searching along subsumption hierarchies (classes or properties); formalized descriptions.
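Searching along a subsumption hierarchy can be sketched as follows. The class hierarchy below is a toy one built around the slide's "Tropical Broadleaf Forest Biome" example; it is not the actual ENVO hierarchy, each class is assumed to have a single parent, and the dataset names are invented.

```python
# Toy subclass assertions (child -> parent). Illustrative only: this is not
# the real ENVO hierarchy, and each class is assumed to have one parent.
SUBCLASS_OF = {
    "tropical broadleaf forest biome": "broadleaf forest biome",
    "broadleaf forest biome": "forest biome",
    "forest biome": "terrestrial biome",
    "grassland biome": "terrestrial biome",
    "terrestrial biome": "biome",
}

# Invented datasets, each annotated with one class from the hierarchy.
DATASETS = {
    "plots_amazon": "tropical broadleaf forest biome",
    "plots_boreal": "forest biome",
    "plots_prairie": "grassland biome",
}


def superclasses(cls, subclass_of=SUBCLASS_OF):
    """Follow subclass links upward, which makes the relation transitive."""
    found = []
    while cls in subclass_of and subclass_of[cls] not in found:
        cls = subclass_of[cls]
        found.append(cls)
    return found


def search(query_class, datasets=DATASETS):
    """Return datasets annotated with query_class or any of its subclasses."""
    return sorted(
        name
        for name, cls in datasets.items()
        if cls == query_class or query_class in superclasses(cls)
    )


print(search("forest biome"))
print(search("tropical broadleaf forest biome"))
```

A query for "forest biome" also returns the dataset annotated with the more specific tropical class, which a plain keyword match on the annotation would miss.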
  • Slide 21
  • Observational data model. Implemented as an OWL-DL ontology. Provides basic concepts for describing observations. Specific extension points for domain-specific terms. [Diagram: classes Entity, Characteristic, Observation, Measurement (+ precision : decimal; + method : anyType), Protocol, Standard, Value, Context, ObservedEntity; cardinalities 1..1, 0..1, *.]
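The concepts named in this diagram can be approximated with plain Python dataclasses. This is a rough sketch, not the OWL-DL ontology itself: the class and field names follow the slide, but which fields are optional, how context is represented, and the example values are all assumptions.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Any, List, Optional


@dataclass
class Entity:
    name: str  # the thing observed, e.g. a tree or a plot


@dataclass
class Characteristic:
    name: str  # the property measured, e.g. height


@dataclass
class Standard:
    name: str  # the unit or reference scale, e.g. meter


@dataclass
class Protocol:
    name: str  # the procedure followed


@dataclass
class Measurement:
    characteristic: Characteristic
    standard: Standard
    value: Any
    precision: Optional[Decimal] = None  # "precision : decimal" on the slide
    method: Any = None                   # "method : anyType" on the slide
    protocol: Optional[Protocol] = None  # assumed optional


@dataclass
class Observation:
    entity: Entity
    measurements: List[Measurement] = field(default_factory=list)
    # Context is modeled here as other observations (an assumption).
    context: List["Observation"] = field(default_factory=list)


def describe(obs):
    """Render an observation as short human-readable statements."""
    lines = [
        f"{obs.entity.name} {m.characteristic.name} = {m.value} {m.standard.name}"
        for m in obs.measurements
    ]
    lines += [
        f"{obs.entity.name} observed within {ctx.entity.name}"
        for ctx in obs.context
    ]
    return lines


# Invented example: a tree height measured inside a forest plot.
plot = Observation(entity=Entity("forest plot"))
height = Measurement(
    characteristic=Characteristic("height"),
    standard=Standard("meter"),
    value=Decimal("12.5"),
    precision=Decimal("0.1"),
    method="clinometer",
)
tree = Observation(entity=Entity("tree"), measurements=[height], context=[plot])
print(describe(tree))
```

Keeping entity, characteristic, and standard as separate objects is what gives the model its extension points: domain terms can be swapped in for each without changing the structure.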
  • Slide 22
  • Semantic annotation: attribute mappings.
  • Slide 23
  • Open Science. "Scientists should communicate the data they collect and the models they create, to allow free and open access, and in ways that are intelligible, assessable and usable for other specialists in the same or linked fields wherever they are in the world. Where data justify it, scientists should make them available in an appropriate data repository." Science as an Open Enterprise, The Royal Society Science Policy Centre report 02/12
  • Slide 24
  • Why Open Science now? Technology is available to do it (Internet + Web + Semantics + FLOPS). Growing politicization of science: need for transparency. Importance of large-scale/interdisciplinary science. Efficiencies in re-using or sharing available data and code. A return to the fundamental premise of science: objective, repeatable, transparent, general.
  • Slide 25
  • Open Science. Open Data: repositories (e.g., NASA DAACs, NSF's DataONE). Open Source: code and algorithms (e.g., Python, R). Open Access: journals (e.g., PLoS). Open Notebook: blog++ (e.g., iPython).
  • Slide 26
  • Open Data: rapid, highly affordable access to ALL the data supporting scientific findings.
  • Slide 27
  • Open Source: easy, fast, low-to-no (cost) barriers to languages, code, libraries/packages, algorithms, and frameworks for accomplishing analyses. Multi-platform, scalable.
  • Slide 28
  • Open Access (OA): rapid, highly affordable access to the latest scientific findings. Issues: peer-review process; copyright (IP issues); costs.
  • Slide 29
  • The (sad) status quo. Researchers are still struggling to discover relevant datasets, and to access and integrate them. It's getting more difficult as the volume, diversity, and complexity of data increase, and data quality is always a concern!!!
  • Slide 30
  • Steps towards global-scale, Open Biodiversity Science? Aggregators (coordinated, international): service providers who assemble and harmonize distributed data for the scientific community. Better services and interfaces: requires standardization of metadata and semantics; overcome limitations of desktop tools/frameworks; all FOSS. Cross-scale, cross-data-type integration: genomic, organismal, observational/ecological, sensors, remote-sensing. Must train researchers in the use of these new tools, data types, and frameworks!!